New convergences between Natural Language Processing and Knowledge Engineering in the era of the Semantic Web

Keynote

An illustration with the extraction and representation of semantic relations
 
Designing the theories and systems required to process natural language (either to "understand" written or spoken human language or to generate new text) and those required to perform reasoning and solve problems have been two main research lines right from the beginning of AI. Today, they still form two separate research communities: Natural Language Processing (NLP) and Knowledge Engineering (KE). Although these activities are closely related in human cognition and activity, they have long been studied as two separate tasks that share a common background in knowledge representation and logic. Various limitations of knowledge-based systems led AI research to investigate alternative approaches, and to privilege efficiency and user assistance in performing a task over cognitive plausibility and full autonomy of the systems. As a consequence, although early work in NLP and in KE relied on symbolic representations and algorithms strongly inspired by human behaviour, research now involves a larger variety of techniques, including numerical machine learning, data analysis, constraint programming and planning, so that intelligent systems no longer mimic human performance.
 
A parallel movement accompanies this evolution: a (natural) stronger convergence between NLP and KE. On the one hand, analyzing language requires representing what has been understood and reproducing some of the reasoning and tasks carried out when dealing with language, such as resolving references, understanding the often implicit relations between sentences, disambiguating the meaning of words according to their context, recognizing entities and opinions, inferring more knowledge about the situation, etc. On the other hand, building knowledge bases from human expertise relies on language, which conveys most of the required knowledge. This process has a high cost, which can be reduced by extracting part of this knowledge from text. Moreover, in many situations it may not be relevant to rely only on persons when part of the knowledge is either tacit or already explicit in documents. This convergence has been stimulated by the availability of ever larger sets of data and of huge amounts of textual documents in companies and, above all, on the web. It increased with the advances of mining algorithms in information extraction on the one hand, and, on the other, with the fine-grained understanding of natural language brought by linguistics, mainly computational linguistics, where corpus-based approaches have proved very efficient for better understanding language. Further influences and advances from research in terminology, knowledge representation and machine learning draw a new landscape where the frontiers between these "disciplines" are blurred. Current intelligent applications require searching for information on the basis of its meaning, its content and its metadata, which combines information retrieval and language analysis, but also reasoning over background knowledge, which means access to a large knowledge base. Knowledge-rich applications require language understanding and production to explore more knowledge sources or to better interact with the user, and so on. In short, the notion of "semantics" shared by the two research communities is central, and the advent of the Semantic Web is a new opportunity to investigate these topics.
 
Indeed, the Semantic Web promotes a better use of the large resources available on the web, in particular thanks to machine learning, data classification or clustering, and other mining techniques. The Semantic Web working groups integrated much previous work on knowledge representation and reasoning to define standard formats that make it easier to share, store, query, compare, link, analyse or process knowledge and data, whether textual or numerical. This context is favorable to strengthening the relations between studies of language and knowledge, encompassing knowledge representation; knowledge storage and linking; building, sharing and reusing semantic resources; using language to query or to exchange data; learning knowledge from text and data to better understand, process or produce language, and so on.
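By way of a minimal illustration, the sketch below (in Python with the rdflib library; the namespace and the two concepts are invented for the example) encodes a single semantic relation with the standard SKOS vocabulary and retrieves it with a SPARQL query, showing how such standard formats make knowledge easy to store, share and query:

    # Minimal sketch: one semantic relation encoded with a W3C standard
    # vocabulary (SKOS) and queried with SPARQL. The namespace and the
    # two concepts are invented for the example.
    from rdflib import Graph, Namespace
    from rdflib.namespace import SKOS

    EX = Namespace("http://example.org/onto#")  # hypothetical namespace

    g = Graph()
    g.bind("skos", SKOS)
    # "Carburettor" is a narrower concept than "EnginePart".
    g.add((EX.Carburettor, SKOS.broader, EX.EnginePart))

    # Any SPARQL-aware tool can now link to and query such data.
    for narrow, broad in g.query("SELECT ?n ?b WHERE { ?n skos:broader ?b . }"):
        print(narrow, "has broader concept", broad)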
 
The extraction of semantic relations is one of the research challenges in NLP and KE currently impacted by the Semantic Web. Semantic relations are a nice case study, not only because extracting them from text is still considered a difficult task, but also because linking data on the Semantic Web has turned it into one of the important challenges to be addressed, and finally because the evolution of extraction techniques is a good example of the shift that occurs when exploiting the web. These techniques evolved from linguistically grounded, pattern-based approaches matched against domain-specific corpora (a toy example of which is sketched below) to deep learning methods applied to the whole web. An unexpected feature of this impact is a shift in the disciplines collaborating in this field: whereas NLP and KE researchers used to collaborate with linguists and terminologists, they now work with machine learning researchers and mathematicians. So far, there is no real cross-disciplinary fertilization between these two trends. For instance, the large body of work carried out in terminology and computational linguistics to evaluate, define and improve pattern-based approaches is rarely taken into account in machine learning for NLP or in ontology learning. Symmetrically, relation extraction using machine learning still has little impact on linguistic analyses and on the design of lexical resources.
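As a concrete reminder of what the earlier, linguistically grounded approaches look like, here is a minimal sketch of a Hearst-style lexico-syntactic pattern for hypernym extraction (the sentence and the deliberately naive regular expression are invented for the example):

    import re

    # Minimal sketch of a Hearst-style pattern: "X such as Y" suggests
    # that X is a hypernym of Y. Real systems match richer patterns over
    # part-of-speech-tagged or parsed domain-specific corpora and handle
    # coordination ("... and hypertension"), which this toy regex ignores.
    PATTERN = re.compile(r"(\w+) such as (\w+)")

    sentence = "The records mention diseases such as diabetes and hypertension."
    for hypernym, hyponym in PATTERN.findall(sentence):
        print(f"hypernym({hypernym}, {hyponym})")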
 
In my talk, I will first sketch the historical relations between NLP and KE, highlighting research challenges at their confluence. I will then focus on semantic relations to illustrate how the vision, formats and approaches proposed by the Semantic Web impact several dimensions of this research topic: their extraction from text, their representation in knowledge bases and lexical resources, and their use for richer semantic annotations. I will thus give an overview of the current approaches to identifying semantic relations in text, with a special focus on the ways pattern-based solutions have evolved since the early work, on machine learning methods, and on the complementarity of various techniques to support this task. I will then report on several representation models for semantic relations that associate the conceptual part with linguistic information and lexical entries, one instance of which is sketched below. I will finally present some models for rich annotations that include relations.
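One concrete instance of such representation models is the W3C OntoLex-Lemon vocabulary, which links a lexical entry, through its sense, to an ontology concept. The following sketch (again with rdflib; the entry and concept identifiers under EX are invented) illustrates the idea:

    from rdflib import BNode, Graph, Literal, Namespace
    from rdflib.namespace import RDF

    # Sketch following the W3C OntoLex-Lemon model: a lexical entry
    # (linguistic side) is linked, through a sense, to an ontology
    # concept (conceptual side). Identifiers under EX are invented.
    ONTOLEX = Namespace("http://www.w3.org/ns/lemon/ontolex#")
    EX = Namespace("http://example.org/onto#")

    g = Graph()
    g.bind("ontolex", ONTOLEX)
    entry, form, sense = EX.lex_carburettor, BNode(), BNode()
    g.add((entry, RDF.type, ONTOLEX.LexicalEntry))
    g.add((entry, ONTOLEX.canonicalForm, form))
    g.add((form, ONTOLEX.writtenRep, Literal("carburettor", lang="en")))
    g.add((entry, ONTOLEX.sense, sense))
    g.add((sense, ONTOLEX.reference, EX.Carburettor))  # the denoted concept

    print(g.serialize(format="turtle"))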
I will conclude by stressing the need to better capitalize on all the experiments and tools developed so far, in particular by sharing not only representations and data but also patterns and learning methods, and by investigating more systematically how existing techniques can be combined in a single platform and mutually benefit from each other's results.
