Deep Text Analytics with Knowledge Graphs and Machine Learning

To understand a text in its deeper meaning rather than only superficially, it is not enough to recognize individual entities in isolation; the recognized entities must also be embedded in their contexts. Word embeddings based on Word2vec and co-occurrence analyses are statistical methods of corpus analysis that are often used to compute the contexts of terms and phrases automatically.
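
As a rough illustration of the statistical side, the following minimal sketch counts co-occurrences within a small token window. The function name, the window size, and the three-sentence toy corpus are assumptions made for this example only; a real analysis would use a proper tokenizer and a much larger corpus.

```python
from collections import Counter


def cooccurrence_counts(sentences, window=2):
    """Count how often two distinct tokens occur within `window` positions of each other."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, left in enumerate(tokens):
            # Pair the current token with the next `window` tokens.
            for right in tokens[i + 1 : i + 1 + window]:
                if left != right:
                    # Sort the pair so that (a, b) and (b, a) share one key.
                    counts[tuple(sorted((left, right)))] += 1
    return counts


# Toy corpus, invented for illustration.
corpus = [
    "aspirin reduces fever",
    "aspirin reduces pain",
    "ibuprofen reduces fever",
]
counts = cooccurrence_counts(corpus)
```

The counts capture only what the corpus states: "aspirin" and "ibuprofen" never co-occur here, so nothing in the statistics relates the two terms directly.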

However, these methods quickly reach their limits when text classification requires more complex background knowledge that does not occur explicitly in the text corpus and therefore cannot be derived from it. Typical examples are medical texts, legal contracts, and technical documentation, which often have to be classified on the basis of if-then rules whose conditions are themselves multidimensional. In a rule-based system that is to perform this task automatically, this complexity means that a very large number of combinations of input parameters has to be stored and managed, which often leads to problems and errors during ongoing maintenance and whenever the system and its rules are extended.
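
To make the growth in combinations concrete, here is a small worked count. The dimension names and value counts are hypothetical and chosen only for illustration; they do not come from a real rule base.

```python
from math import prod

# Hypothetical condition dimensions for classifying a contract
# (illustrative assumptions, not taken from a real system).
dimensions = {
    "contract_type": 12,
    "jurisdiction": 30,
    "party_role": 5,
    "risk_level": 4,
}

# A flat rule table that enumerates every combination needs one entry
# per element of the Cartesian product of all dimensions.
flat_rule_count = prod(dimensions.values())

# Adding one more dimension multiplies the table size again
# (assumed here: 8 document languages).
with_language = flat_rule_count * 8
```

Under these assumptions, four modest dimensions already yield 7,200 combinations, and a fifth raises the count to 57,600.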

Our method of semantic text analysis transforms all input data, including unstructured texts, into semantic knowledge graphs based on RDF. Using entity linking techniques based on NLP and ML methods, any text expressed as an RDF graph can be embedded into a larger context, a domain-specific knowledge graph. The Shapes Constraint Language (SHACL), a World Wide Web Consortium (W3C) specification for validating graph-based data against a set of conditions, is then used to automatically identify those texts that match an information need originally formulated in natural language.
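
The idea can be sketched in plain Python without any RDF tooling. This is a deliberately simplified stand-in, not the actual implementation: tuples play the role of RDF triples, the `conforms` function plays the role of a SHACL shape, and all identifiers (`ex:doc1`, `ex:mentions`, the concept names, both function names) are hypothetical. A real setup would hold the data in an RDF store and run the condition through a SHACL processor.

```python
# Domain knowledge graph: concept hierarchy as (subject, predicate, object) tuples.
KNOWLEDGE_GRAPH = {
    ("ex:Aspirin", "skos:broader", "ex:NSAID"),
    ("ex:Ibuprofen", "skos:broader", "ex:NSAID"),
    ("ex:NSAID", "skos:broader", "ex:Analgesic"),
}

# Output of an assumed entity linking step: which document mentions which concept.
DOCUMENT_GRAPH = {
    ("ex:doc1", "ex:mentions", "ex:Aspirin"),
    ("ex:doc2", "ex:mentions", "ex:Insulin"),
}


def broader_closure(concept, graph):
    """Return the concept plus every concept reachable via skos:broader."""
    seen, stack = set(), [concept]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        stack.extend(o for s, p, o in graph if s == current and p == "skos:broader")
    return seen


def conforms(doc, required_concept, doc_graph, knowledge_graph):
    """Shape-like check: the document must mention at least one concept
    that is equal to, or narrower than, `required_concept`."""
    mentioned = (o for s, p, o in doc_graph if s == doc and p == "ex:mentions")
    return any(required_concept in broader_closure(c, knowledge_graph) for c in mentioned)


# Information need (assumed): "documents that deal with analgesics".
docs = sorted({s for s, _, _ in DOCUMENT_GRAPH})
matching = [d for d in docs if conforms(d, "ex:Analgesic", DOCUMENT_GRAPH, KNOWLEDGE_GRAPH)]
```

In this sketch `ex:doc1` matches even though it never mentions analgesics explicitly, because the link from aspirin to analgesic comes from the knowledge graph rather than from the text. The condition is stated once against the concept hierarchy instead of being enumerated per drug.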

Essential advantages of this approach are:

  • The modeling language and the rules are based on the semantics of the knowledge domain, which makes it easier to involve domain experts, who rarely have knowledge of rule languages.
  • The graph-based approach eases the problem of the large number of combinatorial possibilities, which are usually difficult to maintain in conventional rule languages.
  • Synergies are achieved by using knowledge graphs as the basis for text mining (entity extraction, entity linking and disambiguation) as well as for the formulation of rules based on corresponding RDF shapes.
  • An appropriate governance model can be implemented more easily because the overall task can be better broken down into individual steps. This allows semantic knowledge models to be formulated and built independently of the rules and shapes.
  • The method is completely based on open standards of the W3C. This helps users build a semantic AI infrastructure without the risk of becoming dependent on vendors.

We will discuss this approach on the basis of several use cases, including the MATRIO Data Cleanup method.