“The Less Is More” for Text Classification

Poster & Demo

Text classification [2,5] is gaining attention due to the availability of large amounts of text data, such as blog articles and news data.
Traditional text classification methods [1] use all the words present in a given
text to represent a document. However, the large number of words in a document
can greatly increase the complexity of the classification task
and make it very costly. Moreover, long natural language
documents usually include a wide variety of information related to the topic
of the document. Encyclopedic articles in particular, such as the biography of a scientist,
contain detailed information about a person. In such articles, words or entities that are not
related to the main topic (or category) of the article tend to appear after the first paragraph (or the first few sentences). We assume that the most informative part of such an article is its first few sentences. In other words, instead of considering the complete document, only its beginning needs to be used to classify the document accurately.
In this study, we design a knowledge-based text classification method that
can classify a document using only the first few sentences of the
article. Since the length of the considered text is rather limited, ambiguous words might lead to inaccurate classification results. Therefore, we represent a document by its entities instead of its words. In addition, entities and categories are embedded into a common vector space, which allows the semantic similarity between them to be captured. Moreover, the approach does not require any labeled data as a prerequisite. Instead, it relies on the semantic similarity between a set of predefined categories and a given document to determine which category the document belongs to.
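
The idea can be illustrated with a minimal sketch. This is not the authors' implementation: the three-dimensional vectors are invented toy values, the dictionary lookup in find_entities is a hypothetical stand-in for real entity linking, and averaging the entity vectors into a single document vector is an assumption made only for this example. It shows how a document could be assigned to the most similar predefined category using only its first sentences and no labeled training data.

```python
import math
import re

# Toy embeddings invented for illustration only. A real system would load
# entity and category vectors that were trained into a common vector space.
ENTITY_VECTORS = {
    "Albert Einstein": [0.9, 0.1, 0.0],
    "physics": [0.8, 0.2, 0.1],
    "violin": [0.1, 0.9, 0.0],
}
CATEGORY_VECTORS = {
    "Science": [1.0, 0.0, 0.0],
    "Music": [0.0, 1.0, 0.0],
    "Sports": [0.0, 0.0, 1.0],
}


def first_sentences(text, k=2):
    """Return the first k sentences, using a naive split after ., ! or ?."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return " ".join(sentences[:k])


def find_entities(text):
    """Hypothetical stand-in for entity linking: case-insensitive lookup."""
    lowered = text.lower()
    return [name for name in ENTITY_VECTORS if name.lower() in lowered]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def classify(text, k=2):
    """Pick the category closest to the entities found in the first k sentences."""
    entities = find_entities(first_sentences(text, k))
    if not entities:
        return None  # no known entity, so no decision is made
    dim = len(next(iter(CATEGORY_VECTORS.values())))
    # Assumption: the document vector is the mean of its entity vectors.
    doc_vec = [
        sum(ENTITY_VECTORS[e][i] for e in entities) / len(entities)
        for i in range(dim)
    ]
    return max(CATEGORY_VECTORS, key=lambda c: cosine(doc_vec, CATEGORY_VECTORS[c]))


article = (
    "Albert Einstein was a theoretical physicist. "
    "He developed the theory of relativity, a pillar of modern physics. "
    "Later in life he often played the violin."
)
print(classify(article))
```

With k=2 the sentence about the violin is never read, so the off-topic entity does not influence the decision.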

Speakers:

Harald Sack

Senior Researcher

FIZ Karlsruhe – Leibniz Institute for Information Infrastructure
https://www.fiz-karlsruhe.de/

Harald Sack is Professor of Information Service Engineering at FIZ Karlsruhe – Leibniz Institute for Information Infrastructure and at Karlsruhe Institute of Technology (KIT). After graduating in computer science from the University of the Federal Forces Munich, he worked as a network engineer and project manager in the signal intelligence corps of the German Air Force.

Lei Zhang

FIZ Karlsruhe – Leibniz Institute for Information Infrastructure
https://www.fiz-karlsruhe.de/

Maria Koutraki

FIZ Karlsruhe – Leibniz Institute for Information Infrastructure
https://www.fiz-karlsruhe.de/

Rima Türker

PhD student at Karlsruhe Institute of Technology (KIT) & FIZ Karlsruhe

FIZ Karlsruhe – Leibniz Institute for Information Infrastructure
https://www.fiz-karlsruhe.de/
