“The Less Is More” for Text Classification

Poster & Demo

Text classification [2,5] is gaining more attention due to the availability of a huge amount of text data, such as blog articles and news data.
Traditional text classification methods [1] use all the words present in a given
text to represent a document. However, the large number of words in
documents can greatly increase the complexity of the classification task
and thus make it very costly. Moreover, long natural language
documents usually include a wide variety of information related to the topic
of the document. Encyclopedic articles in particular, such as an article about the life of a scientist,
contain detailed information about a person. In such articles, words or entities
that are not related to the main topic (or category) of the article tend to appear after the first paragraph (or the first few sentences). We assume that the most informative part of such articles is the first few sentences. In other words, instead of the complete document, only its beginning can be used to classify the document accurately.
In this study, we design a Knowledge-Based Text Classification method that
can classify a document using only the first few sentences of the
article. Since the considered text is rather short, ambiguous words might lead to inaccurate classification results. Therefore, we represent a document by its entities instead of its words. In addition, entities and categories are embedded into a common vector space, which allows capturing the semantic similarity between them. Moreover, the approach does not require any labeled data. Instead, it relies on the semantic similarity between a set of predefined categories and a given document to determine which category the document belongs to.
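
The idea of assigning a category by semantic similarity in a shared vector space can be illustrated with a minimal sketch. It is not the method from this study: the toy 3-dimensional vectors, the names `EMBEDDINGS`, `CATEGORIES`, `cosine`, and `classify`, the averaging of entity vectors into one document vector, and the use of cosine similarity are all assumptions made for illustration. Entity linking on the first few sentences is assumed to have already produced the list of entities.

```python
import math

# Hypothetical toy embeddings: entities and categories share one vector space.
# Real embeddings would be learned from a knowledge base; these are made up.
EMBEDDINGS = {
    "Albert_Einstein": [0.9, 0.1, 0.0],
    "Physics": [0.8, 0.2, 0.1],
    "Nobel_Prize": [0.6, 0.3, 0.2],
    "Violin": [0.1, 0.9, 0.1],
}
CATEGORIES = {
    "Science": [1.0, 0.0, 0.0],
    "Music": [0.0, 1.0, 0.0],
    "Sports": [0.0, 0.0, 1.0],
}


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def classify(entities, embeddings=EMBEDDINGS, categories=CATEGORIES):
    """Return the category closest to the mean vector of the known entities.

    Entities without an embedding are skipped; None is returned if no
    entity is known. No labeled training documents are involved.
    """
    vectors = [embeddings[e] for e in entities if e in embeddings]
    if not vectors:
        return None
    # Assumption: the document vector is the plain average of entity vectors.
    doc = [sum(col) / len(vectors) for col in zip(*vectors)]
    return max(categories, key=lambda c: cosine(doc, categories[c]))


if __name__ == "__main__":
    # Entities assumed to be found in the opening sentences of a biography.
    print(classify(["Albert_Einstein", "Physics", "Nobel_Prize"]))
```

Because the decision only compares the document vector with the predefined category vectors, adding a new category in this sketch means adding one more vector to `CATEGORIES`, without any retraining.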

Speakers: