Get Your Hands Dirty: Evaluating Word2Vec Models for Patent Data

Poster & Demo

Patent search systems let users formulate complex queries by combining search terms with Boolean and other operators, such as proximity operators and wildcards, in order to find relevant patents.
This widely adopted approach relies on exact matching, which makes it difficult to identify and analyze relevant patents efficiently, because the search terms often do not match the terminology used by the inventors. Another problem is the large number of relevant hits produced by weekly and monthly updates of patent applications and grants. Although some semantic search systems for patents based on latent semantic analysis have been implemented as black-box systems in the past, word embeddings, which have been applied successfully to generate semantic representations of text, have rarely been employed and evaluated on a (large) patent corpus. The work described herein evaluates semantic representations for patent data by comparing a pre-trained general model with an adapted word embedding model derived from a patent corpus, in order to contribute to a multitude of semantic analysis tasks for patents such as similarity search, content analysis, and entity linking.
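
To make the contrast between exact matching and embedding-based similarity concrete, the following minimal sketch may help. It is not the system evaluated in this work: the three-dimensional vectors are hypothetical toy values invented for illustration, and the helper names (exact_match, cosine, rank_by_similarity) are assumptions. Real word2vec vectors would be learned from a corpus and typically have several hundred dimensions.

```python
import math

# Hypothetical toy vectors, invented for illustration only.
# Real word2vec vectors are learned from text, not hand-written.
TOY_VECTORS = {
    "car": [0.9, 0.1, 0.0],
    "vehicle": [0.8, 0.2, 0.1],
    "automobile": [0.85, 0.15, 0.05],
    "polymer": [0.0, 0.1, 0.9],
}


def exact_match(query_term, document_terms):
    """Boolean-style lookup: true only if the literal term occurs."""
    return query_term in document_terms


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_by_similarity(query_term, vectors):
    """Rank all other vocabulary terms by cosine similarity to the query."""
    query = vectors[query_term]
    scores = [
        (term, cosine(query, vec))
        for term, vec in vectors.items()
        if term != query_term
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# A document that speaks of a "vehicle" is missed by an exact search for "car".
document_terms = {"vehicle", "polymer"}
found_exact = exact_match("car", document_terms)

# The embedding view still places "vehicle" and "automobile" close to "car".
ranking = rank_by_similarity("car", TOY_VECTORS)
```

With these toy values, the exact search for "car" misses the document, while the similarity ranking places the two related terms well ahead of the unrelated one. Whether a general pre-trained model or a patent-specific model yields the more useful neighbours is the question the evaluation addresses.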