Distributed Representations of Sentences and Documents
Venue
International Conference on Machine Learning (2014)
Publication Year
2014
Authors
Quoc V. Le, Tomas Mikolov
BibTeX
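A minimal entry assembled from the metadata above (the citation key is chosen by convention):

@inproceedings{le2014distributed,
  title     = {Distributed Representations of Sentences and Documents},
  author    = {Le, Quoc V. and Mikolov, Tomas},
  booktitle = {International Conference on Machine Learning},
  year      = {2014}
}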
Abstract
Many machine learning algorithms require the input to be represented as a
fixed-length feature vector. When it comes to texts, one of the most common
fixed-length features is bag-of-words. Despite their popularity, bag-of-words
features have two major weaknesses: they lose the ordering of the words and they
also ignore semantics of the words. For example, “powerful,” “strong” and “Paris”
are equally distant. In this paper, we propose Paragraph Vector, an unsupervised
algorithm that learns fixed-length feature representations from variable-length
pieces of texts, such as sentences, paragraphs, and documents. Our algorithm
represents each document by a dense vector which is trained to predict words in the
document. Its construction gives our algorithm the potential to overcome the
weaknesses of bag-of-words models. Empirical results show that Paragraph Vectors
outperform bag-of-words models as well as other techniques for text
representations. Finally, we achieve new state-of-the-art results on several text
classification and sentiment analysis tasks.
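As a concrete illustration (not part of the paper itself), the gensim library implements Paragraph Vector as Doc2Vec. The sketch below, using made-up toy documents, shows the core idea from the abstract: each document gets a dense fixed-length vector trained to predict its words, and a vector can be inferred for an unseen document.

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Tag each variable-length document with a unique ID;
# the model learns one dense vector per tag.
docs = [
    TaggedDocument(words=["the", "movie", "was", "powerful"], tags=["doc0"]),
    TaggedDocument(words=["a", "strong", "performance"], tags=["doc1"]),
]

# dm=1 selects the distributed-memory variant: the document vector is
# combined with context word vectors to predict the next word.
model = Doc2Vec(docs, vector_size=50, window=2, min_count=1, epochs=40, dm=1)

# Fixed-length representation of a training document.
v0 = model.dv["doc0"]

# Infer a vector for an unseen document: word weights stay frozen and
# only the new document vector is optimized.
v_new = model.infer_vector(["a", "powerful", "film"])

Unlike bag-of-words features, the learned vectors place semantically related words such as "powerful" and "strong" closer together than unrelated ones such as "Paris", which is the weakness the abstract highlights.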
