Grounded compositional semantics for finding and describing images with sentences
Venue
Transactions of the Association for Computational Linguistics (2013, to appear)
Publication Year
2013
Authors
Richard Socher, Andrej Karpathy, Quoc V. Le, Chris D. Manning, Andrew Y. Ng
Abstract
Previous work on Recursive Neural Networks (RNNs) shows that these models can
produce compositional feature vectors for accurately representing and classifying
sentences or images. However, the sentence vectors of previous models cannot
accurately represent visually grounded meaning. We introduce the DT-RNN model which
uses dependency trees to embed sentences into a vector space in order to retrieve
images that are described by those sentences. Unlike previous RNN-based models
which use constituency trees, DT-RNNs naturally focus on the action and agents in a
sentence. They are better able to abstract from the details of word order and
syntactic expression. DT-RNNs outperform other recursive and recurrent neural
networks, kernelized CCA and a bag-of-words baseline on the tasks of finding an
image that fits a sentence description and vice versa. They also give more similar
representations to sentences that describe the same image.
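The core idea of the DT-RNN is to compose word vectors bottom-up along a dependency tree so that the root vector represents the sentence. Below is a minimal illustrative sketch of that kind of composition, not the paper's exact formulation: the paper uses dependency-position-specific composition matrices and a particular normalization, whereas this sketch assumes a single shared child matrix and simple averaging, and all names (`children`, `W_word`, `W_child`, `node_vector`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 50  # embedding size (hypothetical)

# Toy dependency tree for "a man rides a horse": node 2 ("rides") is the root,
# with dependents 1 ("man") and 4 ("horse"), which govern their determiners.
children = {2: [1, 4], 1: [0], 4: [3]}
words = ["a", "man", "rides", "a", "horse"]
word_vecs = {i: rng.normal(size=dim) for i in range(len(words))}

# Shared composition parameters (stand-ins for the position-specific
# matrices used in the paper).
W_word = rng.normal(scale=0.1, size=(dim, dim))
W_child = rng.normal(scale=0.1, size=(dim, dim))
b = np.zeros(dim)

def node_vector(i):
    """Compose node i's vector from its word vector and its dependents."""
    kids = children.get(i, [])
    h = W_word @ word_vecs[i] + b
    for j in kids:
        h += W_child @ node_vector(j)      # accumulate child contributions
    return np.tanh(h / (1 + len(kids)))    # average contributions, then squash

sentence_vec = node_vector(2)  # the root vector represents the whole sentence
print(sentence_vec.shape)      # (50,)
```

In the paper, sentence vectors produced this way are mapped into a common space with image feature vectors, so that matching sentence-image pairs lie close together and retrieval in either direction reduces to nearest-neighbor search.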
