DeViSE: A Deep Visual-Semantic Embedding Model
Venue
Neural Information Processing Systems (NIPS)
Publication Year
2013
Authors
Andrea Frome, Greg Corrado, Jonathon Shlens, Samy Bengio, Jeffrey Dean, Marc’Aurelio Ranzato, Tomas Mikolov
BibTeX
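An entry assembled from the metadata fields above; the citation key and booktitle form are illustrative, not copied from an official source:

@inproceedings{frome2013devise,
  title     = {DeViSE: A Deep Visual-Semantic Embedding Model},
  author    = {Frome, Andrea and Corrado, Greg and Shlens, Jonathon and Bengio, Samy and Dean, Jeffrey and Ranzato, Marc’Aurelio and Mikolov, Tomas},
  booktitle = {Advances in Neural Information Processing Systems (NIPS)},
  year      = {2013}
}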
Abstract
Modern visual recognition systems are often limited in their ability to scale to
large numbers of object categories. This limitation is in part due to the
increasing difficulty of acquiring sufficient training data in the form of labeled
images as the number of object categories grows. One remedy is to leverage data
from other sources – such as text data – both to train visual models and to
constrain their predictions. In this paper we present a new deep visual-semantic
embedding model trained to identify visual objects using both labeled image data
and semantic information gleaned from unannotated text. We demonstrate that
this model matches state-of-the-art performance on the 1000-class ImageNet object
recognition challenge while making more semantically reasonable errors, and also
show that the semantic information can be exploited to make predictions about tens
of thousands of image labels not observed during training. Semantic knowledge
improves such zero-shot predictions, achieving hit rates of up to 18% across
thousands of novel labels never seen by the visual model.
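To make the idea concrete: DeViSE maps deep visual features into a pre-trained word-embedding space via a learned linear transformation, trains that map with a hinge rank loss against the label word vectors, and performs zero-shot prediction by ranking all label vectors (including never-seen ones) by similarity to the projected image. The sketch below is a minimal illustration of that scheme, not the paper's implementation; all names, dimensions, and the margin value are assumptions.

import numpy as np

def devise_hinge_rank_loss(image_feat, M, word_vecs, label_idx, margin=0.1):
    """Hinge rank loss: the true label's word vector should score higher
    than every other label's vector by at least `margin`.

    image_feat : (d_img,) visual feature for one image
    M          : (d_txt, d_img) learned linear map into word-vector space
    word_vecs  : (n_labels, d_txt) fixed word embeddings, one per label
    label_idx  : index of the image's true label
    """
    projected = M @ image_feat              # project image into text space
    scores = word_vecs @ projected          # similarity to every label vector
    violations = np.maximum(0.0, margin - scores[label_idx] + scores)
    violations[label_idx] = 0.0             # the true label incurs no loss
    return violations.sum()

def zero_shot_predict(image_feat, M, word_vecs, k=5):
    """Zero-shot prediction: rank all label word vectors, including labels
    the visual model never saw during training, by similarity."""
    scores = word_vecs @ (M @ image_feat)
    return np.argsort(-scores)[:k]

# Toy usage with hypothetical sizes: 4096-d visual features, 500-d word vectors.
rng = np.random.default_rng(0)
img = rng.normal(size=4096)
M = rng.normal(scale=0.01, size=(500, 4096))
vocab = rng.normal(size=(20, 500))
print(devise_hinge_rank_loss(img, M, vocab, label_idx=3))
print(zero_shot_predict(img, M, vocab))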
