Show and tell: A neural image caption generator
Venue
Computer Vision and Pattern Recognition (2015)
Publication Year
2015
Authors
Oriol Vinyals, Alexander Toshev, Samy Bengio, Dumitru Erhan
Abstract
Automatically describing the content of an image is a fundamental problem in
artificial intelligence that connects computer vision and natural language
processing. In this paper, we present a generative model based on a deep recurrent
architecture that combines recent advances in computer vision and machine
translation and that can be used to generate natural sentences describing an image.
The model is trained to maximize the likelihood of the target description sentence
given the training image. Experiments on several datasets show the accuracy of the
model and the fluency of the language it learns solely from image descriptions. Our
model is often quite accurate, which we verify both qualitatively and
quantitatively. For instance, while the current state-of-the-art BLEU score (the
higher the better) on the Pascal dataset is 25, our approach yields 59, to be
compared to human performance around 69. We also show BLEU-1 score improvements on
Flickr30k, from 56 to 66, and on SBU, from 19 to 28. Lastly, on the newly released
COCO dataset, we achieve a BLEU-4 of 27.7, which is the current state-of-the-art.
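
As a rough illustration of the training objective mentioned in the abstract (maximizing the likelihood of the target description given the image), the sketch below sums the log-probabilities a model assigns to each correct word of a caption and negates the total to obtain a loss to minimize. The function names and the probability values are hypothetical, chosen only for illustration; they are not taken from the paper or from any released code.

```python
import math

def sentence_log_likelihood(step_probs):
    """Log-likelihood of one caption: the sum, over word positions, of the log of
    the probability the model gave to the correct word at that position
    (conditioned on the image and the preceding words)."""
    return sum(math.log(p) for p in step_probs)

def training_loss(batch_step_probs):
    """Negative log-likelihood summed over all (image, caption) pairs.
    Minimizing this is equivalent to maximizing the likelihood."""
    return -sum(sentence_log_likelihood(probs) for probs in batch_step_probs)

# Hypothetical per-word probabilities for two captions (illustrative values only).
probs_caption_a = [0.5, 0.25, 0.5]
probs_caption_b = [0.5, 0.5]

loss = training_loss([probs_caption_a, probs_caption_b])
print(f"negative log-likelihood: {loss:.4f}")
```

In a real system the per-word probabilities would come from the recurrent decoder's softmax output at each step, and the loss would be minimized with a gradient-based optimizer; both are outside the scope of this sketch.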
