What’s Cookin’? Interpreting Cooking Videos using Text, Speech and Vision
Venue
North American Chapter of the Association for Computational Linguistics – Human Language Technologies (NAACL HLT 2015) (to appear)
Publication Year
2015
Authors
Jonathan Malmaud, Jonathan Huang, Vivek Rathod, Nicholas Johnston, Andrew Rabinovich, Kevin Murphy
BibTeX
@inproceedings{malmaud2015whats,
  title     = {What's Cookin'? Interpreting Cooking Videos using Text, Speech and Vision},
  author    = {Malmaud, Jonathan and Huang, Jonathan and Rathod, Vivek and Johnston, Nicholas and Rabinovich, Andrew and Murphy, Kevin},
  booktitle = {Proceedings of NAACL HLT},
  year      = {2015}
}
Abstract
We present a novel method for aligning a sequence of instructions to a video of
someone carrying out a task. In particular, we focus on the cooking domain, where
the instructions correspond to the recipe. Our technique relies on an HMM to align
the recipe steps to the (automatically generated) speech transcript. We then refine
this alignment using a state-of-the-art visual food detector, based on a deep
convolutional neural network. We show that our technique outperforms simpler
baselines based on keyword spotting. It also enables interesting applications,
such as automatically illustrating recipes with keyframes, and searching within a
video for events of interest.
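
Illustrative Sketch
To make the alignment idea concrete, below is a minimal sketch of a monotonic HMM (Viterbi) alignment between recipe steps and transcript segments. This is not the authors' implementation: the emission model here is a simple word-overlap score, whereas the paper uses a richer model and further refines the alignment with a CNN-based visual food detector (not shown). All names (viterbi_align, emission_log_prob) and the p_stay parameter are illustrative.

import math

def tokenize(text):
    return set(text.lower().replace(",", "").replace(".", "").split())

def emission_log_prob(step, segment, smooth=1e-3):
    """Log-probability that a transcript segment was 'emitted' by a recipe
    step, approximated here by smoothed word overlap (Jaccard)."""
    s, t = tokenize(step), tokenize(segment)
    overlap = len(s & t) / max(len(s | t), 1)
    return math.log(overlap + smooth)

def viterbi_align(steps, segments, p_stay=0.7):
    """Monotonic Viterbi alignment: each transcript segment is assigned to
    one recipe step, and the step index never decreases over time."""
    n, m = len(steps), len(segments)
    log_stay, log_adv = math.log(p_stay), math.log(1 - p_stay)
    NEG = float("-inf")
    # dp[t][i]: best log-score with segment t aligned to step i.
    dp = [[NEG] * n for _ in range(m)]
    back = [[0] * n for _ in range(m)]
    dp[0][0] = emission_log_prob(steps[0], segments[0])  # start at step 0
    for t in range(1, m):
        for i in range(n):
            stay = dp[t - 1][i] + log_stay
            adv = dp[t - 1][i - 1] + log_adv if i > 0 else NEG
            best, prev = (stay, i) if stay >= adv else (adv, i - 1)
            dp[t][i] = best + emission_log_prob(steps[i], segments[t])
            back[t][i] = prev
    # Trace back from the best final state.
    i = max(range(n), key=lambda j: dp[m - 1][j])
    path = [i]
    for t in range(m - 1, 0, -1):
        i = back[t][i]
        path.append(i)
    return list(reversed(path))

if __name__ == "__main__":
    recipe = ["Chop the onions", "Fry the onions in butter", "Add the eggs and stir"]
    transcript = [
        "first we chop up some onions",
        "now fry them in a little butter",
        "keep frying until golden",
        "then add the eggs and stir gently",
    ]
    for t, i in enumerate(viterbi_align(recipe, transcript)):
        print(f"segment {t} -> step {i}: {recipe[i]}")

The monotonic transition structure (stay on the current step or advance to the next) encodes the assumption that recipe steps are narrated in order, which is what lets a simple dynamic program recover the alignment; keyword spotting, by contrast, treats each segment independently and cannot exploit this ordering.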
