Unsupervised Learning for Physical Interaction through Video Prediction
Venue
arXiv e-prints (2016)
Publication Year
2016
Authors
Chelsea Finn, Ian Goodfellow, Sergey Levine
Abstract
A core challenge for an agent learning to interact with the world is to predict how
its actions affect objects in its environment. Many existing methods for learning
the dynamics of physical interactions require labeled object information. However,
to scale real-world interaction learning to a variety of scenes and objects,
acquiring labeled data becomes increasingly impractical. To learn about physical
object motion without labels, we develop an action-conditioned video prediction
model that explicitly models pixel motion, by predicting a distribution over pixel
motion from previous frames. Because our model explicitly predicts motion, it is
partially invariant to object appearance, enabling it to generalize to previously
unseen objects. To explore video prediction for real-world interactive agents, we
also introduce a dataset of 50,000 robot interactions involving pushing motions,
including a test set with novel objects. In this dataset, accurate prediction of
videos conditioned on the robot's future actions amounts to learning a "visual
imagination" of different futures based on different courses of action. Our
experiments show that our proposed method not only produces more accurate video
predictions, but also more accurately predicts object motion, when compared to
prior methods.
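
To make the core mechanism more concrete, below is a minimal, illustrative sketch (not the authors' exact architecture) of how a predicted per-pixel distribution over motion can be composited with the previous frame to form the next frame. The function name, array shapes, displacement window, and use of NumPy are assumptions chosen for the example.

```python
import numpy as np

def apply_motion_distribution(prev_frame, motion_probs, radius=2):
    """Form the next frame as a per-pixel convex combination of shifted
    copies of the previous frame.

    prev_frame:   (H, W, C) float array, the last observed frame.
    motion_probs: (H, W, K) float array with K = (2*radius + 1)**2,
                  normalized over the last axis; entry k is the probability
                  that a pixel moved by the k-th displacement (dy, dx).
    """
    H, W, C = prev_frame.shape
    size = 2 * radius + 1
    assert motion_probs.shape == (H, W, size * size)

    next_frame = np.zeros_like(prev_frame)
    k = 0
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            # Shift the previous frame by (dy, dx); np.roll wraps at the
            # borders, which keeps the sketch short (edge padding would be
            # a more careful choice).
            shifted = np.roll(np.roll(prev_frame, dy, axis=0), dx, axis=1)
            # Weight each shifted copy by its per-pixel probability.
            next_frame += motion_probs[..., k:k + 1] * shifted
            k += 1
    return next_frame

# Toy usage: a prediction network would output `logits`; random ones here.
rng = np.random.default_rng(0)
prev = rng.random((64, 64, 3)).astype(np.float32)
logits = rng.normal(size=(64, 64, 25)).astype(np.float32)
probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)  # softmax
pred = apply_motion_distribution(prev, probs, radius=2)
print(pred.shape)  # (64, 64, 3)
```

In the paper itself, the motion distributions are produced by a recurrent convolutional network conditioned on the robot's actions; the sketch above only illustrates how a predicted distribution over pixel motion is applied to the previous frame, which is what makes the prediction partially invariant to object appearance.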
