Video recognition usually requires a large number of training samples, which are
expensive to collect. An alternative and cheap solution is to draw on the
large-scale images and videos available on the Web. With modern search engines, the
top-ranked images or videos are usually highly correlated with the query, implying the
potential to harvest the labeling-free Web images and videos for video recognition.
However, there are two key difficulties that prevent us from using the Web data
directly. First, they are typically noisy and may come from a domain completely
different from the one of interest (e.g. cartoons). Second, Web videos are usually
untrimmed and very long, with the query-relevant frames often hidden among
irrelevant ones. A question thus naturally arises: to what extent can
such noisy Web images and videos be utilized for labeling-free video recognition?
In this paper, we propose a novel approach that mutually votes for relevant Web
images and video frames, balancing two forces: aggressive matching and passive
video frame selection. We validate our approach on three large-scale
video recognition datasets.