YouTube-BB Dataset | Google Research

Download

The dataset is composed of four CSV files:

Classification in Video Segments - training set (27 MB, gzip-compressed)
Classification in Video Segments - validation set (3.4 MB, gzip-compressed)
Object Detection in Video Segments - training set (57 MB, gzip-compressed)
Object Detection in Video Segments - validation set (6.3 MB, gzip-compressed)

In the classification CSV files, each row describes one frame and the columns are organized as follows:

youtube_id - (string) the YouTube identifier of the video the segment was extracted from. One may view the selected video at http://youtu.be/${youtube_id}.
timestamp_ms - (integer) the time in milliseconds of the classified frame in the video.
class_id - (integer) a numeric identifier for the object class.
class_name - (string) a human-readable name for the object class.
object_presence - (string) whether or not the object is present in the frame ('present' or 'absent').
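
As an illustration of the column layout above, the following is a minimal Python sketch for reading a classification file. It assumes the CSV has no header row and that the columns appear in the order listed; neither is confirmed here, so check a few rows first. The function names and the sample rows (including the placeholder video id) are hypothetical.

```python
import csv
import gzip
import io

# Column order taken from the list above. The absence of a header row
# is an assumption; adjust if the real files start with column names.
CLASSIFICATION_COLUMNS = [
    "youtube_id",
    "timestamp_ms",
    "class_id",
    "class_name",
    "object_presence",
]


def read_classification_rows(fileobj):
    """Yield one dict per CSV row from an open text-mode file object."""
    reader = csv.DictReader(fileobj, fieldnames=CLASSIFICATION_COLUMNS)
    for row in reader:
        # Convert the two integer columns; the rest stay as strings.
        row["timestamp_ms"] = int(row["timestamp_ms"])
        row["class_id"] = int(row["class_id"])
        yield row


def read_classification_gzip(path):
    """Yield rows from a gzip-compressed classification CSV at `path`."""
    with gzip.open(path, mode="rt", newline="") as f:
        yield from read_classification_rows(f)


# Made-up sample rows, only to show the expected shape of the data.
sample = io.StringIO(
    "AAAAAAAAAAA,2000,0,person,present\n"
    "AAAAAAAAAAA,3000,0,person,absent\n"
)
rows = list(read_classification_rows(sample))
```

Reading through gzip.open avoids decompressing the file to disk first.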

In the detection CSV files, each row describes one frame and the columns are organized as follows:

youtube_id - same as above.
timestamp_ms - same as above.
class_id - same as above.
class_name - same as above.
object_id - (integer) an identifier of the object in the video. (see note below)
object_presence - same as above.
xmin - (float) a [0.0, 1.0] number indicating the left-most location of the bounding box in coordinates relative to the frame size.
xmax - (float) a [0.0, 1.0] number indicating the right-most location of the bounding box in coordinates relative to the frame size.
ymin - (float) a [0.0, 1.0] number indicating the top-most location of the bounding box in coordinates relative to the frame size.
ymax - (float) a [0.0, 1.0] number indicating the bottom-most location of the bounding box in coordinates relative to the frame size.
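
Because the box coordinates are relative to the frame size, they have to be scaled by the decoded frame's width and height before drawing or cropping. A minimal Python sketch follows; the function name, the rounding choice, and the 1280x720 frame size are assumptions made for illustration only.

```python
def to_pixel_box(xmin, xmax, ymin, ymax, frame_width, frame_height):
    """Scale relative [0.0, 1.0] box coordinates to integer pixel coordinates.

    Returns (left, top, right, bottom). Rounding to the nearest pixel is an
    illustrative choice, not something the dataset description prescribes.
    """
    left = round(xmin * frame_width)
    right = round(xmax * frame_width)
    top = round(ymin * frame_height)
    bottom = round(ymax * frame_height)
    return left, top, right, bottom


# Hypothetical values: a box covering the middle of a 1280x720 frame.
box = to_pixel_box(0.25, 0.75, 0.1, 0.5, frame_width=1280, frame_height=720)
```

Note the argument order: the CSV lists xmin, xmax, ymin, ymax, whereas many drawing libraries expect left, top, right, bottom.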

CAUTION: At most one object is tracked for each video segment, but multiple segments could be extracted from the same video. This means that the youtube_id is not enough to identify a video segment uniquely: it must be used in conjunction with the class_id and object_id values.
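
The caution above amounts to using the triple (youtube_id, class_id, object_id) as the segment key. A minimal Python sketch of grouping detection rows that way is shown below; the helper names and the sample rows are hypothetical, and rows are assumed to be dicts keyed by the column names listed earlier.

```python
from collections import defaultdict


def segment_key(row):
    """Build the key that identifies one video segment."""
    return (row["youtube_id"], int(row["class_id"]), int(row["object_id"]))


def group_by_segment(rows):
    """Group rows by segment key, with each group sorted by timestamp."""
    segments = defaultdict(list)
    for row in rows:
        segments[segment_key(row)].append(row)
    for frames in segments.values():
        frames.sort(key=lambda r: int(r["timestamp_ms"]))
    return dict(segments)


# Made-up rows: one video contributing two segments (object_id 0 and 1).
detection_rows = [
    {"youtube_id": "AAAAAAAAAAA", "timestamp_ms": 3000, "class_id": 0, "object_id": 0},
    {"youtube_id": "AAAAAAAAAAA", "timestamp_ms": 2000, "class_id": 0, "object_id": 0},
    {"youtube_id": "AAAAAAAAAAA", "timestamp_ms": 9000, "class_id": 0, "object_id": 1},
]
segments = group_by_segment(detection_rows)
```

Grouping on youtube_id alone would merge the two segments in this sample into one.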

Tips

Decode video frames at 30 frames per second, and always enable accurate_seek options so that the frame retrieved matches the requested timestamp.
Always download the highest-quality video format available.
Over time, YouTube videos may be deleted or taken down from public viewing by the uploading user, so some videos referenced in the CSV files may no longer be available.
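
For the first tip, one way to relate an annotated timestamp_ms to a decoded frame is to convert it to a frame index. The Python sketch below assumes the video was decoded at a constant 30 frames per second with frame 0 at time 0; the function name and the round-to-nearest choice are illustrative assumptions, not part of the dataset description.

```python
def frame_index(timestamp_ms, fps=30):
    """Return the index of the frame nearest to timestamp_ms.

    Assumes constant-rate decoding at `fps` with frame 0 at 0 ms.
    Integer arithmetic avoids floating-point rounding surprises.
    """
    return (timestamp_ms * fps + 500) // 1000


# A frame annotated at 2000 ms maps to frame 60 at 30 frames per second.
index_at_2s = frame_index(2000)
```

If the decoder seeks by time rather than by frame number, passing timestamp_ms directly with accurate seeking enabled may be simpler.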