The dataset is composed of four CSV files:
In the classification CSV files, each row describes one frame and the columns are organized as follows:
youtube_id
- (string) the YouTube identifier of the video the segment was extracted from. One may view the selected video at http://youtu.be/${youtube_id}.
timestamp_ms
- (integer) the time in milliseconds of the classified frame in the video.
class_id
- (integer) a numeric identifier for the object class.
class_name
- (string) a human-readable name for the object class.
object_presence
- (string) whether or not the object is present in the frame ('present' or 'absent').

In the detection CSV files, each row describes one frame and the columns are organized as follows:
youtube_id
- same as above.
timestamp_ms
- same as above.
class_id
- same as above.
class_name
- same as above.
object_id
- (integer) an identifier of the object in the video (see the CAUTION note below).
object_presence
- same as above.
xmin
- (float) a number in [0.0, 1.0] indicating the left-most location of the bounding box, relative to the frame width.
xmax
- (float) a number in [0.0, 1.0] indicating the right-most location of the bounding box, relative to the frame width.
ymin
- (float) a number in [0.0, 1.0] indicating the top-most location of the bounding box, relative to the frame height.
ymax
- (float) a number in [0.0, 1.0] indicating the bottom-most location of the bounding box, relative to the frame height.

CAUTION: At most one object is tracked for each video segment, but multiple segments could be extracted from the same video. This means that the youtube_id is not enough to identify a video segment uniquely: it must be used in conjunction with the class_id and object_id values.
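
As an illustration of the caution above, the following Python sketch groups detection rows into segments keyed by (youtube_id, class_id, object_id) and scales a relative bounding box to pixels. It assumes the detection files have no header row and list their columns in the order given above. The helper names (read_detections, group_by_segment, to_pixel_box), the sample rows, and the 1280x720 frame size are hypothetical and not part of the dataset.

```python
import csv
import io
from collections import defaultdict

# Column order as documented above. The absence of a header row is an assumption.
DETECTION_COLUMNS = [
    "youtube_id", "timestamp_ms", "class_id", "class_name", "object_id",
    "object_presence", "xmin", "xmax", "ymin", "ymax",
]


def read_detections(fileobj):
    """Yield one dict per frame, converting numeric fields from text."""
    reader = csv.DictReader(fileobj, fieldnames=DETECTION_COLUMNS)
    for row in reader:
        for key in ("timestamp_ms", "class_id", "object_id"):
            row[key] = int(row[key])
        for key in ("xmin", "xmax", "ymin", "ymax"):
            row[key] = float(row[key])
        yield row


def group_by_segment(rows):
    """Group frames by the triple that identifies a video segment."""
    segments = defaultdict(list)
    for row in rows:
        # youtube_id alone is ambiguous, so class_id and object_id are included.
        key = (row["youtube_id"], row["class_id"], row["object_id"])
        segments[key].append(row)
    return segments


def to_pixel_box(row, frame_width, frame_height):
    """Scale a relative box to pixels as (left, top, right, bottom)."""
    return (
        row["xmin"] * frame_width,
        row["ymin"] * frame_height,
        row["xmax"] * frame_width,
        row["ymax"] * frame_height,
    )


# Made-up rows for illustration only: two segments from the same video.
SAMPLE = """\
abc123,1000,5,dog,0,present,0.25,0.75,0.5,1.0
abc123,2000,5,dog,0,present,0.3,0.8,0.5,1.0
abc123,9000,5,dog,1,present,0.1,0.4,0.2,0.6
"""

segments = group_by_segment(read_detections(io.StringIO(SAMPLE)))
first_frame = segments[("abc123", 5, 0)][0]
pixel_box = to_pixel_box(first_frame, frame_width=1280, frame_height=720)
```

Grouping by youtube_id alone would merge the two sample segments into one. The real frame size is not stored in the CSV files, so it has to come from the decoded video.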