The dataset is composed of four CSV files:
In the classification CSV files, each row describes one frame and the columns are organized as follows:
- youtube_id - (string) the YouTube identifier of the video the segment was extracted from. One may view the selected video at http://youtu.be/${youtube_id}.
- timestamp_ms - (integer) the time in milliseconds of the classified frame in the video.
- class_id - (integer) a numeric identifier for the object class.
- class_name - (string) a human-readable name for the object class.
- object_presence - (string) whether or not the object is present in the frame ('present' or 'absent').

In the detection CSV files, each row describes one frame and the columns are organized as follows:
- youtube_id - same as above.
- timestamp_ms - same as above.
- class_id - same as above.
- class_name - same as above.
- object_id - (integer) an identifier of the object in the video (see the caution below).
- object_presence - same as above.
- xmin - (float) a number in [0.0, 1.0] indicating the left-most location of the bounding box, relative to the frame width.
- xmax - (float) a number in [0.0, 1.0] indicating the right-most location of the bounding box, relative to the frame width.
- ymin - (float) a number in [0.0, 1.0] indicating the top-most location of the bounding box, relative to the frame height.
- ymax - (float) a number in [0.0, 1.0] indicating the bottom-most location of the bounding box, relative to the frame height.

CAUTION: At most one object is tracked for each video segment, but multiple segments may be extracted from the same video. This means that the youtube_id alone is not enough to identify a video segment uniquely: it must be used in conjunction with the class_id and object_id values.
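
As a rough illustration of the caution above, the following Python sketch groups detection rows by the (youtube_id, class_id, object_id) triple and scales a relative bounding box to pixel coordinates. It rests on several assumptions that are not stated in this description: that the files have no header row, that the columns appear in the order listed above, and that the box is only meaningful when object_presence is 'present'. The function names, the sample rows, and the 640x480 frame size are hypothetical and chosen only for illustration.

```python
import csv
import io
from collections import defaultdict

# Column order as listed for the detection CSV files.
# Assumption: the files carry no header row.
DETECTION_COLUMNS = [
    "youtube_id", "timestamp_ms", "class_id", "class_name", "object_id",
    "object_presence", "xmin", "xmax", "ymin", "ymax",
]


def group_by_segment(csv_file):
    """Group detection rows by the (youtube_id, class_id, object_id) key."""
    segments = defaultdict(list)
    reader = csv.DictReader(csv_file, fieldnames=DETECTION_COLUMNS)
    for row in reader:
        key = (row["youtube_id"], int(row["class_id"]), int(row["object_id"]))
        segments[key].append(row)
    return dict(segments)


def box_to_pixels(row, frame_width, frame_height):
    """Return (xmin, xmax, ymin, ymax) in pixels, or None if the object is absent."""
    # Assumption: box values are only meaningful for 'present' rows.
    if row["object_presence"] != "present":
        return None
    return (
        float(row["xmin"]) * frame_width,
        float(row["xmax"]) * frame_width,
        float(row["ymin"]) * frame_height,
        float(row["ymax"]) * frame_height,
    )


# Made-up rows for illustration only; they are not taken from the dataset.
SAMPLE = io.StringIO(
    "vid_a,1000,0,person,0,present,0.25,0.75,0.5,1.0\n"
    "vid_a,2000,0,person,0,present,0.25,0.5,0.5,1.0\n"
    "vid_a,1000,1,bird,0,present,0.0,0.5,0.0,0.5\n"
)

segments = group_by_segment(SAMPLE)
first_row = segments[("vid_a", 0, 0)][0]
pixel_box = box_to_pixels(first_row, 640, 480)
print(len(segments), pixel_box)
```

Note that the two made-up segments share the same youtube_id and are separated only by class_id, which is why the full triple is used as the dictionary key.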