FGVC: the Fine-Grained Video Classification Dataset


The Fine-Grained Video Classification (FGVC) dataset consists of two subsets, YouTube-Birds and YouTube-Cars, as described in our paper. As the names indicate, all videos come from YouTube. We employed workers to annotate the videos: they watched each video and judged whether it belongs to a given class. YouTube-Birds has 12666/5684 training/test videos for 200 bird species, while YouTube-Cars has 10238/4855 training/test videos for 196 car models. The two subsets use the same taxonomies as CUB-200-2011 and the Stanford Cars dataset, respectively.


Here we provide some benchmarks on the dataset as well as the YouTube ids of the annotated videos.


Accuracies (%) of different models on the datasets

Method                         YouTube-Birds   YouTube-Cars
BN-Inception (Single Frame)    60.13           61.96
I3D (ResNet-50) [1]            40.68           40.92
TSN [2]                        72.36           74.34
RRA [3]                        73.21           77.63


Download: YouTube-Birds, YouTube-Cars. We provide only the video ids; the videos themselves can be retrieved from YouTube using these ids.
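As a rough illustration of going from ids to videos, the sketch below turns video ids into standard YouTube watch URLs, which can then be passed to a downloader of your choice. The layout of the id files is not described here, so the code assumes one entry per line with the video id as the first whitespace-separated token. The function names video_url and urls_from_lines are hypothetical, and the 11-character id check reflects the id format YouTube currently uses, not a documented guarantee.

```python
import re
from typing import Iterable, List

# Assumption: YouTube video ids are 11 characters drawn from
# letters, digits, "-" and "_" (the format currently in use).
_ID_PATTERN = re.compile(r"[A-Za-z0-9_-]{11}")


def video_url(video_id: str) -> str:
    """Return the standard YouTube watch URL for a video id."""
    if not _ID_PATTERN.fullmatch(video_id):
        raise ValueError(f"not a valid YouTube video id: {video_id!r}")
    return f"https://www.youtube.com/watch?v={video_id}"


def urls_from_lines(lines: Iterable[str]) -> List[str]:
    """Build watch URLs from lines whose first token is a video id.

    Assumption: one entry per line, id first; anything after the id
    (for example a class label) is ignored. Blank lines are skipped.
    """
    urls = []
    for line in lines:
        tokens = line.split()
        if tokens:
            urls.append(video_url(tokens[0]))
    return urls
```

For a downloaded id file, something like urls_from_lines(open(path)) would yield one URL per annotated video. Some videos may have been removed from YouTube since annotation, so expect a fraction of the URLs to fail.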


[1] J. Carreira and A. Zisserman. Quo vadis, action recognition? A new model and the Kinetics dataset. Computer Vision and Pattern Recognition (CVPR) 2017.
[2] L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. Van Gool. Temporal segment networks: Towards good practices for deep action recognition. European Conference on Computer Vision (ECCV) 2016.
[3] C. Zhu, X. Tan, F. Zhou, X. Liu, K. Yue, E. Ding, and Y. Ma. Fine-grained video categorization with redundancy reduction attention. European Conference on Computer Vision (ECCV) 2018.