About
The Fine-Grained Video Classification (FGVC) dataset consists of two subsets, YouTube-Birds and YouTube-Cars, as described in our paper. As the names indicate, all videos come from YouTube. We employed workers to annotate the videos by watching each one and judging whether it belongs to a given class. YouTube-Birds has 12666/5684 training/test videos covering 200 bird species, and YouTube-Cars has 10238/4855 training/test videos covering 196 car models. The two subsets use the same taxonomies as CUB-200-2011 and the Stanford Cars dataset, respectively.
Here we provide some benchmark results on the dataset, as well as the YouTube IDs of the annotated videos.
| BN-Inception (Single Frame) | 60.13 | 61.96 |
| I3D (ResNet-50) | 40.68 | 40.92 |
Download
Click to download: YouTube-Birds, YouTube-Cars. Only the video IDs are provided here; the videos themselves can be retrieved from YouTube using these IDs.
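As a minimal sketch of how the IDs might be turned into video links, the Python snippet below builds standard YouTube watch URLs. The layout of the downloaded ID files is not described here, so the code assumes one video per line with the ID as the first whitespace-separated token. The function names `video_url` and `urls_from_lines` are hypothetical helpers introduced only for illustration.

```python
from urllib.parse import urlencode


def video_url(video_id: str) -> str:
    """Build a standard YouTube watch URL for a single video ID."""
    return "https://www.youtube.com/watch?" + urlencode({"v": video_id.strip()})


def urls_from_lines(lines):
    """Yield one watch URL per non-empty line.

    Assumption: the video ID is the first whitespace-separated token on
    each line; any further tokens (for example a class label) are ignored.
    """
    for line in lines:
        tokens = line.split()
        if tokens:
            yield video_url(tokens[0])


if __name__ == "__main__":
    # Placeholder IDs for illustration only; these are not dataset entries.
    sample = ["VIDEO_ID_1 0\n", "\n", "VIDEO_ID_2 1\n"]
    for url in urls_from_lines(sample):
        print(url)
```

If the actual files use a different layout, only the token selection in `urls_from_lines` should need to change. Some videos may no longer be available on YouTube, so a download script should tolerate missing entries.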
References
J. Carreira and A. Zisserman. Quo vadis, action recognition? A new model and the Kinetics dataset. Computer Vision and Pattern Recognition (CVPR), 2017.
L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. Van Gool. Temporal segment networks: Towards good practices for deep action recognition. European Conference on Computer Vision (ECCV), 2016.
C. Zhu, X. Tan, F. Zhou, X. Liu, K. Yue, E. Ding, and Y. Ma. Fine-grained video categorization with redundancy reduction attention. European Conference on Computer Vision (ECCV), 2018.