Teaching Matters: Investigating the Role of Supervision in Vision Transformers

CVPR 2023

Matthew Walmer*
Saksham Suri*
Kamal Gupta
Abhinav Shrivastava

[Paper]
[GitHub]


ViTs exhibit highly varied behaviors depending on their method of training. In this work, we compare ViTs through three domains of analysis representing the How, What, and Why of ViTs. How do ViTs process information through attention? (Left) Attention maps averaged over 5000 images show clear differences in the mid-to-late layers. What do ViTs learn to represent? (Center) Contrastive self-supervised ViTs show greater feature similarity to explicitly supervised ViTs, but also some similarity to ViTs trained through masked reconstruction. Why do we care about using ViTs? (Right) We evaluate ViTs on a variety of global and local tasks and show that the best model and layer vary greatly.


Abstract

Vision Transformers (ViTs) have gained significant popularity in recent years and have proliferated into many applications. However, it is not well explored how varied their behavior is under different learning paradigms. We compare ViTs trained through different methods of supervision, and show that they learn a diverse range of behaviors in terms of their attention, representations, and downstream performance. We also discover ViT behaviors that are consistent across supervision, including the emergence of Offset Local Attention Heads. These are self-attention heads which attend to a token adjacent to the current token with a fixed directional offset, a phenomenon that to the best of our knowledge has not been highlighted in any prior work. Through our analysis, we show that ViTs are highly flexible and learn to process local and global information in different orders depending on their training method. We find that contrastive self-supervised methods learn features that are competitive with explicitly supervised features, and they can even be superior for part-level tasks. We also find that the representations of reconstruction-based models show non-trivial similarity to contrastive self-supervised models. Finally, we show how the "best" layer for a given task varies by both supervision method and task, further demonstrating the differing order of information processing in ViTs.


Key Findings

Attention Findings


Clear differences in attention emerge in the mid-to-late layers under different supervision methods. These plots show the attention maps of CLS tokens averaged over 5000 ImageNet images. Rows indicate layers and columns indicate heads. For brevity, we show only three heads per layer. The bracketed numbers in the lower half denote the layer and head.
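
As a rough sketch of how such averaged maps can be computed (assuming attention weights have already been extracted from each block, e.g. via forward hooks), the CLS-token query row of each head's attention matrix is averaged over images and reshaped to the patch grid. The helper `collect_attention` below is hypothetical.

```python
import torch

def average_cls_attention(attn_maps):
    """Average CLS-token attention over a set of images.

    attn_maps: (num_images, num_layers, num_heads, T, T) attention weights,
               where T = 1 + grid*grid patch tokens and the CLS token sits
               at index 0.
    Returns:   (num_layers, num_heads, grid, grid) per-head CLS-to-patch
               attention averaged over all images.
    """
    _, L, Hd, T, _ = attn_maps.shape
    grid = int((T - 1) ** 0.5)                 # e.g. 14 for ViT-B/16 at 224px
    cls_to_patch = attn_maps[:, :, :, 0, 1:]   # CLS query row, patch keys only
    return cls_to_patch.mean(dim=0).reshape(L, Hd, grid, grid)

# Hypothetical usage: `collect_attention` stands for a routine that runs the
# ViT with attention outputs enabled and stacks the per-image maps.
# avg_maps = average_cls_attention(collect_attention(model, image_loader))
```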


Multiple distinct forms of local attention exist. We visualize spatial token attention using Aligned Aggregated Attention Maps, and highlight different types of local attention heads, including Strict, Soft, Axial, and Offset Local Attention Heads. In the bottom right, the mid-lines are shown in red for reference.
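
One way to build such an aligned aggregation (a sketch of the idea, not necessarily the exact procedure used in the paper) is to shift each spatial query's attention map so that the query position lands at the center before averaging. Strictly local heads then peak at the center, while Offset Local Attention Heads peak at a fixed displacement from it.

```python
import torch

def aligned_aggregate(attn, grid):
    """Aggregate spatial-token attention aligned on the query position.

    attn: (num_heads, T, T) attention for one image, with the CLS token at
          index 0 followed by grid*grid patch tokens.
    Returns a (num_heads, 2*grid-1, 2*grid-1) map centered on the query, so
    local heads peak at the center and offset local heads peak off-center.
    """
    Hd = attn.shape[0]
    out = torch.zeros(Hd, 2 * grid - 1, 2 * grid - 1)
    patch_attn = attn[:, 1:, 1:].reshape(Hd, grid, grid, grid, grid)
    for qy in range(grid):
        for qx in range(grid):
            # place this query's attention so that (qy, qx) maps to the center
            y0, x0 = grid - 1 - qy, grid - 1 - qx
            out[:, y0:y0 + grid, x0:x0 + grid] += patch_attn[:, qy, qx]
    return out / (grid * grid)
```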


Different methods of supervision lead to different orderings and ratios of local and global processing. We show the Average Attention Distance of all ViT attention heads organized by layer (left), and the per-layer averages (right).
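
Average Attention Distance can be computed as the attention-weighted mean distance between a query patch and the patches it attends to, averaged over queries (and, in practice, over images). The sketch below assumes a square patch grid and measures distances in pixels between patch positions.

```python
import torch

def average_attention_distance(attn, grid, patch_size=16):
    """Attention-weighted mean distance (in pixels) from query to key patches.

    attn: (num_heads, T, T) attention with the CLS token at index 0; only
          patch-to-patch attention is used here.
    Returns a (num_heads,) tensor, one average distance per head.
    """
    coords = torch.stack(torch.meshgrid(
        torch.arange(grid), torch.arange(grid), indexing="ij"), dim=-1)
    coords = coords.reshape(-1, 2).float() * patch_size        # (N, 2)
    dist = torch.cdist(coords, coords)                         # pairwise (N, N)
    patch_attn = attn[:, 1:, 1:]                               # (heads, N, N)
    # renormalize so each query's patch attention sums to 1
    patch_attn = patch_attn / patch_attn.sum(dim=-1, keepdim=True)
    return (patch_attn * dist).sum(dim=-1).mean(dim=-1)        # average over queries
```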


Attention IoU with salient content plateaus early for all ViTs evaluated. We calculate the alignment of ground-truth segmentation masks with CLS token attention maps (left) and the average of spatial token attention maps (right).
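
A minimal sketch of the IoU computation, assuming the attention map is binarized by keeping the smallest set of cells that covers a fixed fraction of the total attention mass (the exact thresholding rule is an assumption), and that the ground-truth mask has been resized to the attention resolution.

```python
import torch

def attention_iou(attn_map, gt_mask, keep_mass=0.6):
    """IoU between a binarized attention map and a ground-truth mask.

    attn_map: (H, W) non-negative attention (e.g. averaged CLS attention).
    gt_mask:  (H, W) binary segmentation mask at the same resolution.
    """
    flat = attn_map.flatten()
    order = torch.argsort(flat, descending=True)
    cum = torch.cumsum(flat[order], dim=0) / flat.sum()
    keep = torch.zeros_like(flat, dtype=torch.bool)
    # keep the highest-attention cells covering `keep_mass` of the total mass
    keep[order[: int((cum < keep_mass).sum()) + 1]] = True
    pred = keep.reshape(attn_map.shape)
    gt = gt_mask.bool()
    inter = (pred & gt).sum().float()
    union = (pred | gt).sum().float()
    return (inter / union.clamp(min=1)).item()
```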

Feature Findings


On the left, we show clustering purity analysis with image-level labels on ImageNet-50 for CLS token features. The center and right plots show performance on the ImageNet and Oxford-5k datasets for the k-NN Classification and Retrieval tasks, respectively. Both of these tasks also use the CLS token for evaluation.
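
The k-NN evaluation can be sketched as cosine-similarity nearest neighbors over frozen CLS features. The snippet below uses plain majority voting with an assumed k=20, a simplification of the usual similarity-weighted voting protocol.

```python
import torch

def knn_classify(train_feats, train_labels, test_feats, k=20):
    """Simple k-NN classification over frozen CLS features.

    train_feats: (N, D), test_feats: (M, D); features are L2-normalized so
    cosine similarity reduces to a dot product.
    Returns predicted labels of shape (M,).
    """
    train = torch.nn.functional.normalize(train_feats, dim=1)
    test = torch.nn.functional.normalize(test_feats, dim=1)
    sims = test @ train.t()                          # (M, N) cosine similarity
    _, idx = sims.topk(k, dim=1)                     # k nearest neighbors
    neighbor_labels = train_labels[idx]              # (M, k)
    preds, _ = torch.mode(neighbor_labels, dim=1)    # majority vote
    return preds
```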


For Part Clustering on PartImageNet (left), the self-supervised methods perform competitively. On the dense prediction task of Video Segmentation (right), DINO and MoCo again show strong performance.
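
Clustering purity, used for both the image-level and part-level clustering analyses, can be computed as the fraction of samples assigned to the majority label of their cluster. A minimal sketch follows; the clustering itself (e.g. k-means over frozen ViT features) is assumed to have been run already.

```python
import torch

def cluster_purity(cluster_ids, labels):
    """Purity of a clustering against ground-truth labels.

    cluster_ids: (N,) integer cluster assignment per sample.
    labels:      (N,) integer ground-truth class or part labels.
    Purity = (1/N) * sum over clusters of the count of that cluster's
    most common label.
    """
    total_majority = 0
    for c in cluster_ids.unique():
        members = labels[cluster_ids == c]
        total_majority += members.bincount().max().item()
    return total_majority / labels.numel()
```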

Downstream Task Findings


Local (pixel-level) downstream task analysis using dense spatial token features. We perform DAVIS video segmentation (left) and SPair-71k keypoint correspondence (right).
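
With frozen features, keypoint correspondence reduces to nearest-neighbor matching in feature space. The sketch below transfers each source keypoint to the most similar target patch; it is an assumed simplification of the full SPair-71k evaluation protocol.

```python
import torch

def match_keypoints(src_feats, tgt_feats, src_kps, patch_size=16):
    """Nearest-neighbor keypoint transfer with dense spatial token features.

    src_feats, tgt_feats: (H, W, D) patch features from the two images.
    src_kps: (K, 2) keypoint (x, y) pixel coordinates in the source image.
    Returns (K, 2) predicted (x, y) patch-center coordinates in the target.
    """
    H, W, D = tgt_feats.shape
    tgt = torch.nn.functional.normalize(tgt_feats.reshape(-1, D), dim=1)
    preds = []
    for x, y in src_kps:
        # patch index of the source keypoint, clamped to the grid
        gy = min(int(y) // patch_size, src_feats.shape[0] - 1)
        gx = min(int(x) // patch_size, src_feats.shape[1] - 1)
        q = torch.nn.functional.normalize(src_feats[gy, gx], dim=0)
        idx = (tgt @ q).argmax().item()              # most similar target patch
        py, px = divmod(idx, W)
        preds.append([px * patch_size + patch_size // 2,
                      py * patch_size + patch_size // 2])
    return torch.tensor(preds, dtype=torch.float)
```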


Best performance for each ViT on each downstream task, with the corresponding best layer in parentheses. As shown in the table, no single model performs best at all tasks, and the best-performing layer depends on both the task and the training method.



Paper

M. Walmer*, S. Suri*, K. Gupta, A. Shrivastava
Teaching Matters: Investigating the Role of Supervision in Vision Transformers (Link)
*Equal Contribution


