Teaching Matters: Investigating the Role of Supervision in Vision Transformers

Matthew Walmer*
Saksham Suri*
Kamal Gupta
Abhinav Shrivastava


ViTs exhibit highly varied behaviors depending on their method of training. In this work, we compare ViTs through three domains of analysis representing the How, What, and Why of ViTs. How do ViTs process information through attention? (Left) Attention maps averaged over 5000 images show clear differences in the mid-to-late layers. What do ViTs learn to represent? (Center) Contrastive self-supervised ViTs have a greater feature similarity to explicitly supervised ViTs, but also have some similarity with ViTs trained through masked reconstruction. Why do we care about using ViTs? (Right) We evaluate ViTs on a variety of global and local tasks and show that the best model and layer vary greatly.


Vision Transformers (ViTs) have gained significant popularity in recent years and have proliferated into many applications. However, how their behavior varies under different learning paradigms is not well explored. We compare ViTs trained through different methods of supervision, and show that they learn a diverse range of behaviors in terms of their attention, representations, and downstream performance. We also discover ViT behaviors that are consistent across supervision methods, including the emergence of Offset Local Attention Heads. These are self-attention heads that attend to a token adjacent to the current token with a fixed directional offset, a phenomenon that, to the best of our knowledge, has not been highlighted in any prior work. Through our analysis, we show that ViTs are highly flexible and learn to process local and global information in different orders depending on their training method. We find that contrastive self-supervised methods learn features that are competitive with explicitly supervised features, and they can even be superior for part-level tasks. We also find that the representations of reconstruction-based models show non-trivial similarity to contrastive self-supervised models. Finally, we show how the "best" layer for a given task varies by both supervision method and task, further demonstrating the differing order of information processing in ViTs.

Key Findings

Attention Findings

Clear differences in attention emerge in the mid-to-late layers under different supervision methods. These plots show the attention maps of CLS tokens averaged over 5000 ImageNet images. Rows indicate layers and columns indicate heads. For brevity, we show only three heads per layer. The bracketed numbers in the lower half denote the layer and head.

Multiple distinct forms of local attention exist. We visualize spatial token attention using Aligned Aggregated Attention Maps, and highlight different types of local attention heads, including Strict, Soft, Axial, and Offset Local Attention Heads. In the bottom right, the mid-lines are shown in red for reference.
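To make the idea of aligning and aggregating attention maps concrete, here is a minimal sketch. It assumes that each spatial token's attention map is shifted so the query token sits at the center of a larger canvas, and that the shifted maps are then averaged over all query positions. The function names, the canvas size, and the normalization are illustrative assumptions and may differ from the construction used for the figures.

```python
def aligned_aggregated_attention(attn, grid_size):
    """Average per-token attention maps after centering each on its query.

    attn: N x N nested lists, N = grid_size**2, spatial tokens only
          (row q holds the attention of query token q over all keys).
    Returns a (2*grid_size - 1) square map whose center cell is the
    query position. Normalizing by N is an assumption of this sketch.
    """
    g = grid_size
    size = 2 * g - 1
    canvas = [[0.0] * size for _ in range(size)]
    for q in range(g * g):
        qy, qx = divmod(q, g)
        for k in range(g * g):
            ky, kx = divmod(k, g)
            # Offset of the key relative to the query, shifted to the center.
            canvas[ky - qy + g - 1][kx - qx + g - 1] += attn[q][k]
    n = g * g
    return [[v / n for v in row] for row in canvas]


def right_offset_head(grid_size):
    """Hypothetical toy head: each token attends to its right neighbor,
    or to itself when it sits on the right edge of the grid."""
    g = grid_size
    attn = [[0.0] * (g * g) for _ in range(g * g)]
    for q in range(g * g):
        _, qx = divmod(q, g)
        k = q + 1 if qx + 1 < g else q
        attn[q][k] = 1.0
    return attn


aligned = aligned_aggregated_attention(right_offset_head(3), 3)
center = 3 - 1
# The strongest cell is one step to the right of the center.
peak = max(
    ((y, x) for y in range(5) for x in range(5)),
    key=lambda p: aligned[p[0]][p[1]],
)
```

In this toy case the aggregated map peaks one cell to the right of the center, which is the kind of fixed directional offset that would identify an Offset Local Attention Head.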

Different methods of supervision lead to different orderings and ratios of local and global processing. We show the Average Attention Distance of all ViT attention heads organized by layer (left), and the per-layer averages (right).
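As a rough illustration of how an average attention distance can be computed for one head, the sketch below weights the Euclidean distance between query and key patches by the attention value and averages over queries. It assumes distances are measured in patch units and that only spatial tokens are included; the function name and these choices are assumptions, not necessarily the exact definition behind the plots.

```python
import math


def average_attention_distance(attn, grid_size):
    """Mean attention-weighted distance between query and key patches.

    attn: N x N nested lists, N = grid_size**2, spatial tokens only;
          each row is assumed to sum to 1.
    Distances are in patch units (an assumption of this sketch).
    """
    n = grid_size * grid_size
    total = 0.0
    for q in range(n):
        qy, qx = divmod(q, grid_size)
        for k in range(n):
            ky, kx = divmod(k, grid_size)
            total += attn[q][k] * math.hypot(qy - ky, qx - kx)
    return total / n


# A head that only attends to the query token itself is maximally local.
identity = [[1.0 if q == k else 0.0 for k in range(4)] for q in range(4)]
local_distance = average_attention_distance(identity, 2)

# A head with uniform attention over a 2x2 grid is more global.
uniform = [[0.25] * 4 for _ in range(4)]
global_distance = average_attention_distance(uniform, 2)
```

Under this definition, low values indicate local heads and high values indicate global heads, so plotting the value per head and per layer shows where in the network each kind of processing occurs.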

Attention IoU with salient content plateaus early for all ViTs evaluated. We calculate the alignment of ground-truth segmentation masks with CLS token attention maps (left) and the average of spatial token attention maps (right).
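The alignment between an attention map and a segmentation mask can be sketched as an intersection-over-union score. The example below binarizes the attention map with a threshold (by default the mean of the map) before comparing it with the mask. The binarization rule and the function name are assumptions made for illustration; the evaluation above may use a different procedure.

```python
def attention_iou(attn_map, mask, threshold=None):
    """IoU between a binarized attention map and a binary mask.

    attn_map: 2D nested lists of attention values on the patch grid.
    mask:     2D nested lists of 0/1 values with the same shape.
    threshold: cells with attention above it count as attended;
               defaults to the map's mean (an assumption of this sketch).
    """
    flat = [v for row in attn_map for v in row]
    if threshold is None:
        threshold = sum(flat) / len(flat)
    inter = 0
    union = 0
    for attn_row, mask_row in zip(attn_map, mask):
        for a, m in zip(attn_row, mask_row):
            attended = a > threshold
            salient = bool(m)
            inter += int(attended and salient)
            union += int(attended or salient)
    return inter / union if union else 0.0


attn_map = [[0.4, 0.4], [0.1, 0.1]]
half_overlap = attention_iou(attn_map, [[1, 0], [0, 0]])
full_overlap = attention_iou(attn_map, [[1, 1], [0, 0]])
```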

Feature Findings

On the left, we show clustering purity analysis with image-level labels in ImageNet-50 for CLS token features. The center and right plots show performance on the ImageNet and Oxford-5k datasets for the k-NN Classification and Retrieval tasks, respectively. Both tasks also use the CLS token for evaluation.
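For readers unfamiliar with cluster purity, the sketch below shows one common definition: each cluster is credited with the count of its most frequent label, and the credits are summed and divided by the total number of samples. The cluster assignments would come from clustering the CLS features (for example with k-means), which is not shown. The function name and this particular definition are assumptions and may not match the analysis exactly.

```python
from collections import Counter, defaultdict


def cluster_purity(cluster_ids, labels):
    """Fraction of samples that carry the majority label of their cluster.

    cluster_ids: cluster assignment per sample (any hashable values).
    labels:      ground-truth image-level label per sample.
    """
    by_cluster = defaultdict(Counter)
    for cluster, label in zip(cluster_ids, labels):
        by_cluster[cluster][label] += 1
    majority = sum(max(counts.values()) for counts in by_cluster.values())
    return majority / len(labels)


# Toy example: cluster 0 is mixed, cluster 1 is pure.
purity = cluster_purity(
    [0, 0, 0, 1, 1, 1],
    ["cat", "cat", "dog", "dog", "dog", "dog"],
)
```

A purity of 1.0 means every cluster contains a single class, so higher values suggest that the features separate the image-level labels more cleanly.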

For Part Clustering on PartImageNet (left), the self-supervised methods perform competitively. On the dense prediction task of Video Segmentation (right), DINO and MoCo again show good performance.

Downstream Task Findings

Local (pixel-level) downstream task analysis using dense spatial token features. We perform DAVIS video segmentation (left) and SPair-71k keypoint correspondence (right).
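Dense tasks such as keypoint correspondence can be approached by matching spatial token features between two images. The sketch below assumes a simple nearest-neighbor rule under cosine similarity: the feature at a source keypoint is compared with every spatial token feature of the target image, and the most similar token is returned. The function names and the matching rule are illustrative assumptions, not a description of the exact evaluation protocol.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two feature vectors given as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def match_keypoint(src_feats, tgt_feats, src_index):
    """Index of the target token most similar to one source token.

    src_feats, tgt_feats: lists of per-token feature vectors taken from
    the same ViT layer for the source and target images.
    """
    query = src_feats[src_index]
    return max(
        range(len(tgt_feats)),
        key=lambda j: cosine_similarity(query, tgt_feats[j]),
    )


# Toy 2-D features: source token 0 points along x, source token 1 along y.
src = [[1.0, 0.0], [0.0, 1.0]]
tgt = [[0.0, 2.0], [3.0, 0.1]]
match_for_0 = match_keypoint(src, tgt, 0)
match_for_1 = match_keypoint(src, tgt, 1)
```

Running the same matching with features from different layers is one way to see how the best layer for a local task can differ between models.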

Best performance for each ViT on each downstream task, with the corresponding best layer in parentheses. As the table shows, no single model performs best on all tasks, and the best-performing layer depends on both the task and the training method.


M. Walmer*, S. Suri*, K. Gupta, A. Shrivastava
Teaching Matters: Investigating the Role of Supervision in Vision Transformers
*Equal Contribution
