Overview
Task-agnostic feature upsampling has emerged as a way to obtain dense visual features from pre-trained backbones without paying the full quadratic cost of high-resolution self-attention. However, recent approaches rely on cross-attention-based feature pooling and thus inherit the same quadratic scaling as the backbones they upsample. We introduce UPLiFT (Universal Pixel-dense Lightweight Feature Transforms), an iterative upsampler that revisits early convolutional approaches and shows they can match or surpass recent cross-attention-based methods at lower inference cost.
UPLiFT is built around a new Local Attender module, a fully local attentional pooling operator that aggregates features over a fixed neighborhood using learned weights, avoiding global query–key–value attention while still preserving the backbone’s feature distribution. This enables UPLiFT to produce semantically stable, pixel-dense features more efficiently than recent methods. UPLiFT-upsampled features also achieve competitive or superior performance on a range of discriminative and generative downstream tasks.
Applications of UPLiFT
UPLiFT is designed as a task-agnostic feature upsampler and can be plugged into both discriminative and generative pipelines without modifying the underlying backbone or generator.
- Predictive tasks: UPLiFT upsamples DINOv2 or DINOv3 features for semantic segmentation and monocular depth estimation, improving mIoU and depth accuracy over prior feature upsamplers while running faster than recent cross-attention-based methods.
- Generative tasks: Applied to Stable Diffusion VAEs, UPLiFT upsamples latent codes for efficient text-to-image upscaling and 4× image super-resolution, reaching quality comparable to CFM with significantly lower compute and fewer parameters.
Method Overview
UPLiFT follows an iterative upsampling design: a single compact decoder is applied multiple times to grow coarse feature maps to pixel density, guided by shallow, high-resolution encoder features from the input image. At each step, a Local Attender enforces consistency with the original backbone features while only accessing a small neighborhood around each token.
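A minimal sketch of this loop is shown below. The component interfaces (`encoder`, `decoder`, `attender`, and the `num_steps` argument) are assumptions for illustration, standing in for $E_{\text{UPLiFT}}$, $D_{\text{UPLiFT}}$, and the Local Attender described under Key Components; the exact wiring between decoder and attender may differ in the reference implementation.

```python
import torch.nn.functional as F

def uplift_upsample(image, backbone_feats, encoder, decoder, attender, num_steps=4):
    """Iteratively grow coarse backbone features to pixel density (sketch)."""
    guide_full = encoder(image)        # shallow high-res guide features, computed once
    feats = backbone_feats             # coarse backbone feature map
    for _ in range(num_steps):
        h, w = feats.shape[-2] * 2, feats.shape[-1] * 2
        # nearest-neighbor downsample of the full-res guide to this step's resolution
        guide = F.interpolate(guide_full, size=(h, w), mode="nearest")
        feats = decoder(feats, guide)  # shared 2x decoder, reused at every step
        feats = attender(guide, feats) # local pooling keeps the feature distribution stable
    return feats
```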
Key Components
- UPLiFT Encoder ($E_{\text{UPLiFT}}$): A shallow convolutional encoder that processes the input image once and outputs dense, high-resolution guide features. These features are downsampled via nearest-neighbor for each upsampling step, avoiding repeated encodings at larger resolutions.
- UPLiFT Decoder ($D_{\text{UPLiFT}}$): A lightweight convolutional decoder trained to perform 2× upsampling. The same module is reused across steps to grow low-resolution backbone features to pixel-dense maps.
- Local Attender: A local attention operator that uses the guide features to predict attention weights over a fixed offset neighborhood around each low-resolution token, and then linearly recombines value features from the backbone. This preserves the backbone’s feature distribution while avoiding global attention.
- Multi-step training: UPLiFT is trained with a feature reconstruction loss at multiple resolutions, encouraging stability across all intermediate upsampling stages.
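A minimal sketch of the multi-step objective follows. It assumes each intermediate prediction is average-pooled back to the backbone resolution and compared against the original backbone features under an MSE loss; the paper may use a different target or distance at each scale.

```python
import torch.nn.functional as F

def multistep_recon_loss(intermediate_feats, target_feats):
    """Feature reconstruction loss accumulated over all upsampling stages (sketch)."""
    h, w = target_feats.shape[-2:]
    loss = 0.0
    for feats in intermediate_feats:
        # pool each intermediate map back to the backbone's resolution
        pooled = F.adaptive_avg_pool2d(feats, (h, w))
        loss = loss + F.mse_loss(pooled, target_feats)
    return loss / len(intermediate_feats)
```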
Local Attender Operator
Inputs: Guide feature map $G \in \mathbb{R}^{H_g \times W_g \times C_G}$, value feature map $V \in \mathbb{R}^{H_v \times W_v \times C_V}$, neighborhood offsets $\mathcal{N}$
Output: Upsampled feature map $Y \in \mathbb{R}^{H_g \times W_g \times C_V}$ aligned with $G$
| 1: | Project $G$ with a $1 \times 1$ convolution to logits $A \in \mathbb{R}^{H_g \times W_g \times |\mathcal{N}|}$ |
| 2: | Apply softmax over the neighborhood dimension to obtain attention weights $\alpha$ |
| 3: | For each spatial position $(x, y)$ in $G$, map to the corresponding low-resolution position $(\bar{x}, \bar{y}) = (\lfloor x / s \rfloor, \lfloor y / s \rfloor)$, where $s = H_g / H_v$, and gather the local value features $\{V_{\bar{x}+i,\, \bar{y}+j} : (i,j) \in \mathcal{N}\}$ from $V$ (with padding at the borders) |
| 4: | Compute $Y_{x,y} = \sum_{k=1}^{|\mathcal{N}|} \alpha_{x,y,k} \, V_{\bar{x}+i_k,\, \bar{y}+j_k}$ |
| 5: | Return $Y$ as the locally attended, upsampled value feature map |
Because the neighborhood size $|\mathcal{N}|$ is fixed, the computational and memory cost of the Local Attender scales as $\mathcal{O}(|\mathcal{N}| \cdot T)$, where $T$ is the number of spatial tokens, yielding linear scaling in the number of patches.
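The sketch below implements the operator above in PyTorch for a square $(2r+1) \times (2r+1)$ offset neighborhood. It is an illustrative reconstruction, not the authors' reference code: the class and argument names, the zero padding at borders, and the nearest-neighbor replication used to align low-resolution neighborhoods with the guide grid are all assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocalAttender(nn.Module):
    """Local attentional pooling over a fixed (2r+1) x (2r+1) offset neighborhood (sketch)."""

    def __init__(self, guide_dim: int, radius: int = 1):
        super().__init__()
        self.kernel = 2 * radius + 1
        self.radius = radius
        self.n_offsets = self.kernel ** 2                 # |N|
        # Step 1: 1x1 conv maps guide features to one logit per neighborhood offset
        self.to_logits = nn.Conv2d(guide_dim, self.n_offsets, kernel_size=1)

    def forward(self, guide: torch.Tensor, value: torch.Tensor) -> torch.Tensor:
        # guide: (B, C_G, H_g, W_g); value: (B, C_V, H_v, W_v) with H_g >= H_v
        B, _, Hg, Wg = guide.shape
        Cv, Hv, Wv = value.shape[1:]
        # Steps 1-2: attention weights over the |N| offsets at every guide position
        alpha = self.to_logits(guide).softmax(dim=1)      # (B, |N|, Hg, Wg)
        # Step 3: gather each low-res token's neighborhood (zero-padded at borders),
        # then replicate to guide resolution so each guide position sees the
        # neighborhood of its corresponding low-res token
        patches = F.unfold(value, kernel_size=self.kernel, padding=self.radius)
        patches = patches.reshape(B, Cv * self.n_offsets, Hv, Wv)
        patches = F.interpolate(patches, size=(Hg, Wg), mode="nearest")
        patches = patches.reshape(B, Cv, self.n_offsets, Hg, Wg)
        # Step 4: linear recombination of neighborhood values with the learned weights
        return (patches * alpha.unsqueeze(1)).sum(dim=2)  # (B, C_V, Hg, Wg)
```

Note that every tensor above has at most $|\mathcal{N}|$ entries per spatial position, which is where the $\mathcal{O}(|\mathcal{N}| \cdot T)$ cost comes from: there is no $T \times T$ attention matrix at any point.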
Results
Predictive Tasks: Segmentation and Depth
Using DINOv2-S/14 as the backbone and training only linear probes on top of upsampled features, UPLiFT achieves higher semantic segmentation performance than prior feature upsamplers across COCO-Stuff, VOC, ADE20K, and Cityscapes, while maintaining lower inference times than recent cross-attention-based alternatives.
For monocular depth on COCO-Stuff, UPLiFT attains competitive thresholded accuracy and the lowest or near-lowest RMSE among all methods, indicating that local upsampling with the Local Attender still captures the global structure required for depth reasoning.
Generative Tasks: Text-to-Image Upscaling and Super-Resolution
On COCO and reLAION benchmarks, UPLiFT improves FID and related metrics for 512→1024 text-to-image upscaling compared to CFM, while running faster. For 4× super-resolution on FacesHQ and LHQ, UPLiFT provides strong SSIM and PSNR with only two upsampling steps and latency close to simple bilinear upsampling in latent space.
Citation
If you find our work useful in your research, please consider citing:
@article{walmer2026uplift,
  title={UPLiFT: Efficient Pixel-Dense Feature Upsampling with Local Attenders},
  author={Walmer, Matthew and Suri, Saksham and Aggarwal, Anirud and Shrivastava, Abhinav},
  journal={arXiv preprint arXiv:2601.17950},
  year={2026}
}