Leveraging Sparsity for Efficient Inference of High-Resolution Vision Foundation Models


Xin Xu (University of Illinois Urbana-Champaign), Jason Kuen (Adobe), Brian L Price (Adobe), Kangning Liu (Adobe), Zijun Wei (Adobe), Yu-Xiong Wang (University of Illinois Urbana-Champaign)
The 36th British Machine Vision Conference

Abstract

Resolution scaling enhances the performance of Vision Transformers (ViTs) but incurs substantial computational costs due to the quadratic time complexity of self-attention. Our study reveals that sparsity naturally emerges in the attention maps of pre-trained vision encoders. Building on this observation, we introduce Sparse Vision Encoder (SVE), a post-training optimization framework that exploits sparsity to accelerate inference of vision encoders. SVE selectively applies sparsity in key layers, performs sparsity distillation to adapt models to sparse attention, and incorporates a lightweight predictor to eliminate redundant computations. Furthermore, we leverage cross-layer consistency in sparsity patterns, enabling efficient reuse of sparsity structures. Experiments on DINOv2, CLIP, and SAM2 show that SVE scales effectively for high-resolution encoding, delivering up to 80% speedup while preserving model performance. This demonstrates that SVE is a scalable and cost-effective solution for high-resolution visual representation. Our code is available at https://github.com/xuxalan/SVE.
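
To make the idea of sparse attention with cross-layer pattern reuse more concrete, the following is a minimal sketch in plain Python. It is not the authors' implementation (see the repository linked above for that). The abstract does not say how sparsity patterns are selected, so the per-query top-k rule used here is an assumption, and the names `sparse_attention`, `keep`, and `pattern` are hypothetical. Where the paper uses a lightweight predictor, this sketch simply computes dense scores once to pick a pattern, then reuses that pattern in a later call so that only `keep` keys are scored per query.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def score(q_vec, k_vec):
    """Scaled dot-product score between one query and one key."""
    return sum(a * b for a, b in zip(q_vec, k_vec)) / math.sqrt(len(q_vec))


def sparse_attention(q, k, v, keep, pattern=None):
    """Attention in which each query attends to only `keep` keys.

    q, k, v : lists of equal-length float vectors (one per token).
    pattern : optional list of key-index lists, one per query. If given,
              it is reused as-is and no dense score matrix is computed.
    Returns (outputs, pattern_used).
    """
    outputs, used = [], []
    for i, q_vec in enumerate(q):
        if pattern is None:
            # Assumption: choose the pattern by dense top-k. This step is
            # still quadratic; it stands in for a cheaper predictor.
            dense = [score(q_vec, k_vec) for k_vec in k]
            idx = sorted(range(len(k)), key=dense.__getitem__, reverse=True)[:keep]
            selected = [dense[j] for j in idx]
        else:
            # Reuse a pattern from an earlier layer: score `keep` keys only.
            idx = pattern[i]
            selected = [score(q_vec, k[j]) for j in idx]
        weights = softmax(selected)
        dim = len(v[0])
        out = [sum(w * v[j][c] for w, j in zip(weights, idx)) for c in range(dim)]
        outputs.append(out)
        used.append(idx)
    return outputs, used


if __name__ == "__main__":
    tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    # "Layer 1": derive a pattern; "layer 2": reuse it without dense scoring.
    out1, pat = sparse_attention(tokens, tokens, tokens, keep=2)
    out2, _ = sparse_attention(out1, out1, out1, keep=2, pattern=pat)
    print(pat)
    print(out2)
```

With `N` tokens, the reuse path scores `N * keep` query-key pairs instead of `N * N`, which is where the savings would come from at high resolution when `keep` is much smaller than `N`.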

Citation

@inproceedings{Xu_2025_BMVC,
author    = {Xin Xu and Jason Kuen and Brian L Price and Kangning Liu and Zijun Wei and Yu-Xiong Wang},
title     = {Leveraging Sparsity for Efficient Inference of High-Resolution Vision Foundation Models},
booktitle = {36th British Machine Vision Conference 2025, {BMVC} 2025, Sheffield, UK, November 24-27, 2025},
publisher = {BMVA},
year      = {2025},
url       = {https://bmva-archive.org.uk/bmvc/2025/assets/papers/Paper_661/paper.pdf}
}


Copyright © 2025 The British Machine Vision Association and Society for Pattern Recognition
The British Machine Vision Conference is organised by The British Machine Vision Association and Society for Pattern Recognition. The Association is a Company limited by guarantee, No.2543446, and a non-profit-making body, registered in England and Wales as Charity No.1002307 (Registered Office: Dept. of Computer Science, Durham University, South Road, Durham, DH1 3LE, UK).
