Image Recognition with Vision and Language Embeddings of VLMs


Illia Volkov (Czech Technical University in Prague), Nikita Kisel (Czech Technical University in Prague), Klara Janouskova (Czech Technical University in Prague), Jiri Matas (Czech Technical University in Prague)
The 36th British Machine Vision Conference

Abstract

Vision-language models (VLMs) have enabled strong zero-shot classification through image–text alignment. Yet, their purely visual inference capabilities remain under-explored. In this work, we conduct a comprehensive evaluation of both language-guided and vision-only image classification with a diverse set of dual-encoder VLMs, including both well-established and recent models such as SigLIP 2 and RADIOv2.5. The performance is compared in a standard setup on the ImageNet-1k validation set and its label-corrected variant. The key factors affecting accuracy are analysed, including prompt design, class diversity, the number of neighbours in $k$-NN, and reference set size. We show that language and vision offer complementary strengths, with some classes favouring textual prompts and others better handled by visual similarity. To exploit this complementarity, we introduce a simple, learning-free fusion method based on per-class precision that improves classification performance. The code is available at: https://github.com/gonikisgo/bmvc2025-vlm-image-recognition.
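
The abstract names three ingredients: language-guided zero-shot classification, vision-only $k$-NN classification, and a learning-free fusion based on per-class precision. The abstract does not specify the fusion rule, so the following is only a minimal sketch of one plausible reading, not the authors' implementation (see the linked repository for that). It assumes embeddings are already extracted as NumPy arrays, that per-class precision is estimated on a held-out labelled set, and that fusion picks, per image, whichever prediction has the higher estimated precision. All function names here are hypothetical.

```python
import numpy as np


def l2_normalize(x):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def zero_shot_predict(image_emb, text_emb):
    """Language-guided: assign the class whose prompt embedding is most similar."""
    sims = l2_normalize(image_emb) @ l2_normalize(text_emb).T
    return sims.argmax(axis=1)


def knn_predict(query_emb, ref_emb, ref_labels, num_classes, k=3):
    """Vision-only: majority vote over the k most similar reference images."""
    sims = l2_normalize(query_emb) @ l2_normalize(ref_emb).T
    top_k = np.argsort(-sims, axis=1)[:, :k]   # indices of the k nearest references
    votes = ref_labels[top_k]                  # shape (n_queries, k)
    counts = np.stack([np.bincount(v, minlength=num_classes) for v in votes])
    return counts.argmax(axis=1)


def per_class_precision(preds, labels, num_classes):
    """Fraction of predictions of class c that are correct (0 if c is never predicted)."""
    precision = np.zeros(num_classes)
    for c in range(num_classes):
        predicted_c = preds == c
        if predicted_c.any():
            precision[c] = (labels[predicted_c] == c).mean()
    return precision


def fuse(text_preds, vis_preds, text_prec, vis_prec):
    """Assumed rule: per image, keep the prediction whose class has higher precision."""
    use_text = text_prec[text_preds] >= vis_prec[vis_preds]
    return np.where(use_text, text_preds, vis_preds)
```

Under this reading, a class on which textual prompts are reliable is decided by the text branch, and a class better handled by visual similarity is decided by the $k$-NN branch, with no additional training.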

Citation

@inproceedings{Volkov_2025_BMVC,
author    = {Illia Volkov and Nikita Kisel and Klara Janouskova and Jiri Matas},
title     = {Image Recognition with Vision and Language Embeddings of VLMs},
booktitle = {36th British Machine Vision Conference 2025, {BMVC} 2025, Sheffield, UK, November 24-27, 2025},
publisher = {BMVA},
year      = {2025},
url       = {https://bmva-archive.org.uk/bmvc/2025/assets/papers/Paper_866/paper.pdf}
}


Copyright © 2025 The British Machine Vision Association and Society for Pattern Recognition
The British Machine Vision Conference is organised by The British Machine Vision Association and Society for Pattern Recognition. The Association is a Company limited by guarantee, No.2543446, and a non-profit-making body, registered in England and Wales as Charity No.1002307 (Registered Office: Dept. of Computer Science, Durham University, South Road, Durham, DH1 3LE, UK).
