Category-level Text-to-Image Retrieval Improved: Bridging the Domain Gap with Diffusion Models and Vision Encoders


Faizan Farooq Khan (King Abdullah University of Science and Technology), Vladan Stojnić (Czech Technical University in Prague), Zakaria Laskar (Czech Technical University in Prague), Mohamed Elhoseiny (King Abdullah University of Science and Technology), Giorgos Tolias (Czech Technical University of Prague)
The 35th British Machine Vision Conference

Abstract

This work explores text-to-image retrieval for queries that specify or describe a semantic category. While vision-and-language models (VLMs) like CLIP offer a straightforward open-vocabulary solution, they map text and images to distant regions in the representation space, limiting retrieval performance. To bridge this modality gap, we propose a two-step approach. First, we transform the text query into a visual query using a generative diffusion model. Then, we estimate image-to-image similarity with a vision model. Additionally, we introduce an aggregation network that combines multiple generated images into a single vector representation and fuses similarity scores across both query modalities. Our approach leverages advancements in vision encoders, VLMs, and text-to-image generation models. Extensive evaluations show that it consistently outperforms retrieval methods relying solely on text queries. Source code is available at: https://github.com/faixan-khan/cletir

Citation

@inproceedings{Khan_2025_BMVC,
author    = {Faizan Farooq Khan and Vladan Stojnić and Zakaria Laskar and Mohamed Elhoseiny and Giorgos Tolias},
title     = {Category-level Text-to-Image Retrieval Improved: Bridging the Domain Gap with Diffusion Models and Vision Encoders},
booktitle = {36th British Machine Vision Conference 2025, {BMVC} 2025, Sheffield, UK, November 24-27, 2025},
publisher = {BMVA},
year      = {2025},
url       = {https://bmva-archive.org.uk/bmvc/2025/assets/papers/Paper_701/paper.pdf}
}


Copyright © 2025 The British Machine Vision Association and Society for Pattern Recognition
The British Machine Vision Conference is organised by The British Machine Vision Association and Society for Pattern Recognition. The Association is a Company limited by guarantee, No.2543446, and a non-profit-making body, registered in England and Wales as Charity No.1002307 (Registered Office: Dept. of Computer Science, Durham University, South Road, Durham, DH1 3LE, UK).

Imprint | Data Protection