CLIP meets DINO for Tuning Zero-Shot Classifier using Unlabeled Image Collections


Mohamed Fazli Mohamed Imam (Mohamed bin Zayed University of Artificial Intelligence), Rufael Fekadu Marew (Mohamed bin Zayed University of Artificial Intelligence), Jameel Hassan Abdul Samadh (Johns Hopkins University), Mustansar Fiaz (IBM Research), Alham Fikri Aji (Mohamed bin Zayed University of Artificial Intelligence), Hisham Cholakkal (Mohamed bin Zayed University of Artificial Intelligence)
The 36th British Machine Vision Conference

Abstract

In the era of foundation models, CLIP has emerged as a powerful tool for aligning text and visual modalities into a common embedding space. However, the alignment objective used to train CLIP often results in subpar visual features for fine-grained tasks. In contrast, models pretrained in a self-supervised manner, such as DINO, excel at extracting rich visual features due to their specialized training paradigm. Yet, these self-supervised learning (SSL) models require an additional supervised linear probing step, which relies on fully labeled data that is often expensive and difficult to obtain at scale. In this paper, we propose a label-free prompt-tuning method that leverages the rich visual features extracted by the DINO SSL model and the broad contextual knowledge of large language models (LLMs) to enhance CLIP’s image classification performance using purely unlabeled images. Our approach unfolds in three key steps: (i) We generate robust textual feature embeddings that more accurately represent object classes by leveraging class-specific descriptions from LLMs. (ii) The textual embeddings are then used to produce pseudo-labels to train an alignment module that integrates the complementary strengths of LLM description-based textual embeddings and visual features extracted from DINO. (iii) Finally, we prompt-tune CLIP’s vision encoder using the trained alignment module. This three-step process allows us to harness the best of visual and textual foundation models, resulting in a powerful and efficient approach that surpasses state-of-the-art (SOTA) label-free classification methods. Notably, our framework, NoLA (No Labels Attached), achieves an average absolute gain of 3.6% over the previous state of the art, LaFTer, across 11 diverse image classification datasets. Our code and models will be made publicly available.
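
To make the three-step recipe concrete, below is a minimal PyTorch sketch of the pipeline as described in the abstract. It is an illustrative, assumption-laden sketch rather than the authors' implementation: the linear stand-in encoders, random placeholder text embeddings, toy dimensions, loss choices, and the additive input-space visual prompt are all hypothetical; in the actual method these would be the frozen CLIP and DINO backbones, CLIP-encoded LLM class descriptions, and CLIP's prompt-tuning mechanism.

import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
num_classes, clip_dim, dino_dim, input_dim = 10, 512, 384, 3 * 32 * 32

# Stand-ins for the frozen CLIP image encoder and DINO backbone (toy linear encoders).
clip_image_encoder = nn.Linear(input_dim, clip_dim)
dino_encoder = nn.Linear(input_dim, dino_dim)
for p in list(clip_image_encoder.parameters()) + list(dino_encoder.parameters()):
    p.requires_grad_(False)

# Step (i): class text embeddings, e.g. CLIP-encoded LLM class descriptions averaged
# per class (random vectors here as a placeholder).
text_embeddings = F.normalize(torch.randn(num_classes, clip_dim), dim=-1)

# Step (ii): pseudo-label the unlabeled images with the text embeddings, then train an
# alignment module that maps DINO visual features to those pseudo-labels.
unlabeled_images = torch.randn(32, input_dim)  # a toy unlabeled batch
with torch.no_grad():
    clip_feats = F.normalize(clip_image_encoder(unlabeled_images), dim=-1)
    pseudo_labels = (clip_feats @ text_embeddings.T).argmax(dim=-1)
    dino_feats = dino_encoder(unlabeled_images)

alignment = nn.Linear(dino_dim, num_classes)
align_opt = torch.optim.Adam(alignment.parameters(), lr=1e-3)
for _ in range(100):
    loss = F.cross_entropy(alignment(dino_feats), pseudo_labels)
    align_opt.zero_grad()
    loss.backward()
    align_opt.step()

# Step (iii): prompt-tune CLIP's vision side, modelled here as a learnable additive
# input prompt, with the frozen alignment module acting as the teacher.
visual_prompt = nn.Parameter(torch.zeros(1, input_dim))
prompt_opt = torch.optim.Adam([visual_prompt], lr=1e-2)
for _ in range(100):
    with torch.no_grad():
        targets = alignment(dino_feats).softmax(dim=-1)
    prompted_feats = F.normalize(clip_image_encoder(unlabeled_images + visual_prompt), dim=-1)
    logits = prompted_feats @ text_embeddings.T / 0.07  # CLIP-style temperature
    loss = F.kl_div(F.log_softmax(logits, dim=-1), targets, reduction="batchmean")
    prompt_opt.zero_grad()
    loss.backward()
    prompt_opt.step()

print("pseudo-label distribution:", pseudo_labels.bincount(minlength=num_classes).tolist())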

Citation

@inproceedings{Imam_2025_BMVC,
author    = {Mohamed Fazli Mohamed Imam and Rufael Fekadu Marew and Jameel Hassan Abdul Samadh and Mustansar Fiaz and Alham Fikri Aji and Hisham Cholakkal},
title     = {CLIP meets DINO for Tuning Zero-Shot Classifier using Unlabeled Image Collections},
booktitle = {36th British Machine Vision Conference 2025, {BMVC} 2025, Sheffield, UK, November 24-27, 2025},
publisher = {BMVA},
year      = {2025},
url       = {https://bmva-archive.org.uk/bmvc/2025/assets/papers/Paper_281/paper.pdf}
}


Copyright © 2025 The British Machine Vision Association and Society for Pattern Recognition
