From Open Vocabulary to Open World: Teaching Vision Language Models to Detect Novel Objects

Zizhao Li (University of Melbourne), Zhengkang Xiang (University of Melbourne), Joseph West (University of Melbourne), Kourosh Khoshelham (University of Melbourne)

The 36th British Machine Vision Conference


Abstract

Traditional object detection methods operate under the closed-set assumption, where models can only detect a fixed number of objects predefined in the training set. Recent works on open vocabulary object detection (OVD) enable the detection of objects defined by an in-principle unbounded vocabulary, which reduces the cost of training models for specific tasks. However, OVD relies heavily on accurate prompts provided by an "oracle", which limits its use in critical applications such as driving scene perception. OVD models tend to misclassify near-out-of-distribution (NOOD) objects that have similar features to known classes, and to ignore far-out-of-distribution (FOOD) objects. To address these limitations, we propose a framework that enables OVD models to operate in open world settings by identifying and incrementally learning previously unseen objects. To detect FOOD objects, we propose Open World Embedding Learning (OWEL) and introduce the concept of Pseudo Unknown Embedding, which infers the location of unknown classes in a continuous semantic space based on the information of known classes. We also propose Multi-Scale Contrastive Anchor Learning (MSCAL), which enables the identification of misclassified unknown objects by promoting the intra-class consistency of object embeddings at different scales. The proposed method achieves state-of-the-art performance on standard open world object detection and autonomous driving benchmarks while maintaining its open vocabulary object detection capability.

Citation

@inproceedings{Li_2025_BMVC,
author    = {Zizhao Li and Zhengkang Xiang and Joseph West and Kourosh Khoshelham},
title     = {From Open Vocabulary to Open World: Teaching Vision Language Models to Detect Novel Objects},
booktitle = {36th British Machine Vision Conference 2025, {BMVC} 2025, Sheffield, UK, November 24-27, 2025},
publisher = {BMVA},
year      = {2025},
url       = {https://bmva-archive.org.uk/bmvc/2025/assets/papers/Paper_717/paper.pdf}
}

Copyright © 2025 The British Machine Vision Association and Society for Pattern Recognition
The British Machine Vision Conference is organised by The British Machine Vision Association and Society for Pattern Recognition. The Association is a Company limited by guarantee, No.2543446, and a non-profit-making body, registered in England and Wales as Charity No.1002307 (Registered Office: Dept. of Computer Science, Durham University, South Road, Durham, DH1 3LE, UK).

