Tracking Meets Large Multimodal Models for Driving Scenario Understanding


Ayesha Ishaq (Mohamed bin Zayed University of Artificial Intelligence), Jean Lahoud (Mohamed bin Zayed University of Artificial Intelligence), Fahad Shahbaz Khan (Mohamed bin Zayed University of Artificial Intelligence), Salman Khan (Mohamed bin Zayed University of Artificial Intelligence), Hisham Cholakkal (Mohamed bin Zayed University of Artificial Intelligence), Rao Muhammad Anwer (Mohamed bin Zayed University of Artificial Intelligence)
The 36th British Machine Vision Conference

Abstract

Large Multimodal Models (LMMs) have recently gained prominence in autonomous driving research, showcasing promising capabilities across various emerging benchmarks. LMMs specifically designed for this domain have demonstrated effective perception, planning, and prediction skills. However, many of these methods underutilize 3D spatial and temporal elements, relying mainly on image data. As a result, their effectiveness in dynamic driving environments is limited. We propose to integrate tracking information as an additional input to recover 3D spatial and temporal details that are not effectively captured in the images. We introduce a novel approach for embedding this tracking information into LMMs to enhance their spatiotemporal understanding of driving scenarios. By incorporating 3D tracking data through a track encoder, we enrich visual queries with crucial spatial and temporal cues while avoiding the computational overhead associated with processing lengthy video sequences or extensive 3D inputs. Moreover, we employ a self-supervised approach to pretrain the tracking encoder to provide LMMs with additional contextual information, significantly improving their performance in perception, planning, and prediction tasks for autonomous driving. Experimental results demonstrate the effectiveness of our approach, with a gain of 9.5% in accuracy, an increase of 7.04 points in the ChatGPT score, and a 9.4% increase in the overall score over baseline models on the DriveLM-nuScenes benchmark, along with a 3.7% final-score improvement on DriveLM-CARLA. Our code is available at https://github.com/mbzuai-oryx/TrackingMeetsLMM
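To make the idea concrete, the sketch below shows one plausible way a track encoder could turn per-object 3D track histories into tokens that are appended to an LMM's visual queries, as described in the abstract. This is an illustrative sketch only, not the authors' implementation: the per-timestep state layout (position, box size, yaw, velocity), embedding sizes, the temporal transformer, and fusion by token concatenation are all assumptions; the self-supervised pretraining of the encoder is not shown.

```python
# Hedged sketch: encoding 3D object tracks into tokens for an LMM.
# Shapes, feature choices, and module names are illustrative assumptions.
import torch
import torch.nn as nn


class TrackEncoder(nn.Module):
    """Encodes per-object 3D track histories into one embedding per object."""

    def __init__(self, state_dim: int = 9, embed_dim: int = 256, num_layers: int = 2):
        super().__init__()
        # Per-timestep track state, e.g. (x, y, z, w, l, h, yaw, vx, vy) -- assumed layout.
        self.state_proj = nn.Linear(state_dim, embed_dim)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=8, batch_first=True
        )
        # Temporal self-attention over each object's trajectory.
        self.temporal = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)

    def forward(self, tracks: torch.Tensor) -> torch.Tensor:
        # tracks: (batch, num_objects, num_timesteps, state_dim)
        b, n, t, d = tracks.shape
        x = self.state_proj(tracks.view(b * n, t, d))
        x = self.temporal(x)                  # (b*n, t, embed_dim)
        return x.mean(dim=1).view(b, n, -1)   # one token per tracked object


class TrackAugmentedQueries(nn.Module):
    """Appends projected track tokens to the visual query tokens fed to the LMM."""

    def __init__(self, embed_dim: int = 256, lmm_dim: int = 4096):
        super().__init__()
        self.track_encoder = TrackEncoder(embed_dim=embed_dim)
        self.to_lmm = nn.Linear(embed_dim, lmm_dim)  # project into the LMM token space

    def forward(self, visual_queries: torch.Tensor, tracks: torch.Tensor) -> torch.Tensor:
        # visual_queries: (batch, num_queries, lmm_dim) from the image encoder/projector
        track_tokens = self.to_lmm(self.track_encoder(tracks))
        return torch.cat([visual_queries, track_tokens], dim=1)


if __name__ == "__main__":
    fuse = TrackAugmentedQueries()
    vq = torch.randn(2, 32, 4096)   # 32 visual query tokens (assumed)
    trk = torch.randn(2, 10, 6, 9)  # 10 objects tracked over 6 frames (assumed)
    print(fuse(vq, trk).shape)      # torch.Size([2, 42, 4096])
```

Concatenating a handful of track tokens keeps the LMM input short, which reflects the abstract's point that tracking supplies spatiotemporal cues without the cost of processing long video sequences or dense 3D inputs.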

Citation

@inproceedings{Ishaq_2025_BMVC,
author    = {Ayesha Ishaq and Jean Lahoud and Fahad Shahbaz Khan and Salman Khan and Hisham Cholakkal and Rao Muhammad Anwer},
title     = {Tracking Meets Large Multimodal Models for Driving Scenario Understanding},
booktitle = {36th British Machine Vision Conference 2025, {BMVC} 2025, Sheffield, UK, November 24-27, 2025},
publisher = {BMVA},
year      = {2025},
url       = {https://bmva-archive.org.uk/bmvc/2025/assets/papers/Paper_948/paper.pdf}
}


Copyright © 2025 The British Machine Vision Association and Society for Pattern Recognition
