SPARTAN: Spatiotemporal Pose-Aware Retrieval for Text-guided Autonomous Navigation


Xiangyu Bai (Northeastern University), Sai Anish Sreeramagiri (Northeastern University), Sai Siddhartha Vivek Dhir Rangoju (Northeastern University), Bishoy Galoaa (Northeastern University), Eric C Mortin (DEVCOM Analysis Center (DAC), US Army), Sarah Ostadabbas (Northeastern University)
The 36th British Machine Vision Conference

Abstract

Generative AI, particularly video diffusion models, offers a scalable alternative to traditional data collection for autonomous navigation. However, current models are not optimized for navigation-specific tasks, which often results in inaccurate physics modeling and scene dynamics. We present SPARTAN (Spatiotemporal Pose-Aware Retrieval for Text-guided Autonomous Navigation), an open-source framework that delivers significant advances to video diffusion components. At its core, SPARTAN features a novel spatiotemporal encoder that converts per-frame camera pose data into continuous spatiotemporal feature embeddings, enhancing representation and modeling efficiency. We also propose a camera pose-conditioned training pipeline and loss function that tightly integrate spatiotemporal features with text annotations to support more accurate retrieval and generation. In addition, we present DrivingScenePTX, a comprehensive driving video dataset that includes both frame-wise camera poses and rich textual scene descriptions. We benchmark SPARTAN against state-of-the-art contrastive language-image pretraining (CLIP) models on standard retrieval tasks and introduce a novel evaluation method, inspired by visual simultaneous localization and mapping (vSLAM), to assess performance in cross-domain trajectory retrieval. Our results demonstrate SPARTAN’s superior ability to retrieve driving videos with high spatial and temporal accuracy, offering a critical step forward in adapting generative AI for autonomous navigation.
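
To make the retrieval setup described in the abstract concrete, the following is a minimal sketch of how pose-derived embeddings and text embeddings could be aligned and scored. It assumes a CLIP-style symmetric contrastive (InfoNCE) objective and Recall@K evaluation; the abstract does not specify SPARTAN's actual loss or encoder, so the function names, the temperature value, and the toy two-dimensional embeddings are all hypothetical and used only for illustration.

```python
import math

def normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def similarity_matrix(pose_embs, text_embs):
    """sim[i][j] = cosine similarity between trajectory i and caption j."""
    P = [normalize(p) for p in pose_embs]
    T = [normalize(t) for t in text_embs]
    return [[sum(a * b for a, b in zip(p, t)) for t in T] for p in P]

def cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for a numerically stable log-sum-exp
        lse = m + math.log(sum(math.exp(x - m) for x in row))
        total += lse - row[i]
    return total / len(logits)

def symmetric_contrastive_loss(pose_embs, text_embs, temperature=0.07):
    """CLIP-style loss (an assumption, not the paper's loss): average of the
    pose-to-text and text-to-pose cross-entropies over a batch of matched pairs."""
    sim = similarity_matrix(pose_embs, text_embs)
    logits = [[s / temperature for s in row] for row in sim]
    logits_t = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits_t))

def recall_at_k(pose_embs, text_embs, k=1):
    """Text-to-trajectory retrieval: fraction of captions whose matching
    trajectory (same index) appears among the top-k most similar trajectories."""
    sim = similarity_matrix(pose_embs, text_embs)
    hits = 0
    for j in range(len(text_embs)):
        scores = [sim[i][j] for i in range(len(pose_embs))]
        ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        hits += j in ranked[:k]
    return hits / len(text_embs)

# Toy data (hypothetical): three trajectory embeddings and their captions.
pose = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
text_aligned = [[0.9, 0.1], [0.1, 0.9], [-0.9, 0.1]]
text_shuffled = [text_aligned[1], text_aligned[2], text_aligned[0]]

loss_aligned = symmetric_contrastive_loss(pose, text_aligned)
loss_shuffled = symmetric_contrastive_loss(pose, text_shuffled)
r1_aligned = recall_at_k(pose, text_aligned, k=1)
r1_shuffled = recall_at_k(pose, text_shuffled, k=1)
```

In this toy setting, correctly paired captions give a lower loss and a higher Recall@1 than shuffled captions, which is the behavior a contrastive retrieval objective is meant to reward. A real system would replace the hand-written vectors with the outputs of a learned pose encoder and a text encoder.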

Citation

@inproceedings{Bai_2025_BMVC,
author    = {Xiangyu Bai and Sai Anish Sreeramagiri and Sai Siddhartha Vivek Dhir Rangoju and Bishoy Galoaa and Eric C Mortin and Sarah Ostadabbas},
title     = {SPARTAN: Spatiotemporal Pose-Aware Retrieval for Text-guided Autonomous Navigation},
booktitle = {36th British Machine Vision Conference 2025, {BMVC} 2025, Sheffield, UK, November 24-27, 2025},
publisher = {BMVA},
year      = {2025},
url       = {https://bmva-archive.org.uk/bmvc/2025/assets/papers/Paper_29/paper.pdf}
}


Copyright © 2025 The British Machine Vision Association and Society for Pattern Recognition
The British Machine Vision Conference is organised by The British Machine Vision Association and Society for Pattern Recognition. The Association is a Company limited by guarantee, No.2543446, and a non-profit-making body, registered in England and Wales as Charity No.1002307 (Registered Office: Dept. of Computer Science, Durham University, South Road, Durham, DH1 3LE, UK).
