TAG: A Simple Yet Effective Temporal-Aware Approach for Zero-Shot Video Temporal Grounding


Jin-Seop Lee (Sungkyunkwan University), SungJoon Lee (Sungkyunkwan University), Jaehan Ahn (Sungkyunkwan University), YunSeok Choi (Sungkyunkwan University), Jee-Hyong Lee (Sungkyunkwan University)
The 36th British Machine Vision Conference

Abstract

Video Temporal Grounding (VTG) aims to extract relevant video segments based on a given natural language query. Recently, zero-shot VTG methods have gained attention by leveraging pretrained vision-language models (VLMs) to localize target moments without additional training. However, existing approaches suffer from semantic fragmentation, where temporally continuous frames sharing the same semantics are split across multiple segments. When segments are fragmented, it becomes difficult to predict an accurate target moment that aligns with the text query. These approaches also rely on skewed similarity distributions for localization, which makes it difficult to select the optimal segment. Furthermore, they depend heavily on large language models (LLMs), which require expensive inference. To address these limitations, we propose TAG, a simple yet effective Temporal-Aware approach for zero-shot video temporal Grounding, which incorporates temporal pooling, temporal coherence clustering, and similarity adjustment. Our method effectively captures the temporal context of videos and corrects distorted similarity distributions without training. It achieves state-of-the-art results on the Charades-STA and ActivityNet Captions benchmark datasets without relying on LLMs.
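The abstract names three training-free components. The following is a minimal, hypothetical illustration of how such components could fit together, not the authors' implementation. It assumes per-frame VLM embeddings and a query embedding are already computed. A centered moving average stands in for "temporal pooling", splitting the video wherever adjacent pooled frames fall below a cosine-similarity threshold stands in for "temporal coherence clustering", and z-scoring the frame-query similarities is a placeholder for "similarity adjustment". All function names and the `window` and `threshold` values are assumptions.

```python
import math
from typing import List, Tuple

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def temporal_pool(frames: List[Vector], window: int = 3) -> List[Vector]:
    """Hypothetical temporal pooling: centered moving average, truncated at the edges."""
    half = window // 2
    n, dim = len(frames), len(frames[0])
    pooled = []
    for t in range(n):
        lo, hi = max(0, t - half), min(n, t + half + 1)
        pooled.append([sum(frames[i][d] for i in range(lo, hi)) / (hi - lo) for d in range(dim)])
    return pooled


def coherence_segments(frames: List[Vector], threshold: float = 0.85) -> List[Tuple[int, int]]:
    """Hypothetical coherence clustering: start a new segment whenever adjacent
    frames drop below the cosine threshold. Returns inclusive (start, end) pairs."""
    segments, start = [], 0
    for t in range(1, len(frames)):
        if cosine(frames[t - 1], frames[t]) < threshold:
            segments.append((start, t - 1))
            start = t
    segments.append((start, len(frames) - 1))
    return segments


def standardize(scores: List[float]) -> List[float]:
    """Placeholder similarity adjustment: z-score (order-preserving affine rescaling)."""
    mean = sum(scores) / len(scores)
    std = math.sqrt(sum((s - mean) ** 2 for s in scores) / len(scores))
    return [(s - mean) / std if std > 0 else 0.0 for s in scores]


def ground(frames: List[Vector], query: Vector,
           window: int = 3, threshold: float = 0.85) -> Tuple[int, int]:
    """Return the (start, end) frame span whose mean adjusted similarity is highest."""
    pooled = temporal_pool(frames, window)
    sims = standardize([cosine(f, query) for f in pooled])
    segments = coherence_segments(pooled, threshold)
    return max(segments, key=lambda se: sum(sims[se[0]:se[1] + 1]) / (se[1] - se[0] + 1))


if __name__ == "__main__":
    # Toy video: three frames of one "scene", then three of another.
    frames = [[1.0, 0.0]] * 3 + [[0.0, 1.0]] * 3
    print(ground(frames, [0.0, 1.0]))  # (3, 5)
```

In the toy example, pooling smooths the boundary, and the coherence split keeps frames 3 to 5 together as one segment rather than fragmenting them. Because a z-score is an order-preserving rescaling, it does not change which segment wins for a single query here; it only makes a fixed score threshold comparable across videos. The actual adjustment described in the paper may differ.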

Citation

@inproceedings{Lee_2025_BMVC,
author    = {Jin-Seop Lee and SungJoon Lee and Jaehan Ahn and YunSeok Choi and Jee-Hyong Lee},
title     = {TAG: A Simple Yet Effective Temporal-Aware Approach for Zero-Shot Video Temporal Grounding},
booktitle = {36th British Machine Vision Conference 2025, {BMVC} 2025, Sheffield, UK, November 24-27, 2025},
publisher = {BMVA},
year      = {2025},
url       = {https://bmva-archive.org.uk/bmvc/2025/assets/papers/Paper_786/paper.pdf}
}


Copyright © 2025 The British Machine Vision Association and Society for Pattern Recognition
