Toward Robust Audio-Visual Synchronization Detection in Egocentric Video with Sparse Synchronization Events


Jordan Voas (The University of Texas at Austin), Wei-Cheng Tseng (The University of Texas at Austin), Benoit Vallade (Amazon Prime Video), Alex Mackin (Amazon Prime Video), David Higham (Amazon Prime Video), David Harwath (The University of Texas at Austin)
The 36th British Machine Vision Conference

Abstract

Audio-Visual Synchronization Detection (AVS) is a core task in multimodal video quality analysis, yet most existing methods are developed and evaluated on domains with little diversity in sparse events or with predominantly dense, repetitive cues, such as talking heads or scripted broadcasts, which restricts their generalization to many real-world scenarios. We present the first comprehensive study of AVS in the challenging domain of egocentric video, using the Ego4D dataset as a benchmark. Our study is motivated by the growing use of head-mounted and body-worn cameras in live streaming, augmented reality, law enforcement, and sports; this domain presents unique challenges: sparse, heterogeneous synchronization events, unstable viewpoints, and minimal access to dense anchors such as visible faces. Our findings reveal sharp performance drops for existing AVS models on egocentric content. In response, we introduce $\textit{AS-Synchformer}$, a novel streaming AVS model tailored to sparse, unconstrained video. $\textit{AS-Synchformer}$ incorporates three key innovations: (1) a history-aware streaming token selection strategy, (2) a contrastive alignment loss that enforces temporal correspondence for the selected streaming tokens, and (3) an Earth Mover’s Distance (EMD) loss that captures the ordinal structure of offsets in the AVS task. These yield substantial gains, including a 3.55% boost in ACC@1 and a 22.3% EMD reduction over strong streaming baselines such as APA Synchformer, and a 2.41% ACC@1 gain with a 21.6% EMD reduction over Synchformer in snapshot AVS, setting a new state of the art in both paradigms. We further isolate the impact of full encoder fine-tuning on our model through an ablation study. Our analysis highlights the critical role of encoder fine-tuning in achieving robust AVS under real-world egocentric conditions, and we release the first large-scale AVS systems trained end to end.
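
The two losses named above follow standard formulations, so a minimal sketch may help make the abstract concrete. The snippet below is an illustrative PyTorch implementation, not the paper's released code: it assumes the offset head produces logits over ordered offset bins (for the EMD loss) and that the selected audio and visual streaming tokens are projected into a shared embedding space (for the contrastive alignment loss); the function names, temperature value, and tensor shapes are assumptions made for illustration.

import torch
import torch.nn.functional as F

def emd_offset_loss(offset_logits, target_bin, num_bins):
    # Squared Earth Mover's Distance between the predicted offset
    # distribution and a one-hot target over ordered offset bins.
    # Unlike plain cross-entropy, probability mass placed far from
    # the true offset is penalized more than mass in nearby bins,
    # which reflects the ordinal structure of the offsets.
    probs = offset_logits.softmax(dim=-1)                 # (B, num_bins)
    target = F.one_hot(target_bin, num_bins).float()      # (B, num_bins)
    cdf_pred = probs.cumsum(dim=-1)
    cdf_target = target.cumsum(dim=-1)
    return ((cdf_pred - cdf_target) ** 2).sum(dim=-1).mean()

def token_alignment_loss(audio_tokens, visual_tokens, temperature=0.07):
    # InfoNCE-style contrastive loss that pulls together audio and
    # visual streaming tokens from the same temporal position and
    # pushes apart tokens from different positions, in both directions.
    a = F.normalize(audio_tokens, dim=-1)   # (T, D)
    v = F.normalize(visual_tokens, dim=-1)  # (T, D)
    sim = a @ v.t() / temperature           # (T, T) similarity matrix
    labels = torch.arange(sim.size(0), device=sim.device)
    return 0.5 * (F.cross_entropy(sim, labels) + F.cross_entropy(sim.t(), labels))
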

Citation

@inproceedings{Voas_2025_BMVC,
author    = {Jordan Voas and Wei-Cheng Tseng and Benoit Vallade and Alex Mackin and David Higham and David Harwath},
title     = {Toward Robust Audio-Visual Synchronization Detection in Egocentric Video with Sparse Synchronization Events},
booktitle = {36th British Machine Vision Conference 2025, {BMVC} 2025, Sheffield, UK, November 24-27, 2025},
publisher = {BMVA},
year      = {2025},
url       = {https://bmva-archive.org.uk/bmvc/2025/assets/papers/Paper_903/paper.pdf}
}


