Beyond Gloss: A Hand-Centric Framework for Gloss-Free Sign Language Translation

Sobhan Asasi (University of Surrey), Mohamed Ilyes Lakhal (University of Surrey), Ozge Mercanoglu Sincan (University of Surrey), Richard Bowden (University of Surrey)

The 36th British Machine Vision Conference

Abstract

Sign Language Translation (SLT) is a challenging task that requires bridging the modality gap between visual and linguistic information while capturing subtle variations in hand shapes and movements. To address these challenges, we introduce BeyondGloss, a novel gloss-free SLT framework that leverages the spatio-temporal reasoning capabilities of Video Large Language Models (VideoLLMs). Since existing VideoLLMs struggle to model long videos in detail, we propose a novel approach to generate fine-grained, temporally-aware textual descriptions of hand motion. A contrastive alignment module aligns these descriptions with video features during pre-training, encouraging the model to focus on hand-centric temporal dynamics and distinguish signs more effectively. To further enrich hand-specific representations, we distill fine-grained features from HaMeR. Additionally, we apply a contrastive loss between sign video representations and target language embeddings to reduce the modality gap during pre-training. BeyondGloss achieves state-of-the-art performance on the Phoenix14T and CSL-Daily benchmarks, demonstrating the effectiveness of the proposed framework. We will release the code upon acceptance of the paper.
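
The abstract does not give the exact form of the contrastive objectives, so the following is only a minimal sketch of one common choice: a symmetric, CLIP-style InfoNCE loss over a batch of paired video and text embeddings. The function names, the temperature value of 0.07, and the use of NumPy are all assumptions made for illustration and are not taken from the paper.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    # Scale each row to unit length so dot products become cosine similarities.
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def log_softmax(logits, axis):
    # Numerically stable log-softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def symmetric_contrastive_loss(video_emb, text_emb, temperature=0.07):
    # Hypothetical helper: row i of video_emb is assumed to be paired
    # with row i of text_emb; all other rows in the batch act as negatives.
    v = l2_normalize(video_emb)
    t = l2_normalize(text_emb)
    logits = v @ t.T / temperature          # (batch, batch) similarity matrix
    idx = np.arange(len(v))
    loss_v2t = -log_softmax(logits, axis=1)[idx, idx].mean()  # video -> text
    loss_t2v = -log_softmax(logits, axis=0)[idx, idx].mean()  # text -> video
    return 0.5 * (loss_v2t + loss_t2v)

# Toy usage: text embeddings are noisy copies of the video embeddings.
rng = np.random.default_rng(0)
video = rng.normal(size=(8, 16))
text = video + 0.01 * rng.normal(size=(8, 16))
print(symmetric_contrastive_loss(video, text))
```

Under these assumptions, correctly paired embeddings yield a loss close to zero, while mismatched pairs yield a much larger value, which is the pressure that pulls the two modalities together during pre-training.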

Citation

@inproceedings{Asasi_2025_BMVC,
author    = {Sobhan Asasi and Mohamed Ilyes Lakhal and Ozge Mercanoglu Sincan and Richard Bowden},
title     = {Beyond Gloss: A Hand-Centric Framework for Gloss-Free Sign Language Translation},
booktitle = {36th British Machine Vision Conference 2025, {BMVC} 2025, Sheffield, UK, November 24-27, 2025},
publisher = {BMVA},
year      = {2025},
url       = {https://bmva-archive.org.uk/bmvc/2025/assets/papers/Paper_626/paper.pdf}
}

Copyright © 2025 The British Machine Vision Association and Society for Pattern Recognition
The British Machine Vision Conference is organised by The British Machine Vision Association and Society for Pattern Recognition. The Association is a Company limited by guarantee, No.2543446, and a non-profit-making body, registered in England and Wales as Charity No.1002307 (Registered Office: Dept. of Computer Science, Durham University, South Road, Durham, DH1 3LE, UK).
