Size-aware Contrastive Imitation Learning for Language-conditioned Multi-task Robotic Manipulation


Jiakai Huang (South China Normal University), Weiping Zheng (South China Normal University)
The 36th British Machine Vision Conference

Abstract

Translating high-level linguistic instructions into geometrically consistent manipulation actions is a critical challenge in robotics, especially when dealing with objects of diverse geometric properties, such as size and shape. This task demands both linguistic comprehension and precise geometric reasoning from the robot agent. Previous approaches have primarily focused on enhancing visual precision and integrating language embeddings, often overlooking the alignment of subtle geometric features with linguistic representations. In this paper, we propose Size-Aware Contrastive Imitation Learning (SACIL), a novel framework that addresses this gap through two key components: Image-Text Contrast and Current-Goal Contrast. These components ensure the alignment of language embeddings with geometric features and maintain temporal consistency in size-aware reasoning across multi-step tasks. Additionally, we introduce a set of size-aware query tokens to effectively aggregate geometric features. Our experimental results demonstrate that SACIL significantly outperforms state-of-the-art methods, highlighting its potential to enhance size-sensitive reasoning and advance language-conditioned robotic manipulation.
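
For readers unfamiliar with contrastive objectives, the following is a minimal sketch of a generic symmetric InfoNCE loss of the kind commonly used to align two sets of paired embeddings. It is not taken from the paper: the abstract does not specify the loss form, and the function names, the temperature value of 0.07, and the choice of cosine similarity are all assumptions made for illustration. Under those assumptions, the same function could in principle be applied to (size-aware feature, language embedding) pairs for an image-text contrast and to (current observation, goal observation) pairs for a current-goal contrast.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE over paired embeddings a[i] <-> b[i].

    Hypothetical helper, not the paper's implementation.
    a, b: equal-length lists of equal-dimension float vectors.
    """
    a = [l2_normalize(v) for v in a]
    b = [l2_normalize(v) for v in b]
    # Cosine similarity matrix scaled by the temperature.
    logits = [[sum(x * y for x, y in zip(u, v)) / temperature for v in b]
              for u in a]
    logits_t = [list(col) for col in zip(*logits)]
    # Average the a->b and b->a directions.
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits_t))


# Toy check: correctly paired embeddings should score a lower loss
# than the same embeddings with the pairing shuffled.
feats = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = symmetric_info_nce(feats, [[2.0, 0.0], [0.0, 3.0]])
shuffled_loss = symmetric_info_nce(feats, [[0.0, 3.0], [2.0, 0.0]])
```

The loss pulls each matched pair together and pushes it away from every other item in the batch, which is why the shuffled pairing is penalised heavily in the toy check.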

Citation

@inproceedings{Huang_2025_BMVC,
author    = {Jiakai Huang and Weiping Zheng},
title     = {Size-aware Contrastive Imitation Learning for Language-conditioned Multi-task Robotic Manipulation},
booktitle = {36th British Machine Vision Conference 2025, {BMVC} 2025, Sheffield, UK, November 24-27, 2025},
publisher = {BMVA},
year      = {2025},
url       = {https://bmva-archive.org.uk/bmvc/2025/assets/papers/Paper_716/paper.pdf}
}


Copyright © 2025 The British Machine Vision Association and Society for Pattern Recognition
The British Machine Vision Conference is organised by The British Machine Vision Association and Society for Pattern Recognition. The Association is a Company limited by guarantee, No.2543446, and a non-profit-making body, registered in England and Wales as Charity No.1002307 (Registered Office: Dept. of Computer Science, Durham University, South Road, Durham, DH1 3LE, UK).