Mask2Act: Predictive Multi-Object Tracking as Video Pre-Training for Robot Manipulation


Junbo Zhang (Tsinghua University), Kaisheng Ma (Tsinghua University)
The 36th British Machine Vision Conference

Abstract

Imitation learning from demonstrations is a promising paradigm in robot learning, but it requires a large amount of robot demonstration data, which is laborious to collect. To address this challenge, recent methods use action-free video data to extract skill knowledge and generate latent plans, which contain predicted future states and guide policy learning. In this paper, we introduce Mask2Act, a novel framework that uses action-free videos to train a predictive multi-object tracking model; the model predicts the future masks of any task-relevant object in the video, and these masks serve as the latent plans. The transition of object masks enriches the latent plans with accurate task-relevant motion knowledge and eliminates distracting information. The latent plans are then used to guide policy learning with limited training data. Experiments across 50 manipulation tasks in 2 simulated environments show that our method significantly outperforms other video pre-training methods. Furthermore, because Mask2Act inherently guides the model to focus on task-relevant knowledge rather than irrelevant background and distractor information, it demonstrates superior capability in extracting skill knowledge from cross-task videos and generalizing to new tasks and environments.
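
To make the notion of a mask-based latent plan concrete, the following toy sketch may help. It is not the authors' model: a constant-velocity extrapolation stands in for the learned predictive tracker, masks are represented as plain sets of pixels, and every name in it (`centroid`, `shift`, `predict_future_masks`) is hypothetical. It only illustrates the data flow the abstract describes, in which past masks of task-relevant objects go in and a sequence of predicted future masks comes out, with background pixels never entering the plan.

```python
from typing import Dict, List, Set, Tuple

Pixel = Tuple[int, int]  # (row, col)
Mask = Set[Pixel]        # binary mask stored as the set of foreground pixels


def centroid(mask: Mask) -> Tuple[float, float]:
    """Mean (row, col) of the foreground pixels of a non-empty mask."""
    n = len(mask)
    return (sum(r for r, _ in mask) / n, sum(c for _, c in mask) / n)


def shift(mask: Mask, dr: int, dc: int) -> Mask:
    """Translate every foreground pixel by (dr, dc)."""
    return {(r + dr, c + dc) for r, c in mask}


def predict_future_masks(
    prev: Dict[str, Mask], curr: Dict[str, Mask], horizon: int
) -> List[Dict[str, Mask]]:
    """Toy latent plan: extrapolate each object's mask at constant velocity.

    ASSUMPTION: this simple rule replaces the learned predictive
    multi-object tracker; it is only a placeholder for illustration.
    """
    plan: List[Dict[str, Mask]] = []
    for k in range(1, horizon + 1):
        step: Dict[str, Mask] = {}
        for name, mask in curr.items():
            (r0, c0), (r1, c1) = centroid(prev[name]), centroid(mask)
            # Velocity is the centroid displacement between the two frames.
            step[name] = shift(mask, round((r1 - r0) * k), round((c1 - c0) * k))
        plan.append(step)
    return plan


# Only the task-relevant object is tracked; the background is never represented.
prev_masks = {"cube": {(0, 0), (0, 1), (1, 0), (1, 1)}}
curr_masks = {"cube": shift(prev_masks["cube"], 0, 2)}  # cube moved 2 px right
latent_plan = predict_future_masks(prev_masks, curr_masks, horizon=2)
```

In a full system a policy would presumably be conditioned on something like `latent_plan` together with the current observation; how Mask2Act actually encodes and consumes the plan is specified in the paper, not here.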

Citation

@inproceedings{Zhang_2025_BMVC,
author    = {Junbo Zhang and Kaisheng Ma},
title     = {Mask2Act: Predictive Multi-Object Tracking as Video Pre-Training for Robot Manipulation},
booktitle = {36th British Machine Vision Conference 2025, {BMVC} 2025, Sheffield, UK, November 24-27, 2025},
publisher = {BMVA},
year      = {2025},
url       = {https://bmva-archive.org.uk/bmvc/2025/assets/papers/Paper_124/paper.pdf}
}


Copyright © 2025 The British Machine Vision Association and Society for Pattern Recognition
The British Machine Vision Conference is organised by The British Machine Vision Association and Society for Pattern Recognition. The Association is a Company limited by guarantee, No.2543446, and a non-profit-making body, registered in England and Wales as Charity No.1002307 (Registered Office: Dept. of Computer Science, Durham University, South Road, Durham, DH1 3LE, UK).
