Prompt Image to Watch and Hear: Multimodal Prompting for Parameter-Efficient Audio-Visual Learning

Kai Wang (University of Toronto), Shentong Mo (Carnegie Mellon University), Yapeng Tian (University of Texas at Dallas), Dimitrios Hatzinakos (University of Toronto)

The 35th British Machine Vision Conference


Abstract

We explore how to enable static, pre-trained image models to watch and hear in the audio-visual domain with a limited set of trainable parameters. To this end, we propose Audio-Visual Spatial-Temporal-Fusion Prompting (AV-STFP), which gradually adapts a shared, frozen image model to learn audio-visual representations by decoupling the adaptation into temporal-aware prompting, spatial-aware prompting, and multimodal fusion prompting. First, temporal-aware prompting introduces a set of temporal prompts into the tokens along temporal frames so that the shared pre-trained image layers capture temporal patterns. Next, spatial-aware prompting inserts a set of spatial prompts into the tokens to enable the shared image layers to learn audio contexts and further enhance the visual semantics. Finally, multimodal fusion prompting incorporates a set of fusion prompts between audio and visual tokens so that the pre-trained image layers integrate the two modalities. By tuning only the inserted learnable prompts and freezing the pre-trained image backbone, AV-STFP efficiently transfers well-generalized image knowledge into the audio-visual domain at minimal cost. Extensive experiments on various audio-visual understanding tasks indicate that AV-STFP achieves competitive or even superior performance to state-of-the-art methods while using minimal trainable parameters (i.e., 0.6% of model parameters).
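The general idea of tuning only inserted prompts while the backbone stays frozen can be illustrated with a small sketch. It is not the authors' implementation: the names `PromptConfig`, `prompt_parameter_count`, `trainable_fraction`, and `prepend_prompts` are hypothetical, and every size below (token width, layer count, prompts per type, backbone parameter count) is an illustrative assumption rather than a value from the paper. The sketch shows how prompt vectors extend a token sequence and how the share of trainable parameters follows from the prompt sizes.

```python
from dataclasses import dataclass


@dataclass
class PromptConfig:
    # All sizes are illustrative assumptions, not values from the paper.
    embed_dim: int = 768       # token width of a ViT-B-style backbone
    num_layers: int = 12       # layers that receive their own prompts
    temporal_prompts: int = 8  # prompts per layer for temporal-aware prompting
    spatial_prompts: int = 8   # prompts per layer for spatial-aware prompting
    fusion_prompts: int = 8    # prompts per layer for multimodal fusion prompting


def prompt_parameter_count(cfg: PromptConfig) -> int:
    """Count learnable scalars when each prompt is one embed_dim-sized vector."""
    per_layer = cfg.temporal_prompts + cfg.spatial_prompts + cfg.fusion_prompts
    return per_layer * cfg.embed_dim * cfg.num_layers


def trainable_fraction(cfg: PromptConfig, backbone_params: int) -> float:
    """Share of all parameters that is trainable when the backbone is frozen."""
    trainable = prompt_parameter_count(cfg)
    return trainable / (trainable + backbone_params)


def prepend_prompts(prompts, tokens):
    """Return a new sequence with the prompt vectors placed before the tokens.

    The inputs are left unmodified; a frozen layer would then process the
    longer sequence, and only the prompt vectors would receive updates.
    """
    return [list(p) for p in prompts] + [list(t) for t in tokens]


if __name__ == "__main__":
    cfg = PromptConfig()
    # 86 million is a rough, assumed size for a ViT-B-style image backbone.
    fraction = trainable_fraction(cfg, backbone_params=86_000_000)
    print(f"trainable prompt parameters: {prompt_parameter_count(cfg)}")
    print(f"trainable share: {fraction:.4%}")
```

Under these assumed sizes the trainable share stays well below one percent, which is the regime the abstract describes; the actual prompt counts and insertion points used by AV-STFP are given in the paper itself.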

Citation

@inproceedings{Wang_2025_BMVC,
author    = {Kai Wang and Shentong Mo and Yapeng Tian and Dimitrios Hatzinakos},
title     = {Prompt Image to Watch and Hear: Multimodal Prompting for Parameter-Efficient Audio-Visual Learning},
booktitle = {36th British Machine Vision Conference 2025, {BMVC} 2025, Sheffield, UK, November 24-27, 2025},
publisher = {BMVA},
year      = {2025},
url       = {https://bmva-archive.org.uk/bmvc/2025/assets/papers/Paper_949/paper.pdf}
}

Copyright © 2025 The British Machine Vision Association and Society for Pattern Recognition
The British Machine Vision Conference is organised by The British Machine Vision Association and Society for Pattern Recognition. The Association is a Company limited by guarantee, No.2543446, and a non-profit-making body, registered in England and Wales as Charity No.1002307 (Registered Office: Dept. of Computer Science, Durham University, South Road, Durham, DH1 3LE, UK).

