Audio-Guided Visual Editing with Complex Multi-Modal Prompts

Hyeonyu Kim (MAUM AI Inc.), Seokhoon Jeong (Ulsan National Institute of Science and Technology), Seonghee Han (Ulsan National Institute of Science and Technology), Chanhyuk Choi (Ulsan National Institute of Science and Technology), Taehwan Kim (Ulsan National Institute of Science and Technology)

The 36th British Machine Vision Conference


Abstract

Visual editing with diffusion models has made significant progress but often struggles with complex scenarios that textual guidance alone cannot adequately describe, highlighting the need for additional non-text editing prompts. In this work, we introduce a novel audio-guided visual editing framework that can handle complex editing tasks with multiple text and audio prompts without requiring additional training. Existing audio-guided visual editing methods often require training on specific datasets to align audio with text, limiting their generalization to real-world situations. We leverage a pre-trained multi-modal encoder with strong zero-shot capabilities and integrate diverse audio into visual editing tasks by alleviating the discrepancy between the audio encoder space and the diffusion model's prompt encoder space. Additionally, we propose a novel approach to handle complex scenarios with multiple, multi-modal editing prompts through separate noise branching and adaptive patch selection. Our comprehensive experiments on diverse editing tasks demonstrate that our framework excels at handling complicated editing scenarios by incorporating rich information from audio, where text-only approaches fail.

Citation

@inproceedings{Kim_2025_BMVC,
author    = {Hyeonyu Kim and Seokhoon Jeong and Seonghee Han and Chanhyuk Choi and Taehwan Kim},
title     = {Audio-Guided Visual Editing with Complex Multi-Modal Prompts},
booktitle = {36th British Machine Vision Conference 2025, {BMVC} 2025, Sheffield, UK, November 24-27, 2025},
publisher = {BMVA},
year      = {2025},
url       = {https://bmva-archive.org.uk/bmvc/2025/assets/papers/Paper_755/paper.pdf}
}

Copyright © 2025 The British Machine Vision Association and Society for Pattern Recognition
The British Machine Vision Conference is organised by The British Machine Vision Association and Society for Pattern Recognition. The Association is a Company limited by guarantee, No.2543446, and a non-profit-making body, registered in England and Wales as Charity No.1002307 (Registered Office: Dept. of Computer Science, Durham University, South Road, Durham, DH1 3LE, UK).

