Diffusion Transformer-to-Mamba Distillation for High-Resolution Image Generation

Yuan Yao (University of Rochester), Yicong Hong (Adobe Research), Difan Liu (Adobe Research), Long Mai (Adobe Research), Feng Liu (Adobe Research), Jiebo Luo (University of Rochester)

The 36th British Machine Vision Conference

Abstract

The quadratic computational complexity of self-attention in diffusion transformers (DiT) makes high-resolution image generation expensive. Although the linear-complexity Mamba model has emerged as a potential alternative, training Mamba directly remains empirically challenging. To address this issue, this paper introduces diffusion transformer-to-Mamba distillation (T2MD), an efficient training pipeline that facilitates the transition from a self-attention-based transformer to the linear-complexity state-space model Mamba. We establish a hybrid diffusion model combining self-attention and Mamba that achieves both efficiency and global dependencies. With the proposed layer-level teacher forcing and feature-based knowledge distillation, T2MD alleviates the difficulty and high cost of training a state-space model from scratch. Starting from the distilled $512 \times 512$ resolution base model, we push generation towards $2048 \times 2048$ images via lightweight adaptation and high-resolution fine-tuning. Experiments demonstrate that our training path yields high-quality text-to-image generation with low overhead. Importantly, our results also support the feasibility of using sequential, causal Mamba models to generate non-causal visual output, suggesting potential for future exploration.
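
To make the complexity argument concrete, the following is a rough sketch (not taken from the paper) of how pairwise self-attention cost and linear-scan cost grow when moving from $512 \times 512$ to $2048 \times 2048$ images. The 16-pixel token side and the names `num_tokens` and `relative_cost` are illustrative assumptions; the tokenization actually used by T2MD is not stated in this abstract.

```python
def num_tokens(resolution: int, pixels_per_token: int = 16) -> int:
    """Token count for a square image.

    Assumption (hypothetical): each token covers a
    pixels_per_token x pixels_per_token patch of the image.
    """
    side = resolution // pixels_per_token
    return side * side


def relative_cost(base_res: int, target_res: int, pixels_per_token: int = 16) -> dict:
    """Growth factors when scaling from base_res to target_res.

    'linear' models a cost proportional to sequence length (as in a
    state-space scan); 'quadratic' models a cost proportional to the
    number of token pairs (as in full self-attention).
    """
    n_base = num_tokens(base_res, pixels_per_token)
    n_target = num_tokens(target_res, pixels_per_token)
    token_ratio = n_target / n_base
    return {
        "tokens": token_ratio,
        "linear": token_ratio,
        "quadratic": token_ratio ** 2,
    }


ratios = relative_cost(512, 2048)
print(ratios)
```

Under these assumptions the sequence becomes 16 times longer, so a linear-cost layer does about 16 times the work while a pairwise-attention layer does about 256 times the work. The ratios do not depend on the assumed patch size, only on the resolution change.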

Citation

@inproceedings{Yao_2025_BMVC,
author    = {Yuan Yao and Yicong Hong and Difan Liu and Long Mai and Feng Liu and Jiebo Luo},
title     = {Diffusion Transformer-to-Mamba Distillation for High-Resolution Image Generation},
booktitle = {36th British Machine Vision Conference 2025, {BMVC} 2025, Sheffield, UK, November 24-27, 2025},
publisher = {BMVA},
year      = {2025},
url       = {https://bmva-archive.org.uk/bmvc/2025/assets/papers/Paper_931/paper.pdf}
}

Copyright © 2025 The British Machine Vision Association and Society for Pattern Recognition
