This paper introduces V2A-DPO, a Direct Preference Optimization (DPO) framework tailored to flow-based video-to-audio (V2A) generation models, designed to align generated audio with human preferences. Our approach comprises three core innovations: (1) AudioScore, a human-preference-aligned scoring system that assesses the semantic consistency, temporal alignment, and perceptual quality of synthesized audio; (2) an automated, AudioScore-driven pipeline that produces large-scale preference-pair data for DPO training; and (3) a curriculum-learning-based DPO optimization strategy specifically adapted to flow-based generative models. Experiments on the VGGSound benchmark show that Frieren and MMAudio aligned with V2A-DPO outperform both their pre-trained baselines and counterparts optimized with Denoising Diffusion Policy Optimization (DDPO). Furthermore, our DPO-optimized MMAudio achieves state-of-the-art performance across multiple metrics, surpassing published V2A models.
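To make the two mechanisms named above more concrete, the sketch below illustrates (a) turning AudioScore-style rankings into winner/loser preference pairs and (b) a DPO-style objective adapted to a flow-matching (rectified-flow) model. This is a minimal sketch, not the authors' implementation: the `audioscore` callable, the `model(x_t, t, video_feat)` velocity-prediction signature, the use of the per-sample flow-matching error as a likelihood surrogate (in the spirit of Diffusion-DPO), and the `beta` value are all assumptions for illustration.

```python
# Minimal sketch (assumptions noted above), not the paper's actual code.
import torch
import torch.nn.functional as F


def build_preference_pairs(candidates, audioscore):
    """Rank candidate audio latents for one video and return (winner, loser).

    candidates: list of audio latent tensors sampled from the pre-trained model.
    audioscore: hypothetical callable returning a scalar preference score
                (assumed to aggregate semantic, temporal, and quality terms).
    """
    scores = torch.tensor([audioscore(a) for a in candidates])
    return candidates[int(scores.argmax())], candidates[int(scores.argmin())]


def flow_dpo_loss(policy, reference, video_feat, x_w, x_l, beta=0.1):
    """DPO-style loss for a flow-based V2A model (sketch).

    x_w / x_l: preferred and dispreferred audio latents, shape [B, D].
    The per-sample flow-matching error stands in for the negative
    log-likelihood, mirroring the Diffusion-DPO adaptation.
    """
    def fm_error(model, x1):
        # Sample a time step and the straight-line interpolation noise -> data.
        t = torch.rand(x1.shape[0], device=x1.device)
        x0 = torch.randn_like(x1)
        xt = (1 - t.view(-1, 1)) * x0 + t.view(-1, 1) * x1
        target_v = x1 - x0                      # rectified-flow velocity target
        pred_v = model(xt, t, video_feat)       # conditional velocity prediction
        return F.mse_loss(pred_v, target_v, reduction="none").mean(dim=-1)

    # Lower flow-matching error ~ higher implicit likelihood under the model.
    err_w, err_l = fm_error(policy, x_w), fm_error(policy, x_l)
    with torch.no_grad():
        ref_w, ref_l = fm_error(reference, x_w), fm_error(reference, x_l)

    # The policy should reduce error on winners more than on losers,
    # relative to the frozen reference model.
    margin = -beta * ((err_w - ref_w) - (err_l - ref_l))
    return -F.logsigmoid(margin).mean()
```

A curriculum, as mentioned in innovation (3), could plausibly be layered on top of this by ordering training pairs (e.g., from large to small AudioScore margins), though the abstract does not specify the schedule.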
Figure 1: Our proposed V2A-DPO framework.
Figure 2: MMAudio model architecture.
Figure 3: Frieren model architecture.
| Demo Type | Groundtruth | MMAudio | MMAudio-DPO | MMAudio-DDPO | ThinkSound | FoleyCrafter |
|---|---|---|---|---|---|---|
| Visual cues demo (1) | | | | | | |
| Visual cues demo (2) | | | | | | |
| Visual cues demo (3) | | | | | | |
| Visual cues demo (4) | | | | | | |
| Video with prompt "seagulls" | None | | | | | |
| Video with prompt "noisy, people talking" | None | | | | | |

| Demo Type | Groundtruth | MMAudio | MMAudio-DPO | MMAudio-DDPO | ThinkSound | Frieren | V2A-Mapper |
|---|---|---|---|---|---|---|---|
| VGGSound test demo (1) | | | | | | | |
| VGGSound test demo (2) | | | | | | | |
| VGGSound test demo (3) | | | | | | | |
| VGGSound test demo (4) | | | | | | | |
[1] Ho Kei Cheng et al., "MMAudio: Taming multimodal joint training for high-quality video-to-audio synthesis," in CVPR, 2025.
[2] Yongqi Wang et al., "Frieren: Efficient video-to-audio generation with rectified flow matching," in NeurIPS, 2024.
[3] Honglie Chen et al., "VGGSound: A large-scale audio-visual dataset," in ICASSP, 2020.
[4] Yazhou Xing et al., "Seeing and hearing: Open-domain visual-audio generation with diffusion latent aligners," in CVPR, 2024.
[5] Yiming Zhang et al., "FoleyCrafter: Bring silent videos to life with lifelike and synchronized sounds," arXiv preprint arXiv:2407.01494, 2024.
[6] Heng Wang et al., "V2A-Mapper: A lightweight solution for vision-to-audio generation by connecting foundation models," in AAAI, 2024.
[7] Ilpo Viertola et al., "Temporally aligned audio for video with autoregression," in ICASSP, 2025.
[8] Huadai Liu et al., "ThinkSound: Chain-of-thought reasoning in multimodal large language models for audio generation and editing," arXiv preprint arXiv:2506.21448, 2025.