V2A-DPO Project Icon

V2A-DPO: Omni-Preference Optimization For Video-to-Audio Generation

Authors: Nolan Chan, Timmy Gang, Yongqian Wang, Yuzhe Liang, Dingdong Wang

GitHub Project Link

Abstract

This paper introduces V2A-DPO, a novel Direct Preference Optimization (DPO) framework tailored to flow-based video-to-audio (V2A) generation models, with key adaptations that effectively align generated audio with human preferences. Our approach comprises three core innovations: (1) AudioScore, a comprehensive human-preference-aligned scoring system that assesses the semantic consistency, temporal alignment, and perceptual quality of synthesized audio; (2) an automated AudioScore-driven pipeline that generates large-scale preference-pair data for DPO optimization; and (3) a curriculum-learning-empowered DPO optimization strategy designed specifically for flow-based generative models. Experiments on the VGGSound benchmark [3] demonstrate that Frieren [2] and MMAudio [1] aligned with V2A-DPO outperform both their counterparts optimized with Denoising Diffusion Policy Optimization (DDPO) and the pre-trained baselines. Furthermore, our DPO-optimized MMAudio achieves state-of-the-art performance across multiple metrics, surpassing published V2A models.
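As a rough illustration of the AudioScore-driven data pipeline described above, the sketch below ranks multiple generated audio candidates per video and keeps the best/worst pair when the score gap is decisive. All names (`build_preference_pairs`, `audioscore`, `margin`) are hypothetical; the paper's actual pipeline and scoring model may differ.

```python
def build_preference_pairs(candidates_per_video, audioscore, margin=0.1):
    """Sketch of an AudioScore-driven preference-pair pipeline (names hypothetical).

    candidates_per_video: {video_id: [audio candidates]}
    audioscore: callable (video_id, audio) -> scalar preference score.
    Pairs whose score gap is below `margin` are discarded, so the DPO
    signal only comes from unambiguous winner/loser pairs.
    """
    pairs = []
    for video_id, candidates in candidates_per_video.items():
        # Rank candidates by their AudioScore for this video.
        scored = sorted(candidates, key=lambda a: audioscore(video_id, a))
        worst, best = scored[0], scored[-1]
        if audioscore(video_id, best) - audioscore(video_id, worst) >= margin:
            pairs.append({"video": video_id, "chosen": best, "rejected": worst})
    return pairs
```

One simple way to feed a curriculum schedule from this output would be to sort the resulting pairs by score gap (largest gap, i.e. easiest pairs, first), though the paper's exact curriculum criterion is not reproduced here.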

I. V2A-DPO Framework

Architecture of V2A-DPO

Figure 1: Our proposed V2A-DPO framework.
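At the heart of the framework is a DPO objective adapted to flow matching. A minimal sketch in the style of Diffusion-DPO, assuming rectified-flow training on a linear interpolation path; the function name, the β scale, and the shared-noise estimator are illustrative assumptions, not the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

def fm_error(net, x1, x0, t, cond):
    # Flow-matching regression error on the linear path x_t = (1 - t) x0 + t x1,
    # with rectified-flow velocity target v = x1 - x0.
    xt = (1 - t) * x0 + t * x1
    v_pred = net(xt, t, cond)
    return ((v_pred - (x1 - x0)) ** 2).flatten(1).mean(dim=1)

def flow_dpo_loss(model, ref_model, audio_w, audio_l, cond, beta=1000.0):
    # audio_w / audio_l: latents of the preferred / dispreferred audio
    # generated for the same video conditioning `cond`.
    x0 = torch.randn_like(audio_w)        # one noise draw shared by all four terms
    t = torch.rand(audio_w.shape[0], 1)   # one flow time shared by all four terms
    err_w = fm_error(model, audio_w, x0, t, cond)
    err_l = fm_error(model, audio_l, x0, t, cond)
    with torch.no_grad():                 # frozen reference model
        err_w_ref = fm_error(ref_model, audio_w, x0, t, cond)
        err_l_ref = fm_error(ref_model, audio_l, x0, t, cond)
    # Push the policy to lower its error on the winner and raise it on the
    # loser, both measured relative to the reference (Diffusion-DPO form).
    logits = -beta * ((err_w - err_w_ref) - (err_l - err_l_ref))
    return -F.logsigmoid(logits).mean()
```

Sharing the same noise draw and flow time across the policy and reference terms keeps the pairwise logit low-variance; when policy and reference agree, the logit is zero and the loss sits at log 2.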

II. MMAudio and Frieren Models

MMAudio [1]

Architecture of MMAudio model

Figure 2: MMAudio model architecture

Frieren [2]

Architecture of Frieren model

Figure 3: Frieren model architecture

III. Experimental Setup

IV. Demos

⏳ Due to the large number of videos, this page may take a moment to load. Thank you for your patience!

Demos from the FoleyCrafter [5] Demo Page

| Demo Type | Groundtruth | MMAudio | MMAudio-DPO | MMAudio-DDPO | ThinkSound | FoleyCrafter |
| --- | --- | --- | --- | --- | --- | --- |
| Visual cues demo (1) | | | | | | |
| Visual cues demo (2) | | | | | | |
| Visual cues demo (3) | | | | | | |
| Visual cues demo (4) | | | | | | |
| VGGSound test demo (1) | | | | | | |
| VGGSound test demo (2) | | | | | | |
| VGGSound test demo (3) | | | | | | |
| VGGSound test demo (4) | | | | | | |
| Video with prompt "seagulls" | None | | | | | |
| Video with prompt "noisy, people talking" | None | | | | | |

Manuscript Demo with Prompt "Playing Ukulele"

| Groundtruth | MMAudio | MMAudio-DPO | MMAudio-DDPO | ThinkSound | Frieren | V2A-Mapper |
| --- | --- | --- | --- | --- | --- | --- |

V. References

[1] Ho Kei Cheng et al., "MMAudio: Taming multimodal joint training for high-quality video-to-audio synthesis," in CVPR, 2025.

[2] Yongqi Wang et al., "Frieren: Efficient video-to-audio generation with rectified flow matching," in NeurIPS, 2024.

[3] Honglie Chen et al., "VGGSound: A large-scale audio-visual dataset," in ICASSP, 2020.

[4] Yazhou Xing et al., "Seeing and hearing: Open-domain visual-audio generation with diffusion latent aligners," in CVPR, 2024.

[5] Yiming Zhang et al., "FoleyCrafter: Bring silent videos to life with lifelike and synchronized sounds," arXiv preprint arXiv:2407.01494, 2024.

[6] Heng Wang et al., "V2A-Mapper: A lightweight solution for vision-to-audio generation by connecting foundation models," in AAAI, 2024.

[7] Ilpo Viertola et al., "Temporally aligned audio for video with autoregression," in ICASSP, 2025.

[8] Huadai Liu et al., "ThinkSound: Chain-of-thought reasoning in multimodal large language models for audio generation and editing," arXiv preprint arXiv:2506.21448, 2025.