This paper introduces V2A-DPO, a Direct Preference Optimization (DPO) framework tailored to flow-based video-to-audio (V2A) generation models, designed to align generated audio with human preferences. Our approach comprises three core innovations: (1) AudioScore, a human-preference-aligned scoring system that assesses the semantic consistency, temporal alignment, and perceptual quality of synthesized audio; (2) an automated, AudioScore-driven pipeline that produces large-scale preference-pair data for DPO training; and (3) a curriculum-learning-based DPO optimization strategy specifically adapted to flow-based generative models. Experiments on the VGGSound benchmark show that Frieren and MMAudio aligned with V2A-DPO outperform both their pre-trained baselines and counterparts optimized with Denoising Diffusion Policy Optimization (DDPO). Furthermore, our DPO-optimized MMAudio achieves state-of-the-art performance across multiple metrics, surpassing published V2A models.
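To make the two mechanisms named above more concrete, the sketch below illustrates (a) turning AudioScore-style rankings into winner/loser preference pairs and (b) a DPO-style objective adapted to a flow-matching (rectified-flow) model. This is a minimal sketch, not the authors' implementation: the `audioscore` callable, the `model(x_t, t, video_feat)` velocity-prediction signature, the use of the per-sample flow-matching error as a likelihood surrogate (in the spirit of Diffusion-DPO), and the `beta` value are all assumptions for illustration.

```python
# Minimal sketch (assumptions noted above), not the paper's actual code.
import torch
import torch.nn.functional as F


def build_preference_pairs(candidates, audioscore):
    """Rank candidate audio latents for one video and return (winner, loser).

    candidates: list of audio latent tensors sampled from the pre-trained model.
    audioscore: hypothetical callable returning a scalar preference score
                (assumed to aggregate semantic, temporal, and quality terms).
    """
    scores = torch.tensor([audioscore(a) for a in candidates])
    return candidates[int(scores.argmax())], candidates[int(scores.argmin())]


def flow_dpo_loss(policy, reference, video_feat, x_w, x_l, beta=0.1):
    """DPO-style loss for a flow-based V2A model (sketch).

    x_w / x_l: preferred and dispreferred audio latents, shape [B, D].
    The per-sample flow-matching error stands in for the negative
    log-likelihood, mirroring the Diffusion-DPO adaptation.
    """
    def fm_error(model, x1):
        # Sample a time step and the straight-line interpolation noise -> data.
        t = torch.rand(x1.shape[0], device=x1.device)
        x0 = torch.randn_like(x1)
        xt = (1 - t.view(-1, 1)) * x0 + t.view(-1, 1) * x1
        target_v = x1 - x0                      # rectified-flow velocity target
        pred_v = model(xt, t, video_feat)       # conditional velocity prediction
        return F.mse_loss(pred_v, target_v, reduction="none").mean(dim=-1)

    # Lower flow-matching error ~ higher implicit likelihood under the model.
    err_w, err_l = fm_error(policy, x_w), fm_error(policy, x_l)
    with torch.no_grad():
        ref_w, ref_l = fm_error(reference, x_w), fm_error(reference, x_l)

    # The policy should reduce error on winners more than on losers,
    # relative to the frozen reference model.
    margin = -beta * ((err_w - ref_w) - (err_l - ref_l))
    return -F.logsigmoid(margin).mean()
```

A curriculum, as mentioned in innovation (3), could plausibly be layered on top of this by ordering training pairs (e.g., from large to small AudioScore margins), though the abstract does not specify the schedule.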
Figure 1: Our proposed V2A-DPO framework.
Figure 2: MMAudio model architecture.
Figure 3: Frieren model architecture.
| Demo Type | Groundtruth | MMAudio | MMAudio-DPO | MMAudio-DDPO | ThinkSound | FoleyCrafter |
|---|---|---|---|---|---|---|
| Visual cues demo (1) | | | | | | |
| Visual cues demo (2) | | | | | | |
| Visual cues demo (3) | | | | | | |
| Visual cues demo (4) | | | | | | |
| Video with prompt "seagulls" | None | | | | | |
| Video with prompt "noisy, people talking" | None | | | | | |

| Demo Type | Groundtruth | MMAudio | MMAudio-DPO | MMAudio-DDPO | ThinkSound | Frieren | V2A-Mapper |
|---|---|---|---|---|---|---|---|
| VGGSound test demo (1) | | | | | | | |
| VGGSound test demo (2) | | | | | | | |
| VGGSound test demo (3) | | | | | | | |
| VGGSound test demo (4) | | | | | | | |
[1] Ho Kei Cheng et al., "MMAudio: Taming multimodal joint training for high-quality video-to-audio synthesis," in CVPR, 2025.
[2] Yongqi Wang et al., "Frieren: Efficient video-to-audio generation with rectified flow matching," in NeurIPS, 2024.
[3] Honglie Chen et al., "VGGSound: A large-scale audio-visual dataset," in ICASSP, 2020.
[4] Yazhou Xing et al., "Seeing and hearing: Open-domain visual-audio generation with diffusion latent aligners," in CVPR, 2024.
[5] Yiming Zhang et al., "FoleyCrafter: Bring silent videos to life with lifelike and synchronized sounds," arXiv preprint arXiv:2407.01494, 2024.
[6] Heng Wang et al., "V2A-Mapper: A lightweight solution for vision-to-audio generation by connecting foundation models," in AAAI, 2024.
[7] Ilpo Viertola et al., "Temporally aligned audio for video with autoregression," in ICASSP, 2025.
[8] Huadai Liu et al., "ThinkSound: Chain-of-thought reasoning in multimodal large language models for audio generation and editing," arXiv preprint arXiv:2506.21448, 2025.