UniFlow-Audio: Unified Flow Matching for Audio Generation from Omni-Modalities

Xuenan Xu1,*,† Jiahao Mei2,5,* Zihao Zheng1,2 Ye Tao1,2 Zeyu Xie3 Yaoyun Zhang2 Haohe Liu4 Yuning Wu5 Ming Yan5 Wen Wu1 Chao Zhang1 Mengyue Wu2
1Shanghai Artificial Intelligence Lab 2Shanghai Jiao Tong University 3Peking University 4Meta 5Alibaba Group
*Equal Contribution †Project Lead

Abstract

Audio generation, including speech, music and sound effects, has advanced rapidly in recent years. These tasks can be divided into two categories: time-aligned (TA) tasks, where each input unit corresponds to a specific segment of the output audio (e.g., phonemes aligned with frames in speech synthesis); and non-time-aligned (NTA) tasks, where such alignment is not available. Since modeling paradigms for the two types are typically different, research on different audio generation tasks has traditionally followed separate trajectories. However, audio is not inherently divided into such categories, making a unified model a natural and necessary goal for general audio generation. Previous unified audio generation works have adopted autoregressive architectures, while unified non-autoregressive approaches remain largely unexplored. In this work, we propose UniFlow-Audio, a universal audio generation framework based on flow matching. We propose a dual-fusion mechanism that temporally aligns audio latents with TA features and integrates NTA features via cross-attention in each model block. Task-balanced data sampling is employed to maintain strong performance across both TA and NTA tasks. UniFlow-Audio supports omni-modalities, including text, audio, and video. By leveraging the advantages of multi-task learning and the generative modeling capabilities of flow matching, UniFlow-Audio achieves strong results across 7 tasks using fewer than 8K hours of public training data and under 1B trainable parameters. Even the small variant with only ~200M parameters shows competitive performance, highlighting UniFlow-Audio as a potential non-autoregressive foundation model for audio generation. Code and models will be available at https://wsntxxn.github.io/uniflow_audio.
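The abstract mentions task-balanced data sampling without giving the exact scheme. As an illustration only, one common way to balance tasks is to pick a task first and then an example within it, so that large datasets do not dominate training. The function name, the `temperature` parameter, and the toy corpus below are assumptions made for this sketch, not details from the paper.

```python
import random
from collections import Counter


def make_task_balanced_sampler(datasets, temperature=0.0, seed=0):
    """Return a function that draws (task, example) pairs.

    datasets: dict mapping a task name to a list of examples.
    temperature: 0.0 makes every task equally likely (fully balanced);
                 1.0 draws tasks in proportion to their dataset size.
    NOTE: this weighting rule is an assumption for illustration.
    """
    rng = random.Random(seed)
    tasks = sorted(datasets)
    # Weight each task by size ** temperature, so 0.0 yields uniform weights.
    weights = [len(datasets[t]) ** temperature for t in tasks]

    def sample():
        # Step 1: choose the task. Step 2: choose an example inside that task.
        task = rng.choices(tasks, weights=weights, k=1)[0]
        return task, rng.choice(datasets[task])

    return sample


# Hypothetical toy corpus: one task has far more examples than the others.
toy = {
    "tts": [f"tts_{i}" for i in range(1000)],
    "t2a": [f"t2a_{i}" for i in range(100)],
    "svs": [f"svs_{i}" for i in range(10)],
}
sampler = make_task_balanced_sampler(toy, temperature=0.0, seed=0)
counts = Counter(sampler()[0] for _ in range(3000))
```

With `temperature=0.0`, each of the three toy tasks is drawn roughly a third of the time even though their sizes differ by two orders of magnitude. Intermediate values trade off balance against how often small datasets are repeated.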

Method Overview

Method Overview Diagram

Overview of UniFlow-Audio. The content encoder and adapter transform the input and task instruction into a content embedding. Based on the predicted duration, the content embedding is expanded into a time-aligned content embedding. A dual-fusion mechanism is then applied: the latent is fused with the content embedding via cross-attention, and with the time-aligned content embedding via addition.
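The two steps in the caption, duration-based expansion and dual fusion, can be sketched in plain Python as follows. This is a minimal sketch under stated assumptions: the attention is single-head with no learned projections, the order of the two fusion steps and the residual connection are choices made for this example, and all function names are hypothetical. It is not the model's actual block.

```python
import math


def expand_by_duration(content, durations):
    """Repeat each content vector durations[i] times along the time axis."""
    expanded = []
    for vec, d in zip(content, durations):
        expanded.extend([list(vec) for _ in range(d)])
    return expanded


def cross_attention(queries, keys_values):
    """Single-head scaled dot-product attention without learned projections."""
    dim = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
                  for k in keys_values]
        # Numerically stable softmax over the content positions.
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        out.append([sum(w * kv[j] for w, kv in zip(weights, keys_values))
                    for j in range(dim)])
    return out


def dual_fusion(latent, ta_content, content):
    """Fuse time-aligned content by addition, then content by cross-attention."""
    assert len(latent) == len(ta_content), "TA content must match latent length"
    # Time-aligned fusion: frame-wise addition.
    fused = [[x + c for x, c in zip(frame, ta)]
             for frame, ta in zip(latent, ta_content)]
    # Non-time-aligned fusion: every frame attends over the whole content
    # sequence; the result is added back as a residual (an assumption here).
    attended = cross_attention(fused, content)
    return [[f + a for f, a in zip(frame, att)]
            for frame, att in zip(fused, attended)]


# Toy example: two content units with (assumed) predicted durations 2 and 1.
content = [[1.0, 0.0], [0.0, 1.0]]
durations = [2, 1]
ta_content = expand_by_duration(content, durations)  # 3 frames
latent = [[0.0, 0.0] for _ in range(3)]
fused_latent = dual_fusion(latent, ta_content, content)
```

Addition needs the content to have the same number of frames as the latent, which is why expansion by duration comes first. Cross-attention has no such constraint, so it can condition on content of any length.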

Sample Audio


🎧 Text-to-Audio (T2A)

Prompt Generated Audio
"pigeons coo and rustle."
"A man talking followed by a toilet flushing."
"An adult male speaks, and a crowd laughs and then cheers and applauds."
"Music in the background as a women speaks and food fries."
"A train is passing by and sound its whistle."

🎼 Text-to-Music (T2M)

Instruction Generated Audio
The song is an instrumental. The song is medium tempo played by a solo guitarist on a vintage super tone with an exquisite tone. The song is emotional and passionate. The song has poor audio quality.
The low quality recording features a Christmas song that consists of a widely spread, groovy piano melody, followed by synth choir keys. It sounds happy, joyful, fun and euphoric - as any Christmas song should sound.
This audio contains complex and fast acoustic drums with a lot of cymbal hits along with a tambourine shaker. E-guitars and e-bass are playing along to the drums. This piece is in a 7/4 time signature. This song may be playing at a live rock concert.
This is a Hindu music piece. There is a female vocalist singing at a medium-to-high pitch in a devotional manner. A sitar provides a melodic background. The tabla is being played in the rhythmic background. The atmosphere is spiritual. This piece could be played at religious events and online content related to Hindu religion.
The low quality recording features a mellow arpeggiated piano melody over which there is a theremin solo melody playing. It sounds sad, emotional and passionate. The recording is noisy.

🔊 Super-Resolution (SR)

Low Sampling Rate

High Sampling Rate

🎤 Speech Enhancement (SE)

Noisy Enhanced

📹 Video-to-Audio (V2A)

The input videos are muted, while the versions shown here include audio generated by our model.

Input Video (Muted) Video with Generated Audio

🎙 Singing Voice Synthesis (SVS)

Lyrics Audio
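看过黄昏追逐黎明 没看过你 (I have seen dusk chase the dawn, but I have never seen you)
听过天空拒绝飞鸟 (I have heard the sky turn away the birds)
能够拥抱的就别拉扯 时间着急的 (If you can embrace, do not pull away; time is in a hurry)
我想了很久 我开始慌了 (I thought for a long time, and I began to panic)
一杯敬故乡 一杯敬远方 (One cup to my hometown, one cup to faraway places)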

🗣️ Text-to-Speech (TTS)

Content Reference (Timbre) Generated Audio
Other circumstances permitting, that instinct disposes men to look with favor upon productive efficiency and on whatever is of human use.
Marie looked up at his defiant figure and her face clouded.
I entered, and I took you into my confidence as to the suggestions of the side table.
Then Mary Taylor, whose conscience was uncomfortable, said:
Soon after the station went into operation this ingenious plan was changed, and the third dynamo was replaced by two others.