April 28th, 2026
Abstract: Recent audio models have significantly improved performance on tasks such as automatic speech recognition, but important challenges remain: models can struggle on languages with limited data, modalities such as vision are not naturally incorporated, and modern audio large language models still require better reasoning over diverse audio. In this thesis, we propose audio understanding models that effectively combine information from different modalities, including vision and text, perform better across low-resource languages, and maintain performance in noisy audio scenarios.
First, we compare two foundational multilingual speech models, XLS-R and Whisper, for speech recognition on languages that were seen and unseen during pre-training. Through fine-tuning experiments and analysis of the pre-training data, we show that the amount of audio seen per language and language family strongly influences downstream performance and helps explain how the models adapt to new languages. These experiments also establish Whisper as a strong model that can be adapted to new applications.
Second, we propose Whisper-Flamingo, a method that integrates lip-based visual features into Whisper using gated cross-attention. This adaptation enables strong noise-robust audio-visual speech recognition on English videos and also supports English-to-multilingual speech translation with a single model. Third, we extend this framework to multilingual videos with mWhisper-Flamingo. We show that transferring the English training recipe directly is insufficient, and we introduce decoder modality dropout to better integrate audio and video across nine languages. The resulting model achieves state-of-the-art multilingual audio-visual speech recognition and consistently outperforms audio-only baselines in noisy conditions.
Finally, we introduce Omni-R1, a reinforcement learning adaptation of Qwen2.5-Omni for audio question answering. Through controlled experiments with and without audio, we find that a substantial portion of the improvement comes from better text-based reasoning and, surprisingly, that text-only fine-tuning can still yield large gains on audio benchmarks. By scaling training with automatically generated audio question-answer data, Omni-R1 achieves new state-of-the-art performance.