PhD Defense: Advancing Audio Processing in the Age of Large Language Models

Talk
Sreyan Ghosh
Time: 04.20.2026, 15:30 to 17:30
Location: 

Understanding audio, encompassing speech, non-speech sounds, and music, is fundamental for AI systems to interact effectively with the world, yet audio processing has historically lagged behind language and vision due to data scarcity, limited architectures, and the inherent complexity of auditory signals. Recent advances in Large Language Models (LLMs) have begun to bridge this gap, demonstrating promising capabilities in tasks ranging from Automatic Speech Recognition and audio captioning to open-ended question answering and complex reasoning. My dissertation advances audio processing in the age of LLMs through contributions in open model development, scalable data curation, robust audio representations, long-form understanding, expert-level evaluation, and omni-modal reasoning.
In this talk, I will present the Audio Flamingo series, a family of fully open large audio-language models built with novel architectures, training curricula, and internet-scale data curation strategies, including over 1 million hours of carefully curated audio of varying lengths paired with skill-specific question-answer pairs. These models achieve state-of-the-art results across more than 20 benchmarks, surpassing both open-weight and closed-source models. I will discuss unified audio encoders such as AF-CLAP and AF-Whisper, trained on over 8 million audio-caption pairs, which bridge speech, sound, and music representation learning, and describe how we extend audio understanding from short clips to 30-minute contexts through new datasets, temporally grounded reasoning paradigms, and scaled training infrastructure. I will present Music Flamingo, which achieves expert-level music understanding through theory-grounded chain-of-thought reasoning and reinforcement learning, and UALM, which unifies audio understanding, generation, and reasoning within a single model. I will also introduce expert-level benchmarks such as MMAU and MMAU-Pro, spanning over 10,000 and 5,000 annotated instances respectively, that reveal significant gaps between current models and human-level audio reasoning: even the best models achieve only ~75% on MMAU, where humans reach ~82%, and just ~58% on the more challenging MMAU-Pro.
Finally, I will present MMOU, a large-scale benchmark of 15,000 question-answer pairs over 9,000 long-form real-world videos, where we demonstrate that audio intelligence is foundational, not peripheral, to video understanding: even the best proprietary systems fall more than 20 points short of human performance. Motivated by this gap, I will present Audio-Visual Flamingo, a fully open audio-visual language model designed to enable temporally grounded reasoning over long, complex real-world videos by jointly integrating audio and visual streams.