Towards Auditory General Intelligence
IRB 0318 (Gannon) or https://umd.zoom.us/j/93754397716?pwd=GuzthRJybpRS8HOidKRoXWcFV7sC4c.1
Perception of audio events, music, and speech plays a fundamental role in how humans interact with the world; auditory perception is equally vital to animals' survival. I will first briefly review fundamental problems in computational audition, which, along with the rest of this talk, will also be the subject of my spring 2026 course, CMSC848U: Computational Audition.
Large language models have absorbed vast amounts of knowledge, and the scientific community is rapidly working to leverage this in a variety of domains, far beyond chat or coding. However, language models currently lag in auditory scene understanding, speech and nonspeech voiced communication, and music analysis, all of which are central facets of human intelligence.

Over the past three years, Large Audio Language Models (LALMs), which process audio inputs via text queries, have grown increasingly capable. These models are built by creating shared representations for language and audio and fine-tuning language models with supervised learning. I will discuss active research in this area, including significant contributions from our group at UMD, particularly Sreyan Ghosh and Sonal Kumar (co-advised by Prof. Manocha), and from our summer project at the JSALT 2025 Workshop. I will present an overview of our models: COMPA (ICLR 2024, spotlight), GAMA (EMNLP 2024, oral), Audio Flamingo 2 (ICML 2025), Audio Flamingo 3 (AF3; NeurIPS 2025, spotlight), and Music Flamingo (MF; submitted, arXiv). AF3 and MF are the leading open-source models for audio, speech, and music understanding. I will describe the training process and highlight open research questions, such as extending these models to multichannel and spatial audio and incorporating reinforcement learning (RL).
Benchmarking has been crucial for LLM development, yet benchmarks for LALMs were long absent. Our group created MMAU (ICLR 2025, spotlight), the first comprehensive audio benchmark, now widely used for evaluating LALMs. To build an even more comprehensive benchmark, we enlisted numerous experts at JSALT 2025 and developed MMAUPro (arXiv 2025). I will conclude with thoughts on advancing foundation models for audio and other physical signal domains.