PhD Proposal: Advancing Audio Intelligence for Perception, Reasoning, and Generation

Talk

Sonal Kumar

Time:

11.24.2025 13:30 to 14:30

Location:

https://umd.zoom.us/j/3073988210?pwd=OUpOeXhRN05ueXBwZ0JMNkRPbWZ6Zz09&om...

URL:

https://talks.cs.umd.edu/talks/4423

Audio - spanning speech, music, and environmental sound - is central to perception yet remains underused in AI, limiting truly multimodal systems. Large Audio Language Models (LALMs) are now on the rise: they unify perception, reasoning, and generation in a single architecture, replacing the need for distinct models for different foundational tasks such as ASR and captioning, and enabling tasks from question answering to controllable audio generation.
In this talk, I present my research to date, which addresses several bottlenecks in building such models through novel models, datasets, and task formulations. I also outline future directions aimed at developing more robust, generalizable, and unified audio-language models capable of both reasoning and generation.