PhD Proposal: Advancing Audio Intelligence for Perception, Reasoning, and Generation

Talk
Sonal Kumar
Time: 11.24.2025 13:30 to 14:30

Audio - spanning speech, music, and environmental sound - is central to perception yet remains underused in AI, limiting truly multimodal systems. Large Audio Language Models (LALMs) are on the rise: they unify perception, reasoning, and generation in a single architecture, replacing the need for distinct models for different foundational tasks such as ASR and captioning, and enabling tasks from question answering to controllable audio generation.
In this talk, I present my research to date, which addresses several of these bottlenecks through novel models, datasets, and task formulations. I also outline future directions aimed at developing more robust, generalizable, and unified audio-language models capable of both reasoning and generation.