Teaching AI to Listen: UMD Advances Audio Timestamp Captioning with Adobe Research

In collaboration with Adobe, CS Ph.D. student Sonal Kumar advances audio timestamp captioning to improve how AI understands speech and complex sounds.  

From voice assistants answering questions to automated captions appearing on videos, artificial intelligence increasingly relies on audio to interact with people. Yet many systems still struggle to interpret sound the way humans do, especially when speech includes different accents or when multiple sounds occur at once. At the University of Maryland, Ph.D. student Sonal Kumar is working to address those challenges through timestamped audio captioning—an emerging approach that helps AI understand not only what it hears, but also when sounds occur.

Advised by Professor Ramani Duraiswami and Distinguished University Professor Dinesh Manocha, Kumar is developing timestamped audio captioning to improve how AI systems identify and interpret audio events in real-world environments.

Developed in close collaboration with Adobe Research, the work focuses on improving how large audio-language models identify and describe events in real-world sound environments. Kumar’s relationship with Adobe began through an internship and expanded into a broader research partnership aimed at bringing advances in audio understanding to production settings. He said the collaboration has been instrumental in shaping both the technical direction of the project and its practical relevance.

“Working with Adobe gave us the opportunity to think about how these models would actually be used,” Kumar said. “It pushed us to consider both the research challenges and the real-world impact.”

At the center of this work is Audio Timestamp Captioning, which goes beyond traditional audio captioning by assigning both descriptive labels and temporal boundaries to sounds. Rather than simply identifying that speech, music or background noise is present, the model aims to determine precisely when each event starts and ends, even when multiple sounds overlap.
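To make the idea concrete, here is a minimal sketch, not Kumar’s actual system or output format, of how timestamped captions might be represented: each event pairs a free-text label with a start and end time, and events are allowed to overlap.

```python
from dataclasses import dataclass

@dataclass
class AudioEvent:
    """One captioned sound event with temporal boundaries (in seconds)."""
    label: str    # free-text description of the sound
    start: float  # when the event begins
    end: float    # when the event ends

# Hypothetical output for a ten-second clip; note that the speech,
# the dog bark, and the background noise overlap in time.
events = [
    AudioEvent(label="a man speaking", start=0.0, end=6.2),
    AudioEvent(label="a dog barking", start=4.1, end=7.8),
    AudioEvent(label="traffic noise in the background", start=0.0, end=10.0),
]

for e in events:
    print(f"[{e.start:4.1f}s - {e.end:4.1f}s] {e.label}")
```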

“In AI, everything depends on data,” Kumar said. “Models have been very successful with text and images because they have large labeled datasets. With audio, even when recordings exist, we often do not have detailed labels describing what is happening and when.”

That gap is what Audio Timestamp Captioning seeks to fill. By pairing language with timing information, the approach helps AI systems distinguish overlapping events, such as two people speaking at once or speech occurring alongside environmental noise. This kind of fine-grained understanding is essential for interpreting the structure and dynamics of real-world audio.

“If two people are talking, their voices can overlap,” Kumar said. “You need to detect the start and end of each event, even when they occur together. Understanding that structure is important if systems are going to interpret the dynamics of real environments.”
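Once events carry temporal boundaries, detecting that two sounds co-occur reduces to a standard interval-overlap check. A brief illustrative sketch (the helper function and the example times are ours, not from the research):

```python
def overlaps(a_start: float, a_end: float,
             b_start: float, b_end: float) -> bool:
    """Two timed events overlap if each starts before the other ends."""
    return a_start < b_end and b_start < a_end

# Speech from 0.0-6.2 s and a dog bark from 4.1-7.8 s co-occur:
print(overlaps(0.0, 6.2, 4.1, 7.8))  # True
# Speech from 0.0-3.0 s and music from 5.0-9.0 s do not:
print(overlaps(0.0, 3.0, 5.0, 9.0))  # False
```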

The implications are significant. Audio Timestamp Captioning could support real-time, accurately timed subtitles for media, improving accessibility for deaf and hard-of-hearing audiences. It also has potential applications in video editing, content indexing, interactive media systems and future AI assistants that need to make sense of complex soundscapes.

To showcase this work, Kumar created an interactive project site featuring audio and video demos that illustrate Audio Timestamp Captioning in action. The demonstrations offer a clear window into the research and its potential.

Kumar’s interest in audio AI grew from personal experience with voice interfaces that struggled to recognize speech variations, particularly across accents. As someone from India, he saw firsthand how current systems often fail in linguistically diverse settings.

“In India, there are many accents, even within the same language,” Kumar said. “Systems often get confused when people from different regions say the same word.”

Beyond speech recognition, Kumar studies how AI interprets complex sound environments where conversations, background noise and ambient sounds overlap. These layered soundscapes remain difficult for current models to parse, making audio one of the more challenging frontiers in artificial intelligence.

“Audio has always been the native modality in which we communicate, but it has not received the attention that it deserves,” Kumar said. “We are trying to make audio a first-class component so AI assistants can understand sound more like humans do.”

Kumar describes the long-term goal of this work as achieving “audio general intelligence,” where AI systems can perceive, understand and generate audio with fluency comparable to human listening and speaking.

He said his experience in UMD’s computer science program has supported that work.

“The research-based curriculum at UMD helps prepare students to work on challenging problems,” Kumar said. “We have the resources and advising that help us think about problems the research community is trying to solve.”

Looking ahead, Kumar said audio-aware AI systems could play a role in technologies such as wearable devices and robotics. Smart glasses, for example, may rely on audio understanding to provide real-time transcription or translation, while robots could use sound cues to better interpret their surroundings.

“With the pace at which the field is moving,” Kumar said, “we may begin to see these capabilities emerging within the next few years.”

###

Co-authors on the work include Prem Seetharaman, Ke Chen, Oriol Nieto, Jiaqi Su, Zhepei Wang, Rithesh Kumar, Nicholas J. Bryan, Zeyu Jin and Justin Salamon of Adobe Research, and Dinesh Manocha of the University of Maryland.

—Story by Samuel Malede Zewdu, CS Communications 

The Department welcomes comments, suggestions and corrections.  Send email to editor [-at-] cs [dot] umd [dot] edu.