UMD Team Unveils RECAP for Advanced Audio Captioning

REtrieval-Augmented Audio CAPtioning (RECAP) can caption previously unheard audio events and complex audio clips that contain multiple sounds.

In a paper submitted to the arXiv preprint server, researchers from the University of Maryland propose REtrieval-Augmented Audio CAPtioning (RECAP), a novel technique that improves audio captioning performance when generalizing across domains. The research team, led by Department of Computer Science Professors Ramani Duraiswami and Dinesh Manocha and including master's students Chandra Kiran Reddy Evuru, Sreyan Ghosh and Sonal Kumar, conducted a study that demonstrates RECAP's performance on benchmark datasets. The study highlights the technique's ability to caption previously unheard audio events and complex audio clips that contain multiple sounds.

Audio captioning aims to generate natural language descriptions of environmental audio content rather than to transcribe speech. Mapping audio to text supports real-world applications across a range of fields. Most existing methods use encoder-decoder architectures that pair a pre-trained audio encoder with a text decoder. However, their performance degrades significantly on out-of-distribution test domains, which limits their usefulness in practice.

The researchers hypothesize that this challenge stems from distribution shifts in audio events across domains. For example, the AudioCaps dataset contains sounds such as jazz music and interviews, which the Clotho dataset does not include. In real-world scenarios, new audio concepts also emerge within a domain over time. The researchers propose RECAP to address this poor generalization across domains.


The Department welcomes comments, suggestions and corrections.  Send email to editor [-at-] cs [dot] umd [dot] edu.