UMD Researchers Release GAMA, an LLM with Advanced Audio Understanding

GAMA processes non-speech sounds and non-verbal speech to provide detailed responses.

Imagine robots that can listen to every sound and interpret its meaning, from the rustling of leaves to the hum of a distant engine. Envision machines that not only recognize spoken words but also understand the emotional nuances in a baby's cry or the urgency in a fire alarm. This scenario is closer to reality thanks to innovative research at the University of Maryland’s Departments of Computer Science and Electrical & Computer Engineering. 

The UMD researchers leading this effort have unveiled GAMA, a large language model (LLM) capable of understanding and processing a wide range of non-speech sounds and non-verbal speech. The model represents a significant development in audio understanding, offering detailed responses to complex queries based on audio inputs.

The research team behind GAMA includes nine experts in the field, among them University of Maryland students Sreyan Ghosh, Chandra Kiran Reddy Evuru, Sonal Kumar, S Sakshi and Utkarsh Tyagi, as well as UMD computer science faculty members and project advisors Professor Ramani Duraiswami and Distinguished University Professor Dinesh Manocha. Adobe, a leading digital media and marketing solutions company, is also a significant contributor to the project.

“GAMA is among the first audio large language models to perform complex reasoning from audio or sound signals,” Manocha said. “This includes capabilities for audio scene understanding of various sounds and their context. It can be combined with multimodal LLMs to enhance their capabilities.”

A leap in audio understanding

GAMA, short for General-purpose Large Audio-Language Model, integrates a language model with multiple audio representations. These representations are derived from a custom-built Audio Q-Former, a multi-layer aggregator that compiles features from various layers of an audio encoder. This allows GAMA to have a nuanced understanding of audio inputs.
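For readers who want a concrete picture, the sketch below shows one way a multi-layer aggregator of this kind could be wired up: features from several encoder layers are mixed, summarized by a small set of learnable query tokens (Q-Former style), and projected into the language model's embedding space. The module names, dimensions and layer choices are illustrative assumptions for this example, not GAMA's actual implementation.

```python
import torch
import torch.nn as nn

class MultiLayerAggregator(nn.Module):
    """Illustrative sketch: pool hidden states from several audio-encoder layers,
    then let learnable query tokens attend to them (Q-Former style).
    Sizes and layer choices here are assumptions, not GAMA's actual configuration."""

    def __init__(self, d_audio=768, d_llm=4096, n_queries=64, n_layers_used=4):
        super().__init__()
        # Learnable query tokens that will summarize the audio features
        self.queries = nn.Parameter(torch.randn(n_queries, d_audio) * 0.02)
        # One cross-attention block; a full Q-Former stacks several of these
        self.cross_attn = nn.MultiheadAttention(d_audio, num_heads=8, batch_first=True)
        # Weights for mixing features taken from different encoder layers
        self.layer_weights = nn.Parameter(torch.ones(n_layers_used))
        # Project the aggregated audio tokens into the LLM's embedding space
        self.to_llm = nn.Linear(d_audio, d_llm)

    def forward(self, layer_feats):
        # layer_feats: list of (batch, time, d_audio) tensors, one per chosen encoder layer
        w = torch.softmax(self.layer_weights, dim=0)
        mixed = sum(wi * f for wi, f in zip(w, layer_feats))   # weighted layer mix
        q = self.queries.unsqueeze(0).expand(mixed.size(0), -1, -1)
        summary, _ = self.cross_attn(q, mixed, mixed)          # queries attend to audio
        return self.to_llm(summary)                            # (batch, n_queries, d_llm)

# Example: features from four encoder layers for two ~10-second clips
feats = [torch.randn(2, 500, 768) for _ in range(4)]
audio_tokens = MultiLayerAggregator()(feats)
print(audio_tokens.shape)  # torch.Size([2, 64, 4096])
```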

Understanding non-speech sounds and non-verbal speech is essential for making decisions in various environments. These auditory cues are vital in everyday life. 

“We encounter many non-verbal cues in our daily lives, from the hum of a refrigerator to the warning sounds of an approaching vehicle,” Manocha shared. “Our goal with GAMA is to create a tool to interpret these sounds and provide meaningful responses.”

Developing GAMA

The team fine-tuned GAMA on a large-scale audio-language dataset, enhancing its capability to understand complex audio. They also introduced CompA-R (Instruction-Tuning for Complex Audio Reasoning), a synthetically generated dataset designed to challenge the model with scenarios that require sophisticated reasoning based on audio inputs.
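To make the idea concrete, here is a minimal sketch of what a single instruction-tuning record for complex audio reasoning could look like. The field names, file path and text are invented for illustration and are not drawn from the actual CompA-R data.

```python
# Hypothetical instruction-tuning record: an audio clip paired with a question
# that cannot be answered from a single event tag alone.
example = {
    "audio": "clips/street_corner_0042.wav",           # hypothetical file path
    "events": ["car horn", "crowd chatter", "siren"],   # tags detected in the clip
    "instruction": "Why might pedestrians at this location feel a sense of urgency?",
    "response": (
        "A siren is approaching while a car horn sounds nearby, suggesting an "
        "emergency vehicle is trying to pass through congested traffic, which "
        "would prompt pedestrians to clear the area quickly."
    ),
}
```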

“To understand audio (non-speech sound), we need to know how it is composed,” Duraiswami said. “Machine learning algorithms need to be trained on data that include such labeling. A major service this work provides to the community is a dataset that includes labeled audio samples and algorithms for training a large compositional audio model.”

A crucial aspect of GAMA’s development was the addition of a soft prompt that provides high-level semantic evidence using event tags from the input audio. This enables the model to perform at a higher level of reasoning and understanding.
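One plausible way to realize such a soft prompt, sketched below under assumed sizes and a made-up tag vocabulary, is to embed the predicted event tags and fold them into a small set of learned prompt vectors that are prepended to the audio tokens before they reach the language model. This is an illustration of the general idea, not GAMA's actual code.

```python
import torch
import torch.nn as nn

class EventTagSoftPrompt(nn.Module):
    """Illustrative sketch: turn predicted event tags into learned prompt vectors
    prepended to the audio tokens fed to the LLM. The tag vocabulary and
    dimensions are assumptions for this example."""

    def __init__(self, tag_vocab, d_llm=4096, prompt_len=8):
        super().__init__()
        self.tag_to_id = {t: i for i, t in enumerate(tag_vocab)}
        self.tag_emb = nn.Embedding(len(tag_vocab), d_llm)
        # Learnable carrier tokens that absorb the tag information
        self.prompt = nn.Parameter(torch.randn(prompt_len, d_llm) * 0.02)

    def forward(self, tags, audio_tokens):
        # tags: list of strings predicted for the clip, e.g. ["siren", "car horn"]
        ids = torch.tensor([self.tag_to_id[t] for t in tags])
        tag_vec = self.tag_emb(ids).mean(dim=0)             # pooled tag evidence
        soft = (self.prompt + tag_vec).unsqueeze(0)          # (1, prompt_len, d_llm)
        soft = soft.expand(audio_tokens.size(0), -1, -1)
        # Prepend the soft prompt to the audio tokens before the language model
        return torch.cat([soft, audio_tokens], dim=1)

vocab = ["siren", "car horn", "crowd chatter", "dog bark"]
soft_prompter = EventTagSoftPrompt(vocab)
audio_tokens = torch.randn(2, 64, 4096)
llm_input = soft_prompter(["siren", "car horn"], audio_tokens)
print(llm_input.shape)  # torch.Size([2, 72, 4096])
```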

To evaluate GAMA, the researchers developed the CompA-R-test, a human-labeled dataset designed to test the model’s ability to answer open-ended questions about audio inputs. Both automated tests and expert evaluations showed that GAMA outperformed existing models, with performance improvements ranging from 1% to 84% across various tasks.

"ChatGPT revolutionized the way humans interact with AI in their daily lives," Ghosh said. "While the free version can process image and text inputs, it cannot yet handle audio inputs. Non-verbal sounds, like a car honking or a crowd clapping, are harder to process than verbal speech, which can be transcribed into text.”

Future of Bots 

GAMA’s advanced capabilities open up new possibilities across fields, from assistive technologies for the visually impaired to surveillance systems that can interpret environmental sounds.

“GAMA takes a significant leap in enabling AI to understand sounds crucial for real-world interactions,” Ghosh shared. “It can be used in traffic control, bird call identification, animal distress identification and emergency response. GAMA paves the way for robots that can see and hear, offering a richer understanding of their environment.”

—Story by Samuel Malede Zewdu, CS Communications

###

Other contributors to GAMA include:

Oriol Nieto (Adobe)

Ashish Seth (Indian Institute of Technology, Madras)