MS Defense: Learning Representations for Audio

Talk
Siddhi Patil
Time: 
05.08.2026 14:30 to 16:00
Location: 
IRB 4137

Learning audio representations that support both spatial reasoning and semantic understanding remains challenging: models tailored to spatial audio often overfit to specific array geometries or phase-based features, while large-scale audio encoders typically ignore the structure of the sound field and focus solely on content. Recent self-supervised approaches that predict latent targets in feature space, rather than reconstructing raw signals, offer a promising alternative, but they have yet to fully reconcile these spatial and semantic requirements.
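As a rough illustration of latent-target prediction (a minimal sketch, not the specific pretext task presented in this thesis): a context encoder sees a partially masked spectrogram, and a small predictor regresses its features onto those of an exponential-moving-average target encoder at the masked positions. All module names, dimensions, and the masking scheme below are illustrative placeholders.

import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentPredictionPretext(nn.Module):
    """Predict the target encoder's features for masked patches
    instead of reconstructing the raw spectrogram."""

    def __init__(self, encoder: nn.Module, dim: int, ema_decay: float = 0.999):
        super().__init__()
        self.context_encoder = encoder
        # Target encoder is an EMA copy of the context encoder (no gradients).
        self.target_encoder = copy.deepcopy(encoder)
        for p in self.target_encoder.parameters():
            p.requires_grad_(False)
        self.predictor = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
        self.ema_decay = ema_decay

    @torch.no_grad()
    def update_target(self):
        # Exponential moving average of the context-encoder weights.
        for p_t, p_c in zip(self.target_encoder.parameters(),
                            self.context_encoder.parameters()):
            p_t.mul_(self.ema_decay).add_(p_c, alpha=1.0 - self.ema_decay)

    def forward(self, patches: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # patches: (B, N, D) spectrogram patch embeddings; mask: (B, N) bool,
        # True where a patch is hidden from the context encoder (zeroed here for simplicity).
        with torch.no_grad():
            targets = self.target_encoder(patches)  # features of the full input
        context = self.context_encoder(patches * (~mask).unsqueeze(-1))
        preds = self.predictor(context)
        # Regress predicted features onto target features at masked positions only.
        return F.mse_loss(preds[mask], targets[mask])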
In this work, we analyze the representations learned by self-supervised approaches and design self-supervised pretext tasks directly on unstructured data that bias the model towards representations useful for a broader range of tasks. Operating on unstructured data introduces additional computational overhead compared to using pre-computed features; we mitigate this by replacing standard softmax attention with a normalized attention mechanism, reducing the cost of transformer-based encoders while preserving their ability to model long-range dependencies in spectrogram inputs.
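The abstract does not specify the exact normalized attention mechanism; the sketch below shows one common way to replace quadratic softmax attention with a normalized, kernelized (linear) attention, purely as an illustration. The feature map, head count, and layer names are assumptions, not the design used in the thesis.

import torch
import torch.nn as nn
import torch.nn.functional as F

def feature_map(x: torch.Tensor) -> torch.Tensor:
    # Positive feature map (elu + 1), a common choice for kernelized attention.
    return F.elu(x) + 1.0

class NormalizedLinearAttention(nn.Module):
    """Attention whose cost is linear in sequence length: instead of the
    (N x N) softmax matrix, queries and keys pass through a positive
    feature map and the output is normalized by the summed key features."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, N, D = x.shape
        H, d = self.num_heads, D // self.num_heads
        q, k, v = self.qkv(x).reshape(B, N, 3, H, d).permute(2, 0, 3, 1, 4)
        q, k = feature_map(q), feature_map(k)                 # (B, H, N, d)
        # Aggregate key-value statistics once: O(N * d^2) instead of O(N^2 * d).
        kv = torch.einsum("bhnd,bhne->bhde", k, v)            # (B, H, d, d)
        z = 1.0 / (torch.einsum("bhnd,bhd->bhn", q, k.sum(dim=2)) + 1e-6)
        out = torch.einsum("bhnd,bhde,bhn->bhne", q, kv, z)   # normalized output
        return self.proj(out.transpose(1, 2).reshape(B, N, D))

Because the key-value statistics are summed once over the sequence, the cost grows linearly with the number of spectrogram patches while long-range interactions are still mediated through the shared kv summary.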
We evaluate the resulting spatial audio representations on both spatial and non-spatial benchmarks, showing that motion-informed pre-training with efficient attention mechanisms can approach the performance of larger, label-intensive baselines while requiring far less explicit supervision. These results suggest that combining spatial inductive biases with computationally efficient latent prediction is a promising path toward unified, data-efficient audio representations that are useful for a broad family of downstream tasks.