MS Defense: Learning Representations for Audio

Talk
Siddhi Patil
Time: 
05.08.2026 14:30 to 16:00
Location: 
IRB 4137

Learning audio representations that support both spatial reasoning and semantic understanding remains challenging: models tailored to spatial audio often overfit to specific array geometries or phase-based features, while large-scale audio encoders typically ignore the structure of the sound field and focus solely on content. Recent self-supervised approaches that predict latent targets in feature space, rather than reconstructing raw signals, offer a promising alternative, but they have yet to fully reconcile these spatial and semantic requirements.
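As a rough illustration of latent-target prediction (a minimal sketch, not the specific pretext task presented in this thesis): a context encoder sees a partially masked spectrogram, and a small predictor regresses its features onto those of an exponential-moving-average target encoder at the masked positions. All module names, dimensions, and the masking scheme below are illustrative placeholders.

import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentPredictionPretext(nn.Module):
    """Predict the target encoder's features for masked patches
    instead of reconstructing the raw spectrogram."""

    def __init__(self, encoder: nn.Module, dim: int, ema_decay: float = 0.999):
        super().__init__()
        self.context_encoder = encoder
        # Target encoder is an EMA copy of the context encoder (no gradients).
        self.target_encoder = copy.deepcopy(encoder)
        for p in self.target_encoder.parameters():
            p.requires_grad_(False)
        self.predictor = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
        self.ema_decay = ema_decay

    @torch.no_grad()
    def update_target(self):
        # Exponential moving average of the context-encoder weights.
        for p_t, p_c in zip(self.target_encoder.parameters(),
                            self.context_encoder.parameters()):
            p_t.mul_(self.ema_decay).add_(p_c, alpha=1.0 - self.ema_decay)

    def forward(self, patches: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # patches: (B, N, D) spectrogram patch embeddings; mask: (B, N) bool,
        # True where a patch is hidden from the context encoder (zeroed here for simplicity).
        with torch.no_grad():
            targets = self.target_encoder(patches)  # features of the full input
        context = self.context_encoder(patches * (~mask).unsqueeze(-1))
        preds = self.predictor(context)
        # Regress predicted features onto target features at masked positions only.
        return F.mse_loss(preds[mask], targets[mask])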
In this work, we analyze the representations learned by self-supervised approaches and design self-supervised pretext tasks directly on unstructured data that bias the model towards representations useful for a broader range of tasks. Operating on unstructured data introduces additional computational overhead compared to using pre-computed features; we mitigate this by replacing standard softmax attention with a normalized attention mechanism, reducing the cost of transformer-based encoders while preserving their ability to model long-range dependencies in spectrogram inputs.
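The abstract does not specify the exact normalized attention mechanism; the sketch below shows one common way to replace quadratic softmax attention with a normalized, kernelized (linear) attention, purely as an illustration. The feature map, head count, and layer names are assumptions, not the design used in the thesis.

import torch
import torch.nn as nn
import torch.nn.functional as F

def feature_map(x: torch.Tensor) -> torch.Tensor:
    # Positive feature map (elu + 1), a common choice for kernelized attention.
    return F.elu(x) + 1.0

class NormalizedLinearAttention(nn.Module):
    """Attention whose cost is linear in sequence length: instead of the
    (N x N) softmax matrix, queries and keys pass through a positive
    feature map and the output is normalized by the summed key features."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, N, D = x.shape
        H, d = self.num_heads, D // self.num_heads
        q, k, v = self.qkv(x).reshape(B, N, 3, H, d).permute(2, 0, 3, 1, 4)
        q, k = feature_map(q), feature_map(k)                 # (B, H, N, d)
        # Aggregate key-value statistics once: O(N * d^2) instead of O(N^2 * d).
        kv = torch.einsum("bhnd,bhne->bhde", k, v)            # (B, H, d, d)
        z = 1.0 / (torch.einsum("bhnd,bhd->bhn", q, k.sum(dim=2)) + 1e-6)
        out = torch.einsum("bhnd,bhde,bhn->bhne", q, kv, z)   # normalized output
        return self.proj(out.transpose(1, 2).reshape(B, N, D))

Because the key-value statistics are summed once over the sequence, the cost grows linearly with the number of spectrogram patches while long-range interactions are still mediated through the shared kv summary.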
We evaluate the resulting spatial audio representations on both spatial and non-spatial benchmarks, showing that motion-informed pre-training with efficient attention mechanisms can approach the performance of larger, label-intensive baselines while requiring far less explicit supervision. These results suggest that combining spatial inductive biases with computationally efficient latent prediction is a promising path toward unified, data-efficient audio representations that are useful for a broad family of downstream tasks.