PhD Proposal: Source Separation and Sound Field Decomposition for Acoustic Scene Analysis
We study two related problems in acoustic scene analysis. Inspired by the human auditory system's ability to focus on a particular sound stimulus within a mixture of other sounds, commonly referred to as the 'cocktail party effect', audio source separation aims to automatically extract individual sound sources from a mixture. Related to source separation is sound field decomposition, the task of decomposing a recorded acoustic scene into components, often with the goal of recovering the spatial sound field. Of the several decomposition approaches, the proposed research will focus on those that model the locations and signals of individual sound sources, as these are the most similar to source separation. Applications of these problems include spatial audio reproduction, virtual and augmented reality, teleconferencing, robotics, speech recognition, and music information retrieval. Depending on the task, input, and desired output domains, many variations of these problems can be formulated and studied. The goal of the proposed research is to explore several of these formulations, either to address open problems in existing work or to investigate variations that have received less attention.
First, we consider separating the sources of a single-channel audio mixture. In this case, little to no spatial information is available, so methods must rely only on spectrotemporal cues and priors. Many deep learning-based approaches have been proposed in recent years for such scenarios, and they have made great strides in improving performance on a variety of separation tasks. However, it remains difficult to extend such systems to separation tasks not seen during training. Moreover, analysis of some failure modes of these systems suggests room for improvement with regard to robustness. We propose to improve existing source separation models along these fronts, drawing inspiration from psychoacoustics and sequential modeling.
On the other side of the problem spectrum, we also consider decomposing an acoustic scene captured by a microphone array. Such a scenario allows methods to use spectral, temporal, and spatial cues to estimate the locations and emitted audio signals of sources, and perhaps even information about the environment. While this problem is well studied using signal processing and estimation theory techniques, it remains challenging in reverberant environments. We will investigate the use of deep learning and acoustic simulation techniques in establishing spatial priors to tackle these challenges.
Dr. Ramani Duraiswami
Dr. Dinesh Manocha
Dr. Matthias Zwicker