Neural networks consist of layers, each of which holds features that activate in response to certain patterns in the input. For image-based tasks, networks have been studied using feature visualization, which produces interpretable images that stimulate the response of each feature map individually. Visualization methods help us understand and interpret what networks “see.” In particular, they elucidate the layer-dependent semantic meaning of features, with shallow features representing edges and deep features representing objects. While this approach has been quite effective for vision models, our understanding of networks for processing auditory inputs, such as automatic speech recognition models, is more limited because their inputs are not visual.

In this article, we seek to understand what automatic speech recognition networks hear. To this end, we consider methods to sonify, rather than visualize, their feature maps. Sonification is the process of using sound (as opposed to visuals) to apprehend complex structures. By listening to audio, we can better understand what networks respond to, and how this response varies with depth and feature index. We hear that shallow features respond strongly to simple sounds and phonemes, while deep features react to complex words and word parts. We also observe that shallow layers preserve speaker-specific nuisance variables like pitch, while deep layers eschew these attributes and focus on the semantic content of speech.
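To make the underlying mechanic concrete: both feature visualization and sonification can be phrased as activation maximization, i.e. synthesizing an input by gradient ascent so that it strongly excites one chosen feature map. The sketch below is a deliberately tiny NumPy toy, not the method used in this article: a single random 1-D convolutional filter stands in for a real ASR feature map, and all sizes, step counts, and norms are illustrative assumptions.

```python
import numpy as np

# Toy activation maximization on a 1-D "waveform" (illustrative only:
# a random conv filter stands in for a real learned ASR feature map).
rng = np.random.default_rng(0)
w = rng.standard_normal(16)          # hypothetical learned filter
x = rng.standard_normal(256) * 0.01  # start from near-silent noise

def activation(x):
    z = np.correlate(x, w, mode="valid")  # 1-D convolution of waveform with filter
    return np.maximum(z, 0.0).mean()      # mean ReLU response of the feature map

a0 = activation(x)
for _ in range(200):
    z = np.correlate(x, w, mode="valid")
    mask = (z > 0).astype(float) / z.size  # derivative of mean-ReLU w.r.t. z
    g = np.convolve(mask, w, mode="full")  # back-propagate through the conv
    x += 0.1 * g                           # gradient-ascent step on the input
    n = np.linalg.norm(x)
    if n > 4.0:                            # keep the synthetic signal bounded
        x *= 4.0 / n
a1 = activation(x)
print(a0, a1)  # the optimized waveform excites the feature far more strongly
```

In a full model, the same loop would be run with automatic differentiation through the entire network, and the optimized waveform `x` would then be played back and listened to rather than displayed.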

Check out the interactive blog post below!

The Sonifier