Seminars in Visual and Auditory Scene Analysis

University of Maryland

All seminars will be held in Biology-Psychology Building 1142, 4:30-5:30, on Mondays.

 

Date | Speaker | Institution | Title
Feb. 11 | Ramani Duraiswami | University of Maryland | Creating Virtual Audio Displays
Feb. 18 | Yiannis Aloimonos | University of Maryland | HAL: Human Activity Language - Introduction to Sensorimotor Linguistics
Feb. 25 | Jianbo Shi | University of Pennsylvania | Visual Thinking with Graph Network
March 3 | Ronen Basri | Weizmann Institute of Science | Algorithmic and Perceptual Aspects of Lighting
March 10 | Mike Lewicki | Carnegie Mellon University | Generalization and Perceptual Organization in Natural Scenes
March 24 | Fei-Fei Li | Princeton University | Telling the Story of a Scene: From Humans to Computers
March 31 | Paul Schrater | University of Minnesota | The Role of Generative Knowledge in Perception and Action
April 7 | Barbara Shinn-Cunningham | Boston University | Going with the flow in a cocktail party
April 14 | Andrew Oxenham | University of Minnesota | Encoding the pitch of single and multiple sounds: Implications for auditory scene analysis
April 21 | Mounya Elhilali | Johns Hopkins University | A cocktail party - with a cortical twist
April 28 | Rob de Ruyter van Steveninck | Indiana University | Motion estimation and natural visual signals
May 5 | Antonio Torralba | Massachusetts Institute of Technology | Object recognition by scene alignment

Creating virtual auditory displays

 

Ramani Duraiswami

Department of Computer Science

University of Maryland

First, the physical mechanisms leading to human perception of the spatial location of a source will be reviewed. Among these are interaural time and level differences, cues due to scattering off the body (encapsulated in the so-called Head-Related Transfer Function), scattering off the environment (encapsulated in a Room Transfer Function), and purposive head motion. I will then review the issues in creating virtual auditory displays, both of single sources and of entire scenes. Capturing sound for auditory display and developing efficient algorithms to create such displays have been a focus of research in my lab, which I will then discuss.
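As a rough illustration of the rendering step only (not of the specific capture and fast-rendering algorithms developed in the speaker's lab): once a head-related impulse response (HRIR) pair has been measured or modeled for a given direction, spatializing a single dry source for headphone playback amounts to convolving the source with the left- and right-ear HRIRs. A minimal Python sketch, with the HRIR arrays assumed to be already loaded:

    import numpy as np

    def render_binaural(source, hrir_left, hrir_right):
        """Spatialize a mono source by convolving it with a left/right HRIR pair.
        Interaural time and level differences emerge from the delays and gains
        already encoded in the two impulse responses."""
        left = np.convolve(source, hrir_left)
        right = np.convolve(source, hrir_right)
        n = max(len(left), len(right))
        out = np.zeros((n, 2))
        out[:len(left), 0] = left
        out[:len(right), 1] = right
        return out                     # columns: left ear, right ear

Room effects can be added in the same spirit by also convolving with a room impulse response, and a full scene is a sum of such renderings, one per source; making this efficient and dynamic (for example, under head motion) is part of what the talk addresses.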

 

HAL: Human Activity Language*
An Introduction to Sensorimotor Linguistics

 

Yiannis Aloimonos

Department of Computer Science

University of Maryland

We propose a linguistic approach to modeling human activity. This approach is able to address several problems related to action interpretation in a single framework. The Human Activity Language (HAL) consists of kinetology, morphology, and syntax. Kinetology, the phonology of human movement, finds basic primitives for human motion (segmentation) and associates them with symbols (symbolization). The input consists of measurements of human movement in 3D (signals), such as those produced by motion capture systems. In this way, kinetology provides a non-arbitrary, grounded symbolic representation for human movement that allows synthesis, analysis, and symbolic manipulation. The morphology of a human action is related to the inference of the essential parts of the movement (morpho-kinetology) and its structure (morpho-syntax). In order to learn the morphemes and their structure, we present a grammatical inference methodology and introduce a parallel learning algorithm to induce a grammar system representing a single action. In practice, morphology is concerned with the construction of a vocabulary of actions, or praxicon. The syntax of human activities involves the construction of sentences using action morphemes. A sentence may range from a single action morpheme (nuclear syntax) to a sequence of sets of morphemes. A single morpheme is decomposed into analogs of lexical categories: nouns, adjectives, verbs, and adverbs. Sets of morphemes represent simultaneous actions (parallel syntax), and a sequence of movements is related to the concatenation of activities (sequential syntax). Nuclear syntax, especially adverbs, is related to the motion interpolation problem, parallel syntax addresses the slicing problem, and sequential syntax is proposed as an alternative approach to the transitioning problem. Consequences of the framework for surveillance, automatic video annotation, humanoid robotics, and cognitive science will be discussed throughout the talk.

   

*: Joint work with Gutemberg Guerra, Alap Karapurkar, Yi Li
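To make the kinetology stage above concrete, here is a deliberately simplified Python sketch that segments a single joint-angle trajectory wherever its velocity changes sign and assigns one coarse symbol per segment. It is only meant to convey the idea of grounding symbols in motion-capture signals; HAL's actual primitives and segmentation criteria are those described in the abstract and the talk.

    import numpy as np

    def segment_and_symbolize(angle, dt=1.0 / 120.0):
        """Toy 'kinetology': cut a joint-angle trajectory at velocity sign
        changes and label segments '+' (rising), '-' (falling), or '0' (still)."""
        velocity = np.gradient(angle, dt)
        signs = np.sign(np.round(velocity, 3))             # -1, 0, or +1 per sample
        boundaries = np.where(np.diff(signs) != 0)[0] + 1  # segment boundaries
        segments = np.split(np.arange(len(angle)), boundaries)
        symbols = ''.join(
            {1: '+', -1: '-'}.get(int(np.median(signs[seg])), '0')
            for seg in segments)
        return segments, symbols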

 

 

Visual Thinking with Graph Network

Jianbo Shi
Computer and Information Science
University of Pennsylvania

Many visual perception tasks are fundamentally NP-hard computational problems. Solving these problems robustly requires thinking through combinatorially many hypotheses. Despite this, our human visual system performs these tasks effortlessly. How is this done? I would like to make two points on this topic. First, formulating visual thinking as NP-hard computational tasks has an important advantage: visual routines can be analyzed precisely to identify their behaviors independently of their implementations. Second, I will show that there is a class of graph optimization problems which can be implemented using a distributed network system with a physical (and plausibly biological) interpretation.

I will demonstrate this graph-based approach for: 1) image segmentation using Normalized Cuts, with explanations for illusory contours, visual pop-out, and attention; 2) salient contour grouping using Untangling Cycles; and 3) contour context selection for shape detection.

This is joint work with Stella Yu and Qihui Zhu.
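For reference, the Normalized Cuts criterion of Shi and Malik scores a two-way partition (A, B) of a weighted graph as cut(A,B)/assoc(A,V) + cut(A,B)/assoc(B,V), and a relaxed solution is read off the second-smallest generalized eigenvector of (D - W) y = lambda D y, where W is the pairwise affinity matrix and D its diagonal degree matrix. A minimal Python sketch of that spectral relaxation (illustrative only; the distributed-network implementation, untangling cycles, and contour context selection are what the talk covers):

    import numpy as np
    from scipy.linalg import eigh

    def ncut_bipartition(W):
        """Two-way Normalized Cut via the spectral relaxation.
        W: symmetric, non-negative affinity matrix with no isolated nodes.
        Returns a boolean array assigning each node to one side of the cut."""
        d = W.sum(axis=1)
        D = np.diag(d)
        L = D - W                                  # graph Laplacian
        _, eigvecs = eigh(L, D)                    # solves (D - W) y = lambda D y
        y = eigvecs[:, 1]                          # second-smallest eigenvector
        return y > np.median(y)                    # threshold at the median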

 

Algorithmic and Perceptual Aspects of Lighting
Ronen Basri
Weizmann Institute of Science
Toyota Technological Institute

Variations in lighting can significantly affect the appearance of objects. Understanding lighting is important for addressing problems that require invariance to lighting (e.g., object recognition). Moreover, lighting provides a strong cue from which the 3D shape of objects can be inferred. Modeling lighting can therefore lead to algorithms for shape recovery that can handle objects with smooth, textureless surfaces, which are difficult for other methods. In this talk I will present methods to (1) model the effect of complex lighting on Lambertian surfaces by using spherical harmonic approximations, (2) introduce prior knowledge into algorithms for face reconstruction from single gray-level and two-tone images, and (3) recover the 3D shape of moving objects while avoiding the common brightness constancy assumption.
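As background for item (1), the key published result (Basri and Jacobs) is that for a convex Lambertian object under arbitrary distant lighting, the image is well approximated by a second-order spherical harmonic expansion, so the images of the object lie close to a nine-dimensional linear subspace spanned by "harmonic images". In rough LaTeX notation:

    I(p) \;\approx\; \sum_{n=0}^{2} \sum_{m=-n}^{n} \ell_{nm}\, b_{nm}(p),
    \qquad b_{nm}(p) = k_n\, \rho(p)\, Y_{nm}\!\big(\mathbf{n}(p)\big)

Here \rho(p) is the albedo at pixel p, \mathbf{n}(p) the surface normal, Y_{nm} the spherical harmonics, \ell_{nm} the harmonic coefficients of the lighting, and k_n the analytically known coefficients of the Lambertian (clamped-cosine) kernel, which decay rapidly with n; this is why the nine terms with n <= 2 capture most of the reflected energy for any lighting condition.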

 

Generalization and perceptual organization in natural scenes 

 

Mike Lewicki

Computer Science Department and Center for the Neural Basis of Cognition
Carnegie Mellon University

 

Traditional approaches to visual perceptual organization such as Gestalt "laws" pertain largely to the aggregation of primitive visual features. Although these principles can explain the perceptual grouping of features like bars and circles, they provide relatively little insight into how the visual system organizes information in a natural scene. In this talk, I will discuss an alternative approach based on the idea of forming invariant representations for local regions of the visual scene. I will present a model in which higher-level neurons encode probability distributions over their inputs to form stable representations, even across complex patterns of variation. Trained on natural images, the model learns a compact set of visual codes that describe image distributions typically encountered in local regions of natural scenes. I will show that neurons in the model account for a wide range of non-linear effects observed in complex cells and neurons in higher visual areas such as V2 and V4. These results provide the first functional explanation of these response properties and offer insight into the computational problems the visual system must solve to organize the complex visual information in natural scenes.

This is joint work with Yan Karklin.


 

Telling the Story of a Scene: From Humans to Computers

 

Fei-Fei Li

Department of Computer Science

Princeton University


For both humans and machines, the ability to learn and recognize the semantically meaningful contents of the visual world is an essential functionality. In this talk, we will examine the topic of natural scene categorization and recognition in human psychophysical and physiological experiments as well as in computer vision modeling. I will first present a series of recent human psychophysics studies on natural scene recognition. All of these experiments converge on one prominent phenomenon of the human visual system: humans are extremely efficient and rapid at capturing the semantic contents of real-world images. Inspired by these behavioral results, we report a recent fMRI experiment that classifies different types of natural scenes (e.g., beach vs. building vs. forest) based on distributed fMRI activity. This is achieved by using a number of pattern recognition algorithms to capture the multivariate nature of the complex fMRI data.

In the second half of the talk, we begin with a generative Bayesian hierarchical model that learns to categorize natural images in a weakly supervised fashion. We represent an image by a collection of local regions, denoted as codewords obtained by unsupervised clustering. Each region is then represented as part of a "theme". In previous work, such themes were learned from hand-annotations by experts, whereas our method learns the theme distributions as well as the codeword distributions over themes without such supervision. We report excellent categorization performance on a large set of 13 categories of complex scenes. If time permits, we will show a series of recent works in our lab toward holistic and integrative scene understanding.
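The "codewords obtained by unsupervised clustering" step can be sketched generically in Python: cluster local patch descriptors into a visual vocabulary and describe each image by its histogram over that vocabulary. The sketch below uses scikit-learn's k-means purely as an illustrative stand-in; the hierarchical Bayesian theme model described in the abstract is then learned on top of such codeword representations.

    import numpy as np
    from sklearn.cluster import KMeans

    def build_codebook(all_descriptors, n_codewords=200, seed=0):
        """Cluster local patch descriptors (one per row) into visual codewords."""
        return KMeans(n_clusters=n_codewords, n_init=10, random_state=seed).fit(all_descriptors)

    def image_histogram(image_descriptors, codebook):
        """Represent one image as a normalized histogram of codeword counts."""
        words = codebook.predict(image_descriptors)
        counts = np.bincount(words, minlength=codebook.n_clusters).astype(float)
        return counts / counts.sum()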

The Role of Generative Knowledge in Perception and Action 

 

Paul Schrater

Departments of Psychology and Computer Science

University of Minnesota 

 

 

Generative knowledge denotes the brain's understanding of the causal relationships between scene variables, their typical distributions, and the way scene variables are mapped to sensory data. Generative knowledge of our motor behavior is critical for controlling our actions and predicting their outcomes. Generative knowledge in perception may underlie its high level of functionality despite complex and ambiguous sensory input. For example, typical images can consist of dozens or hundreds of objects, many of which are overlapping. Similar 3D objects can produce many different images, and different objects can produce similar images. It is well-established that local visual features such as sharp intensity changes are ambiguous; in the absence of context, it is difficult for an algorithm or a human to determine whether an intensity change in a small region of a natural image reflects a significant object property or irrelevant clutter. In contrast, high-level, everyday human vision is rarely ambiguous, and ambiguities are often easy to resolve when they do exist. How is this done?

 

Generative knowledge coupled with Bayesian statistical inference provides a coherent theoretical framework for explaining how human perception uses contextual information about scenes and objects to resolve ambiguity. I will present an overview of generative models in human vision and motor control, and describe specific experiments that suggest: 1) humans use knowledge of causal relationships between variables to incorporate auxiliary information in perceptual decisions; 2) humans can infer the causes of scene properties; and 3) humans can use generative knowledge to decide when to collect more perceptual information and when to act without it.
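A textbook illustration of the Bayesian machinery behind such claims (a generic example, not one of the experiments in the talk) is reliability-weighted cue combination: with independent Gaussian cues to the same scene variable and a flat prior, the posterior mean weights each cue by its inverse variance. A minimal Python sketch:

    import numpy as np

    def combine_gaussian_cues(means, variances):
        """Posterior mean and variance for independent Gaussian cues
        about the same quantity, assuming a flat prior."""
        means = np.asarray(means, dtype=float)
        precisions = 1.0 / np.asarray(variances, dtype=float)
        posterior_var = 1.0 / precisions.sum()
        posterior_mean = posterior_var * (precisions * means).sum()
        return posterior_mean, posterior_var

    # A reliable cue (variance 1) pulls the estimate harder than a noisy one (variance 4):
    print(combine_gaussian_cues([10.0, 14.0], [1.0, 4.0]))    # -> (10.8, 0.8)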

 

 

  Going with the flow in a cocktail party

 

Barbara Shinn-Cunningham

Department of Cognitive and Neural Systems

Boston University

 

Imagine yourself in a circle of friends at a cocktail party. A cacophonous mixture of multiple talkers reaches your ears, yet you are able to focus attention on the talker of interest and extract what they are saying. Moreover, which talker is the "target" changes unexpectedly and unpredictably, as the banter jumps from talker to talker (from one auditory object to another). How do we make sense of this kind of complex auditory scene? First, we will review how we focus attention on one object in a mixture of similar objects, along with evidence that one (and only one) object is really the focus of attention at any given moment. Then we will consider how, despite the costs of switching attention from one auditory object to another, we are able to follow a conversation in an unpredictable social setting. Finally, we will consider how even modest hearing loss can interfere with these abilities, making it difficult to participate fully in normal social scenes.

Encoding the pitch of single and multiple sounds: Implications for auditory scene analysis

 

Andrew J. Oxenham

Department of Psychology

University of Minnesota

 

Many sounds in our environment, including voiced speech, music and a number of animal vocalizations, belong to the class of harmonic sounds, meaning that they are (quasi-) periodic.  The percept most strongly associated with harmonic sounds is pitch, which correlates strongly with the fundamental frequency (F0) of the sound.  In speech, pitch conveys prosodic and, in some languages, lexical information, and helps in identifying different talkers.  In music, pitch provides the basis of melody and harmony.  Most importantly for this talk, pitch is also thought to play a role in our ability to perceptually segregate competing simultaneous and sequential sounds. We will review some basic aspects of pitch coding of single and multiple sounds, and will discuss some new research aimed at elucidating the link between pitch perception and our ability to segregate competing talkers.
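As a purely signal-processing illustration of the F0/pitch relationship mentioned above (not a model of auditory pitch coding), a quasi-periodic sound's F0 can be estimated from the first strong peak of its autocorrelation; a minimal Python sketch:

    import numpy as np

    def estimate_f0(signal, sample_rate, fmin=50.0, fmax=500.0):
        """Crude autocorrelation-based F0 estimate for a quasi-periodic signal.
        Assumes the signal spans at least a few periods of the lowest F0 considered."""
        signal = signal - signal.mean()
        ac = np.correlate(signal, signal, mode='full')[len(signal) - 1:]
        lag_min = int(sample_rate / fmax)          # shortest candidate period
        lag_max = int(sample_rate / fmin)          # longest candidate period
        best_lag = lag_min + np.argmax(ac[lag_min:lag_max])
        return sample_rate / best_lag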

A cocktail party - with a cortical twist

 

Mounya Elhilali

Department of Electrical and Computer Engineering

Johns Hopkins University

 

 

The perceptual organization of sounds in the environment into coherent objects is a feat the auditory system performs constantly. It manifests itself in the everyday challenge, faced by humans and animals alike, of parsing complex acoustic information arising from multiple sound sources into separate auditory streams. While seemingly effortless, this remarkable ability rests on neural mechanisms and computational principles that remain a challenge for both the biological and mathematical communities to uncover. In this talk, I discuss how this perceptual ability may emerge as a consequence of a multi-scale spectro-temporal analysis of sound in the auditory cortex, which is thought to play a role in the perceptual ordering of acoustic events. In addition, I present recent findings of adaptive neuronal responses in the auditory cortex, which are likely to play a key role in adapting the neural representation to reflect both the sensory content and the changing behavioral context of complex acoustic scenes. Guided by these physiological results, I shall present a computational approach to the dynamic segregation of auditory streams, based on unsupervised clustering and the statistical theory of Kalman filtering.
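For reference, the Kalman filter mentioned at the end is the standard recursive estimator for a linear-Gaussian dynamical system; each time step runs a predict/update cycle like the generic Python sketch below (the specific state variables and clustering used for stream segregation are described in the talk).

    import numpy as np

    def kalman_step(x, P, z, F, H, Q, R):
        """One predict/update cycle of a linear Kalman filter.
        x, P : previous state estimate and covariance
        z    : new measurement
        F, H : state-transition and observation matrices
        Q, R : process and measurement noise covariances"""
        # predict
        x_pred = F @ x
        P_pred = F @ P @ F.T + Q
        # update
        S = H @ P_pred @ H.T + R                    # innovation covariance
        K = P_pred @ H.T @ np.linalg.inv(S)         # Kalman gain
        x_new = x_pred + K @ (z - H @ x_pred)
        P_new = (np.eye(len(x)) - K @ H) @ P_pred
        return x_new, P_new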

 

 

 

Motion estimation and natural visual signals

 

Rob de Ruyter van Steveninck

Department of Physics

Indiana University Bloomington

 

Sensory information processing can be seen as a statistical estimation problem, in which relevant features are extracted from a raw stream of sensory input. Those features are present in an imperfect and implicit form, and the optimal solution to the feature extraction problem depends on the statistics of the input signals. Here we study the properties of natural visual input signals in relation to the problem of visual motion estimation.

 

Many animals use vision to estimate their motion through space, which makes the problem biologically highly relevant. In the 1950s, Reichardt and Hassenstein formulated an explicit model for motion estimation based on insect behavioral experiments. Their observations revealed certain specific biases in animal motion responses, which were captured by their correlation model of motion detection. Later, an alternative visual motion estimation model was put forward by Limb and Murphy. Their proposal, also known as the ratio-of-gradients model, does not show those characteristic biases.
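For concreteness, the two classical estimators can be written down in a toy one-dimensional form (illustrative only, not the models as implemented in the fly visual system or in the sampling study below): the Hassenstein-Reichardt correlator multiplies a delayed copy of one photoreceptor signal with its neighbor's undelayed signal and subtracts the mirror-symmetric product, while the Limb-Murphy scheme estimates velocity as minus the ratio of the temporal to the spatial intensity derivative.

    import numpy as np

    def reichardt_output(s_left, s_right, delay):
        """Hassenstein-Reichardt correlator for two neighboring photoreceptor
        signals (1-D arrays); positive output indicates left-to-right motion.
        The delay is implemented as a circular shift for simplicity."""
        return np.roll(s_left, delay) * s_right - np.roll(s_right, delay) * s_left

    def gradient_velocity(frame_t0, frame_t1, dt=1.0, dx=1.0, eps=1e-6):
        """Limb-Murphy (ratio-of-gradients) estimate from two frames of a 1-D
        image: v ~ -(dI/dt) / (dI/dx); eps avoids division by zero."""
        dI_dt = (frame_t1 - frame_t0) / dt
        dI_dx = np.gradient(frame_t0, dx)
        return -dI_dt / (dI_dx + eps)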

 

Here I will approach motion estimation from two angles: animal experiments and statistical sampling of natural signals. First, we will look at motion estimation in the visual system of the blowfly, with an emphasis on performance under natural conditions. As noted above, the array of photoreceptors in the retina implicitly contains data on self-motion. However, this relation is noisy, indirect, and ambiguous due to photon shot noise and optical blurring. Further, natural variations in the visual signal-to-noise ratio are enormous, and nonlinear operations are especially susceptible to noise. One can therefore reasonably hope that animals have evolved interesting optimization strategies to deal with large variations in signal quality. Experimental data from motion-sensitive neurons in the fly brain illustrate some of these solutions. We will then look at the results of a study in which we derive optimal visual motion estimators from direct sampling of the relevant natural signals. A comparison of the two approaches suggests that the fly comes close to optimal estimation strategies, and that these solutions in fact interpolate between the two motion estimation models mentioned above.

 

 

Object recognition by scene alignment

 

Antonio Torralba

Department of Electrical Engineering and Computer Science

Massachusetts Institute of Technology 

 

 

Object detection and recognition is generally posed as a matching problem between the object representation and the image features (e.g., aligning pictorial cues, shape correspondences, constellations of parts), while rejecting background features using an outlier process. In this work, we take a different approach: we formulate object detection as a problem of aligning elements of the entire scene. The background, instead of being treated as a set of outliers, is used to guide the detection process. Our approach relies on the observation that with a large enough database we can find, with high probability, images very close to a query image: similar scenes containing similar objects arranged in similar spatial configurations. If the images in the retrieval set are partially labeled, then we can transfer that labeling knowledge to the query image, and the problem of object recognition becomes a problem of aligning scene regions. But can we find a dataset large enough to cover a large number of scene configurations? Given an input image, how do we find a good retrieval set, and, finally, how do we transfer the labels to the input image? We will use two datasets: 1) the LabelMe dataset, which contains more than 10,000 labeled images with over 180,000 annotated objects, and 2) the tiny images dataset, a weakly labeled collection of more than 79,000,000 images. We use these databases to perform object and scene classification, examining performance over a range of semantic levels.

 

Work in collaboration with Rob Fergus, Bryan Russell, Ce Liu and William T. Freeman
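A minimal sketch of the retrieve-then-transfer idea in Python, assuming each scene has already been summarized by a global descriptor (something GIST-like) and that database images carry lists of object labels; the actual descriptors, matching, and alignment of scene regions are the subject of the talk:

    import numpy as np
    from collections import Counter

    def retrieve_neighbors(query_desc, database_descs, k=32):
        """Indices of the k database scenes whose descriptors are closest to the query."""
        dists = np.linalg.norm(database_descs - query_desc, axis=1)
        return np.argsort(dists)[:k]

    def transfer_labels(neighbor_indices, database_labels, top_n=5):
        """Vote over the object labels of the retrieved scenes; the most common
        labels become candidate objects for the query image."""
        votes = Counter()
        for i in neighbor_indices:
            votes.update(database_labels[i])   # database_labels[i]: list of object names
        return [label for label, _ in votes.most_common(top_n)]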