Date | Speaker | Institution | Title
Feb. 11 | Ramani Duraiswami | University of Maryland | Creating Virtual Audio Displays |
Feb. 18 | Yiannis Aloimonos | University of Maryland | HAL: Human Activity Language - Introduction to Sensorimotor Linguistics
Feb. 25 | Jianbo Shi | University of Pennsylvania | Visual Thinking with Graph Network |
March 3 | Ronen Basri | Weizmann Institute of Science | Algorithmic and Perceptual Aspects of Lighting |
March 10 | Mike Lewicki | Carnegie Mellon University | Generalization and Perceptual Organization in Natural Scenes |
March 24 | Fei-Fei Li | Princeton University | Telling the Story of a Scene: From Humans to Computers
March 31 | Paul Schrater | University of Minnesota | The Role of Generative Knowledge in Perception and Action |
April 7 | Barbara Shinn-Cunningham | Boston University | Going with the flow in a cocktail party
April 14 | Andrew Oxenham | University of Minnesota | Encoding the pitch of single and multiple sounds: Implications for auditory scene analysis |
April 21 | Mounya Elhilali | Johns Hopkins University | A cocktail party - with a cortical twist |
April 28 | Rob de Ruyter van Steveninck | Indiana University | Motion estimation and natural visual signals |
May 5 | Antonio Torralba | Massachusetts Institute of Technology | Object recognition by scene alignment |
Creating virtual auditory displays
Ramani Duraiswami
Department of Computer Science
An Introduction to Sensorimotor Linguistics
Yiannis Aloimonos
Department of Computer Science
We propose a linguistic approach to model human activity. This approach is able to
address several problems related to action interpretation in a single framework.
The Human Activity Language (HAL) consists of kinetology,
morphology, and syntax. Kinetology, the phonology of human movement, finds
basic primitives for human motion (segmentation) and associates them with
symbols (symbolization). The input is measurements of human movement in 3D
(signals), produced for example by motion capture systems. This way,
kinetology provides a non-arbitrary grounded symbolic representation for human
movement that allows synthesis, analysis, and symbolic manipulation. The
morphology of a human action is related to the inference of essential parts of
the movement (morpho-kinetology) and its structure (morpho-syntax). In order to
learn the morphemes and their structure, we present a grammatical inference
methodology and introduce a parallel learning algorithm to induce a grammar
system representing a single action. In practice, morphology is concerned with
the construction of a vocabulary of actions or a praxicon.
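As a toy illustration of segmentation and symbolization (the specific criterion here, cutting a joint-angle trajectory at velocity sign changes, is a simplification for illustration, not HAL's actual kinetology):

```python
def segment_and_symbolize(angles):
    """Cut a sampled joint-angle trajectory at velocity sign changes
    (segmentation) and label each monotone segment by the sign of its
    mean velocity (symbolization)."""
    velocity = [b - a for a, b in zip(angles, angles[1:])]
    segments, start = [], 0
    for i in range(1, len(velocity)):
        if velocity[i] * velocity[i - 1] < 0:   # velocity changes sign
            segments.append((start, i))
            start = i
    segments.append((start, len(velocity)))
    symbols = []
    for s, e in segments:
        mean_v = sum(velocity[s:e]) / (e - s)
        symbols.append('+' if mean_v > 0 else '-' if mean_v < 0 else '0')
    return segments, symbols
```

Applied to a reach-and-return movement, this produces a short symbol string (e.g. '+', '-', '+') that can then be manipulated symbolically, in the spirit of a grounded motion alphabet.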
The syntax of human activities involves the construction of
sentences using action morphemes. A sentence may range from a single action
morpheme (nuclear syntax) to a sequence of sets of morphemes. A single morpheme
is decomposed into analogs of lexical categories: nouns, adjectives, verbs, and
adverbs. The sets of morphemes represent simultaneous actions (parallel syntax)
and a sequence of movements is related to the concatenation of activities
(sequential syntax). Nuclear syntax, especially adverbs, is related to the
motion interpolation problem, parallel syntax addresses the slicing problem, and
sequential syntax is proposed as an alternative method to the transitioning
problem. Consequences of the framework to surveillance, automatic video
annotation, humanoid robotics and Cognitive Science will be discussed throughout
the talk.
*: Joint work with Gutemberg Guerra, Alap Karapurkar, Yi Li
Visual Thinking with Graph Network
Jianbo Shi
Computer and Information Science
University of Pennsylvania
Many visual perception tasks are fundamentally NP-hard
computational problems. Solving these problems robustly requires thinking
through combinatorially many hypotheses. Despite this, our human
visual system performs these tasks effortlessly. How is this
done? I would like to make two points on this topic.
First, formulating visual thinking as NP-hard computation tasks has an
important advantage: visual routines can be analyzed precisely to
identify their behaviors independently of their implementations.
Second, I will show there is a class of graph optimization problems which
can be implemented using a distributed network system with a physical
(and biologically plausible) interpretation.
I will demonstrate this graph-based approach for: 1)
image segmentation using Normalized Cuts with explanations for
illusory contours, visual pop out and attention; 2) salient contour
grouping using Untangling Cycle; and 3) contour context selection for
shape detection.
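The graph optimization of point 1 can be made concrete with a minimal sketch of the two-way Normalized Cut: threshold the second-smallest eigenvector of the normalized graph Laplacian (the Shi-Malik spectral relaxation; the toy affinity matrix below is illustrative):

```python
import numpy as np

def normalized_cut(W):
    """Two-way Normalized Cut: threshold the second-smallest eigenvector
    of the symmetrically normalized graph Laplacian."""
    d = W.sum(axis=1)
    d_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    laplacian = np.diag(d) - W                    # unnormalized Laplacian
    l_sym = d_inv_sqrt @ laplacian @ d_inv_sqrt   # normalized Laplacian
    _, vecs = np.linalg.eigh(l_sym)               # eigenvalues ascending
    y = d_inv_sqrt @ vecs[:, 1]                   # relaxed indicator vector
    return y > np.median(y)                       # boolean partition labels

# Toy affinity: two tightly coupled pairs, weakly linked to each other.
W = np.array([[0.0, 1.0, 0.01, 0.01],
              [1.0, 0.0, 0.01, 0.01],
              [0.01, 0.01, 0.0, 1.0],
              [0.01, 0.01, 1.0, 0.0]])
labels = normalized_cut(W)
```

The relaxation turns an NP-hard discrete cut into a generalized eigenproblem, which is exactly the kind of distributed, physically implementable computation the talk argues for.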
This is joint work with Stella Yu and Qihui Zhu.
Algorithmic and Perceptual Aspects of Lighting
Ronen Basri
Weizmann Institute of Science
Toyota Technological Institute
Variations in lighting can significantly affect the appearance of objects. Understanding lighting is important in order to address problems that require invariance to lighting (e.g., object recognition). Moreover, lighting provides a strong cue from which the 3D shape of objects can be inferred. Modeling lighting can therefore lead to algorithms for shape recovery that can handle objects with smooth, texture-less surfaces that are difficult to handle with other methods. In this talk I will present methods to (1) model the effect of complex lighting on Lambertian surfaces by using spherical harmonic approximations, (2) introduce prior knowledge into algorithms for face reconstruction from single gray-level and two-tone images, and (3) recover the 3D shape of moving objects while avoiding the common brightness constancy assumption.
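The spherical-harmonic result behind point (1), that images of a convex Lambertian object under arbitrary distant lighting lie close to a 9-dimensional linear subspace, can be checked numerically with a sketch like the following (a rendered sphere under random directional lights; not the talk's code):

```python
import numpy as np

rng = np.random.default_rng(0)

# Unit surface normals sampled over the camera-facing hemisphere.
n_pix = 500
v = rng.normal(size=(n_pix, 3))
normals = v / np.linalg.norm(v, axis=1, keepdims=True)
normals[:, 2] = np.abs(normals[:, 2])

# Render under many random distant directional lights:
# Lambertian shading is max(0, n . l) per pixel and light.
n_lights = 200
lights = rng.normal(size=(n_lights, 3))
lights /= np.linalg.norm(lights, axis=1, keepdims=True)
images = np.maximum(normals @ lights.T, 0.0)   # shape: (n_pix, n_lights)

# Fraction of total image-set energy captured by the top 9 components.
s = np.linalg.svd(images, compute_uv=False)
energy9 = (s[:9] ** 2).sum() / (s ** 2).sum()
```

The top nine principal components capture nearly all of the energy, which is what makes low-dimensional lighting models practical for recognition and shape recovery.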
Generalization and perceptual organization in natural scenes
Mike Lewicki
Computer Science Department and Center for the Neural Basis of Cognition
Carnegie Mellon University
Traditional approaches to visual perceptual organization such as Gestalt "laws" pertain largely to the aggregation of primitive visual features. Although these principles can explain the perceptual grouping of features like bars and circles, they provide relatively little insight into how the visual system organizes information in a natural scene. In this talk, I will discuss an alternative approach based on the idea of forming invariant representations for local regions of the visual scene. I will present a model in which higher-level neurons encode probability distributions over their inputs to form stable representations, even across complex patterns of variation. Trained on natural images, the model learns a compact set of visual codes that describe image distributions typically encountered in local regions of natural scenes. I will show that neurons in the model account for a wide range of non-linear effects observed in complex cells and neurons in higher visual areas such as V2 and V4. These results provide the first functional explanation of these response properties and offer insight into the computational problems the visual system must solve to organize the complex visual information in natural scenes.
This is joint work with Yan Karklin.
Telling the Story of a Scene: From Humans to Computers
Fei-Fei Li
Department of Computer Science
For both humans and machines, the ability to learn and
recognize the semantically meaningful contents of the visual world is an
essential and important functionality. In this talk, we will examine the topic
of natural scene categorization and recognition in human psychophysical and
physiological experiments as well as in computer vision modeling. I will first
present a series of recent human psychophysics studies on natural scene
recognition. All these experiments converge to one prominent phenomenon of the
human visual system: humans are extremely efficient and rapid in capturing the
semantic contents of the real-world images. Inspired by these behavioral
results, we report a recent fMRI experiment that classifies different types of
natural scenes (e.g. beach vs. building vs. forest, etc.) based on the
distributed fMRI activity. This is achieved by utilizing a number of pattern
recognition algorithms in order to capture the multivariate nature of the
complex fMRI data. In the second half of the talk, we begin with a generative
Bayesian hierarchical model that learns to categorize natural images in a weakly
supervised fashion. We represent an image by a collection of local regions,
denoted as codewords obtained by unsupervised clustering. Each region is then
represented as part of a `theme'. In previous work, such themes were learnt from
hand-annotations of experts, while our method learns the theme distribution as
well as the codewords distribution over the themes without such supervision. We
report excellent categorization performance on a large set of 13 categories of
complex scenes. If time permits, we will show a series of recent works in our
lab toward the holistic and integrative analysis of scene understanding.
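The codeword stage of the model can be sketched roughly as follows: cluster local descriptors into a visual vocabulary with k-means, then represent each image as a histogram of codeword assignments (the Bayesian "theme" hierarchy sits on top of such histograms and is not shown here):

```python
import numpy as np

def kmeans(X, k, iters=20):
    """Plain k-means with a simple deterministic spread initialization."""
    centers = X[np.linspace(0, len(X) - 1, k).astype(int)].copy()
    for _ in range(iters):
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = dists.argmin(1)            # nearest-center assignment
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(0)
    return centers

def bag_of_words(descriptors, centers):
    """Normalized histogram of nearest-codeword assignments."""
    dists = ((descriptors[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    hist = np.bincount(dists.argmin(1), minlength=len(centers)).astype(float)
    return hist / hist.sum()
```

The key point of the abstract is that both the codewords and the theme distributions are learned without hand annotation; this sketch covers only the unsupervised codeword half.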
The Role of Generative Knowledge in Perception and Action
Paul Schrater
Departments of Psychology and Computer Science
Generative knowledge denotes the brain's understanding of the causal relationships between scene variables, their typical distributions, and the way scene variables are mapped to sensory data. Generative knowledge of our motor behavior is critical for controlling our actions and predicting their outcomes. Generative knowledge in perception may underlie the high level of functionality despite complex and ambiguous sensory input. For example, typical images can consist of dozens or hundreds of objects, many of which are overlapping. Similar 3D object(s) can result in many different images, and different objects can result in similar images. It is well-established that local visual features such as sharp intensity changes are ambiguous; in the absence of context, it is difficult for an algorithm or human to determine whether an intensity change in a small region of a natural image reflects a significant object property or irrelevant clutter. In contrast, high-level, everyday human vision is rarely ambiguous, and ambiguities are often easy to resolve when they do exist. How is it done?
Generative knowledge coupled with Bayesian statistical inference provides a coherent theoretical framework for explaining how human perception uses contextual information about scenes and objects to resolve ambiguity.
I will present an overview of generative models in human vision and motor control, and describe specific experiments that suggest:
1) Humans use knowledge of causal relationships between variables to incorporate auxiliary information in perceptual decisions;
2) Humans can infer the causes of scene properties;
3) Humans can use generative knowledge to make decisions about when to collect more perceptual information, and when to act without it.
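A minimal worked instance of point 1 is Gaussian cue combination, where generative knowledge of each cue's noise level dictates precision-weighted averaging (a textbook sketch, not an experiment from the talk):

```python
def combine_cues(mu1, var1, mu2, var2):
    """Posterior mean and variance for a scene variable given two
    independent Gaussian cues and a flat prior: precision weighting."""
    w1, w2 = 1.0 / var1, 1.0 / var2     # precisions (inverse variances)
    mean = (w1 * mu1 + w2 * mu2) / (w1 + w2)
    return mean, 1.0 / (w1 + w2)
```

The less reliable cue is automatically down-weighted: with equal variances the posterior mean sits midway between the cues, while a cue with huge variance contributes almost nothing.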
Going with the flow in a cocktail party
Barbara Shinn-Cunningham
Department of Cognitive and Neural Systems
Imagine yourself in a circle of friends at a cocktail party. A cacophonous mixture of multiple talkers reaches your ears, yet you are able to focus attention on the talker of interest and extract what they are saying. Moreover, which talker is the "target" changes unexpectedly and unpredictably, as the banter jumps from talker to talker (from one auditory object to another). How do we make sense of this kind of complex auditory scene? First we will review how we focus attention on one object in a mixture of similar objects, along with evidence that one (and only one) object is really the focus of attention at any one moment. Then we will consider how, despite the fact that there are costs of switching attention from one auditory object to another, we are able to follow a conversation in an unpredictable social setting. Finally, we will consider how even modest hearing loss can interfere with these abilities, making it difficult to participate fully in normal social scenes.
Encoding the pitch of single and multiple sounds: Implications for auditory scene analysis
Andrew J. Oxenham
Department of Psychology
Many sounds in our environment, including voiced speech, music and a number of animal vocalizations, belong to the class of harmonic sounds, meaning that they are (quasi-) periodic. The percept most strongly associated with harmonic sounds is pitch, which correlates strongly with the fundamental frequency (F0) of the sound. In speech, pitch conveys prosodic and, in some languages, lexical information, and helps in identifying different talkers. In music, pitch provides the basis of melody and harmony. Most importantly for this talk, pitch is also thought to play a role in our ability to perceptually segregate competing simultaneous and sequential sounds. We will review some basic aspects of pitch coding of single and multiple sounds, and will discuss some new research aimed at elucidating the link between pitch perception and our ability to segregate competing talkers.
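The F0-pitch link can be illustrated with the simplest autocorrelation pitch estimator (the search range and parameters below are illustrative; models of human pitch coding are far richer than this):

```python
import numpy as np

def estimate_f0(signal, sample_rate, fmin=50.0, fmax=500.0):
    """Estimate F0 as the lag (within [1/fmax, 1/fmin]) that maximizes
    the signal's autocorrelation."""
    ac = np.correlate(signal, signal, mode='full')[len(signal) - 1:]
    lo = int(sample_rate / fmax)          # shortest candidate period
    hi = int(sample_rate / fmin)          # longest candidate period
    lag = lo + int(np.argmax(ac[lo:hi]))
    return sample_rate / lag
```

For a harmonic complex, the autocorrelation peaks at the common period of the partials, which is why pitch tracks the fundamental frequency even when several harmonics are present.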
A cocktail party - with a cortical twist
Mounya Elhilali
Department of Electrical and Computer Engineering
The perceptual organization of sounds in the environment into coherent objects is a feat constantly facing the auditory system. It manifests itself in the everyday challenge to humans and animals alike to parse complex acoustic information arising from multiple sound sources into separate auditory streams. While the ability seems effortless, uncovering the neural mechanisms and computational principles underlying it remains a challenge facing both the biological and mathematical communities. In this talk, I discuss how this perceptual ability of the auditory system may emerge as a consequence of a multi-scale spectro-temporal analysis of sound in the auditory cortex, which is thought to play a role in the perceptual ordering of acoustic events. In addition, I present recent findings of adaptive neuronal responses in the auditory cortex, which are likely to play a key role in adapting the neural representation to reflect both the sensory content and the changing behavioral context of complex acoustic scenes. Guided by these physiological results, I shall present a computational approach to the dynamic segregation of auditory streams, based on unsupervised clustering and the statistical theory of Kalman filtering.
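The Kalman-filtering ingredient can be sketched in its simplest scalar form: tracking one stream's slowly varying feature (say, its pitch) from noisy frame-by-frame observations. The clustering stage that assigns observations to streams is not shown, and the parameters are illustrative:

```python
def kalman_track(observations, q=0.01, r=1.0):
    """Scalar random-walk Kalman filter: q is the process-noise variance,
    r the observation-noise variance. Returns the filtered estimates."""
    x, p = observations[0], 1.0          # initialize on the first frame
    estimates = [x]
    for z in observations[1:]:
        p = p + q                        # predict step (random walk)
        k = p / (p + r)                  # Kalman gain
        x = x + k * (z - x)              # correct with the new observation
        p = (1.0 - k) * p                # posterior variance
        estimates.append(x)
    return estimates
```

The filter's prediction also supplies what a streaming model needs: an expectation for the next frame against which incoming acoustic evidence can be matched to a stream.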
Motion estimation and natural visual signals
Rob de Ruyter van Steveninck
Department of Physics
Sensory information processing can be seen as a statistical estimation problem, in which relevant features are extracted from a raw stream of sensory input. Those features are present in an imperfect and implicit form, and the optimal solution to the feature extraction problem depends on the statistics of the input signals. Here we study the properties of natural visual input signals in relation to the problem of visual motion estimation.
Many animals use vision to estimate their motion through space, which makes the problem biologically highly relevant. In the 1950s, Reichardt and Hassenstein formulated an explicit model for motion estimation based on insect behavioral experiments. Their observations revealed certain specific biases in animal motion responses, which were captured by their correlation model of motion detection. Later, an alternative visual motion estimation model was put forward by Limb and Murphy. Their proposal, also known as the ratio of gradients model, does not show those characteristic biases.
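The two models can be contrasted on a synthetic drifting pattern: a delay-and-multiply Reichardt correlator, whose opponent output signs the motion direction, and a Limb-Murphy gradient estimator, v = -It/Ix (all parameters below are illustrative):

```python
import numpy as np

def drifting_signal(x, t, v):
    """1-D sinusoidal pattern translating at speed v."""
    return np.sin(x - v * t)

dx, dt, v_true = 0.1, 0.01, 2.0
ts = np.arange(0.0, 10.0, dt)
I1 = drifting_signal(0.0, ts, v_true)    # photoreceptor at x = 0
I2 = drifting_signal(dx, ts, v_true)     # photoreceptor at x = dx

# Reichardt correlator: multiply each input by the delayed neighbor and
# subtract the mirror arm; the sign of the time-averaged opponent output
# gives the motion direction (positive = toward x = dx here).
delay = 1
reichardt = np.mean(I1[:-delay] * I2[delay:] - I2[:-delay] * I1[delay:])

# Limb-Murphy gradient estimator: v = -It / Ix, averaged over samples
# where the spatial gradient is large enough to be reliable.
It = (I1[1:] - I1[:-1]) / dt             # temporal derivative at x = 0
Ix = (I2[:-1] - I1[:-1]) / dx            # spatial derivative estimate
mask = np.abs(Ix) > 0.5
v_gradient = np.mean(-It[mask] / Ix[mask])
```

On this clean input the gradient estimator recovers the speed almost exactly, while the correlator's output is only direction-signed and depends on contrast and temporal frequency, precisely the kind of bias the correlation model was built to explain.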
Here I will approach motion estimation from two angles: Animal
experiments, and statistical sampling of natural signals. First we will look at
motion estimation in the visual system of the blowfly, with an emphasis on
performance under natural conditions. As
noted above, the array of photoreceptors in the retina implicitly contains data
on self motion. However, this relation is noisy, indirect and ambiguous due to
photon shot noise and optical blurring. Further, natural variations in the visual signal-to-noise ratio are
enormous, and nonlinear operations are especially susceptible to noise. One can
therefore reasonably hope that animals have evolved interesting optimization
strategies to deal with large variations in signal quality. Experimental data
from motion sensitive neurons in the fly brain illustrate some of these
solutions. We will then look at the results of a study in which we derive
optimal visual motion estimators from direct sampling of the relevant natural
signals. A comparison of the two approaches suggests that the fly approaches
optimal estimation strategies, and that these solutions in fact interpolate
between the two motion estimation models mentioned above.
Object recognition by scene alignment
Antonio Torralba
Department of Electrical Engineering and Computer Science
Object detection and recognition is generally posed as a matching problem between the
object representation and the image features (e.g., aligning pictorial cues,
shape correspondence, constellations of parts, etc.) while rejecting the
background features using an outlier process. In this work, we take a different
approach: we formulate the object detection problem as a problem of aligning
elements of the entire scene. The background, instead of being treated as a set
of outliers, is used to guide the detection process. Our approach relies on the
observation that when the database is large enough, we can find with high
probability images in the database very close to a query image: similar
scenes with similar objects arranged in similar spatial configurations.
If the images in the retrieval set are partially labeled, then we can transfer
the knowledge of the labeling to the query image, and the problem of object
recognition becomes a problem of aligning scene regions. But, can we find a
dataset large enough to cover a large number of scene configurations? Given an
input image, how do we find a good retrieval set, and, finally, how do we
transfer the labels to the input image? We will use two datasets: 1) the
LabelMe dataset, which contains more than 10,000 labeled images with over
180,000 annotated objects; 2) the tiny images dataset, with more than
79,000,000 weakly labeled images. We use these datasets to
perform object and scene classification, examining performance over a range of
semantic levels.
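The retrieval-and-transfer step can be sketched with a toy global descriptor (a downsampled grid standing in for a real scene descriptor such as GIST; the function names and data here are illustrative, not the paper's):

```python
import numpy as np

def descriptor(image, grid=4):
    """Average-pool an image into a grid x grid global descriptor
    (a crude stand-in for a holistic scene descriptor)."""
    h, w = image.shape
    a = image[:h - h % grid, :w - w % grid]
    return a.reshape(grid, h // grid, grid, w // grid).mean(axis=(1, 3)).ravel()

def transfer_labels(query, database):
    """database: list of (image, labels) pairs; return the labels of the
    nearest neighbor in descriptor space (the simplest retrieval set)."""
    q = descriptor(query)
    dists = [np.linalg.norm(q - descriptor(img)) for img, _ in database]
    return database[int(np.argmin(dists))][1]
```

With a large enough labeled database, the nearest scenes tend to contain the same objects in the same layout, so transferring their labels turns detection into scene alignment.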
Work in collaboration with Rob Fergus, Bryan Russell, Ce Liu and William T. Freeman.