A Visionary Session

Synopsis of the Computer Vision Session
at the workshop on
The Interface of Three Areas of Computer Science with the Mathematical Sciences





In this synopsis, we summarize the presentations and the discussion of open problems.

THE NATURE OF COMPUTER VISION

The second session of the meeting concerned research in Computer Vision. Larry Davis (University of Maryland) noted that the central challenge of the field is to give computers the basic vision abilities that humans possess: for example, deciding whether a person is within view, interpreting a streetscape well enough to make navigation decisions for an automobile, and visualizing what a scene looks like from a different viewpoint or under different lighting conditions.

The input to these decision processes is one or more digital images. A two-dimensional image is divided into rectangular boxes called pixels, and for each pixel there is an observed value (or set of values, if, for example, the image is color rather than black-and-white). Needless to say, the observations can be quite noisy, so it is important that algorithms be robust.
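
As a concrete illustration (our own minimal sketch, not part of the presentations), a grayscale image is naturally stored as a two-dimensional array of pixel values, a color image adds a third axis for its channels, and the noise mentioned above is often modeled, in the simplest case, as an additive random perturbation of those values:

    import numpy as np

    # A 480 x 640 grayscale image: one intensity value per pixel.
    image = np.zeros((480, 640))

    # A color image adds a third axis, e.g. three channels (red, green, blue).
    color_image = np.zeros((480, 640, 3))

    # A simple (and simplistic) noise model: additive Gaussian perturbations.
    rng = np.random.default_rng(0)
    observed = image + rng.normal(scale=0.05, size=image.shape)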

Tony Chan (UCLA) moderated the vision session, noting that there are major challenges in this field. We need to operate with only incomplete information because of factors such as motion or occlusion of some objects by others. In addition, many problems are inherently ill-posed.

Jitendra Malik (University of California at Berkeley) presented many open problems, dividing them into four topic areas: image reconstruction, visual control, image segmentation, and object recognition.

IMAGE RECONSTRUCTION

The reconstruction problem is an inverse problem: construct the three-dimensional scene, given a two-dimensional image of it. We want to recover characteristics such as the geometry of surfaces, reflectances, and motion. Human viewers get clues from stereo vision, the multiple views imparted by motion, symmetries, and prior expectations, and these are the basic tools that underlie computer reconstructions, too.

Problem 1: Given multiple views of a scene, find the corresponding features in the views. (Current algorithms are about 95% accurate, but this is not good enough to yield reconstructions that are completely satisfying to human viewers.)
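
To make the correspondence problem concrete, here is a minimal sketch (ours, not an algorithm discussed at the session) of one classical baseline: match the small patch of pixels around a feature in one view against nearby locations in a second view by sum of squared differences, and keep the best match.

    import numpy as np

    def match_patch(view1, view2, r, c, half=3, search=20):
        """Locate the point in view2 whose surrounding patch best matches the
        patch centered at (r, c) in view1, by sum of squared differences.
        Assumes (r, c) lies far enough from the border of view1."""
        template = view1[r - half:r + half + 1, c - half:c + half + 1]
        best_ssd, best_pos = np.inf, (r, c)
        for dr in range(-search, search + 1):
            for dc in range(-search, search + 1):
                rr, cc = r + dr, c + dc
                if (rr - half < 0 or cc - half < 0 or
                        rr + half + 1 > view2.shape[0] or
                        cc + half + 1 > view2.shape[1]):
                    continue  # candidate window falls outside view2
                candidate = view2[rr - half:rr + half + 1, cc - half:cc + half + 1]
                ssd = np.sum((candidate - template) ** 2)
                if ssd < best_ssd:
                    best_ssd, best_pos = ssd, (rr, cc)
        return best_pos

Real systems use more discriminative features and geometric constraints; the point of the sketch is only that each match is chosen by optimizing a similarity score, and some fraction of such matches are inevitably wrong.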

Problem 2: Estimate the three-dimensional scene from multiple views. This problem is solved when the mapping is simple, for example (x, y, z) maps to (a(x), b(y), z), but not for perspective projection. (One participant noted that total least squares, or errors-in-variables, is a possible approach to this problem.)
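
To see why the perspective case is harder (our own illustrative notation): a camera with focal length f maps a scene point (x, y, z) to the image point (f x / z, f y / z), so the observed coordinates depend nonlinearly on the unknown depth z, and estimating the scene from noisy image measurements becomes a nonlinear errors-in-variables problem rather than a coordinate-wise linear one.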

Problem 3: Develop effective algorithms for modeling reflectance and texture for natural objects. The basic physics has been known for 100 years, but modeling surfaces such as human skin still cannot be done satisfactorily.

VISUAL CONTROL

A second interesting area of research is visual control, used to provide visual feedback for obstacle avoidance and locomotion guidance. In the mid-1990s, a computer system developed by Carnegie Mellon University successfully controlled a car driving cross-country for over 90% of the travel time, but not in urban or congested areas. The control is hierarchical: we need to make low-level decisions such as which lane to travel in, then we need to decide on the next turn, and at all times we need a high-level plan for navigating to the final destination.

Problem 4: If we control a navigation system by making use of visual information, there will inevitably be delays in the feedback loop due to processing time. Thus we need to design control laws involving look-ahead to avoid instabilities in control.
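
A toy simulation (our own sketch, not a system from the talks) shows why the delay matters: a proportional lane-keeping correction that is perfectly stable when measurements are fresh can oscillate and diverge once the same gain acts on measurements that are several time steps old, which is exactly what look-ahead or prediction is meant to prevent.

    import numpy as np

    def simulate(gain, delay, steps=200):
        """Toy lane-keeping model: the position error e is corrected using a
        visual measurement that is `delay` time steps old."""
        e = np.zeros(steps)
        e[0] = 1.0  # start one unit away from the lane center
        for t in range(steps - 1):
            measured = e[max(t - delay, 0)]    # stale measurement from the vision system
            e[t + 1] = e[t] - gain * measured  # proportional correction
        return e

    print(abs(simulate(gain=0.5, delay=0)[-1]))  # error decays with no delay
    print(abs(simulate(gain=0.5, delay=5)[-1]))  # same gain grows once delay is included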

IMAGE SEGMENTATION

Segmentation is a basic task in image processing: partitioning a collection of pixels into objects. Object boundaries are cued by changes in brightness, color, texture, and stereoscopic depth, aided by motion and by recognition of certain common objects (e.g., a chair). The processing is usually performed bottom-up (processing pixels to determine objects), but a top-down approach can also be useful, applying an object model to make hypotheses such as, "If I see a human body, I expect to see a face."

Segmentation can be performed by surface fitting or by probabilistic inference with a Markov random field model. More recently, graph partitioning has been applied to the problem (Shi and Malik). Each pixel is a node in the graph, and the weight of an edge is based on the similarity between a pair of pixels in features such as brightness and texture, as well as their spatial proximity. The eigenvectors of the graph Laplacian can be used to partition the pixels into segments, but the mathematical theory is incomplete: properties of the eigenvectors are not well understood.
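
A minimal sketch of the spectral idea (our illustration of the general approach, using an unnormalized Laplacian rather than Shi and Malik's normalized cut): build a pixel-similarity graph over a small grayscale image, form the graph Laplacian, and threshold its second-smallest eigenvector to split the pixels into two segments.

    import numpy as np

    def spectral_bipartition(image, sigma_i=0.1, sigma_x=4.0):
        """Split a small grayscale image into two segments by thresholding an
        eigenvector of the graph Laplacian of a pixel-similarity graph."""
        h, w = image.shape
        coords = np.array([(r, c) for r in range(h) for c in range(w)], dtype=float)
        vals = image.reshape(-1)

        # Edge weights combine brightness similarity and spatial proximity.
        d_val = (vals[:, None] - vals[None, :]) ** 2
        d_pos = np.sum((coords[:, None, :] - coords[None, :, :]) ** 2, axis=2)
        W = np.exp(-d_val / sigma_i**2) * np.exp(-d_pos / sigma_x**2)

        L = np.diag(W.sum(axis=1)) - W   # graph Laplacian D - W
        _, eigvecs = np.linalg.eigh(L)   # eigenvalues/vectors in ascending order
        fiedler = eigvecs[:, 1]          # second-smallest eigenvector
        return (fiedler > 0).reshape(h, w)

    # Example: a bright square on a dark background.
    img = np.zeros((10, 10))
    img[2:6, 2:6] = 1.0
    print(spectral_bipartition(img).astype(int))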

Problem 5: Perform segmentation by unifying the bottom-up and top-down approaches and making use of all of the visual cues (brightness, color, depth, texture, etc.).

OBJECT RECOGNITION

A fourth interesting area is recognition of classes of objects (such as an automobile) or an instance of an object in a class (such as a Volkswagen Beetle). Humans may be able to distinguish 10,000-100,000 objects and are tolerant of different views, illuminations, and occlusions.

Problem 6: Develop a system that can infer material properties; for example, distinguish between skin and metal and make a decision about how hard to grasp the object.

Problem 7: Develop a unified framework for segmentation and recognition, representing variabilities within categories (e.g., shape) by using the interplay between discriminative models (such as neural networks and support vector machines) and generative models (probabilistic models that can synthesize new instances).

MORE DIFFICULT TYPES OF OBJECT RECOGNITION

Larry Davis emphasized high level problems in his presentation, noting that segmentation is really deciding what to see. In some instances, it is necessary to segment the bumper on an automobile, while in others, the critical information is that the aggregate item is an automobile.

Motion is a helpful clue to segmentation, but it presents its own challenges:

Problem 8: Distinguish the moving object itself from its shadows and reflections, especially in the presence of dynamic changes in background (for example, due to rain, wind, lighting, ...).

Rigid objects are identified by matching against a library of images, or by characterizing their shape and texture. Deformable objects present more challenges but also more opportunities; emotion detection, for instance, depends on studying the deformability of faces.

Problem 9: Identify deformable objects, with one of the hardest cases being human faces.

Davis noted that visual learning is essential at every level of image processing. We must reason about what we see using static and dynamic analysis to identify objects and refine hypotheses after further observations.

DIFFERENTIAL EQUATION MODELS IN VISION

Guillermo Sapiro (University of Minnesota) discussed some applications of partial differential equations to vision. He noted that ideas that are commonplace in one field often become the "hottest developments" in a different one, so communication among researchers with different backgrounds is essential. Over the course of the session, several participants presented examples in which computer science ideas were rediscovered in the partial differential equation literature, and in which mathematical ideas were rederived for vision.

Sapiro presented examples in which diffusion equations (similar to those used in thin film fluid dynamics but with different boundary conditions) can be used to fill in objects that are partially occluded.
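
As a crude illustration of the filling-in idea (a sketch under our own simplifications, not the equations Sapiro uses), one can run an explicit isotropic heat-equation step on the occluded pixels only, holding the known pixels fixed, so that information diffuses inward from the boundary of the hole:

    import numpy as np

    def diffuse_fill(image, mask, iters=500, dt=0.2):
        """Fill the pixels where mask is True by repeated explicit diffusion
        steps, keeping the known pixels fixed."""
        u = image.copy()
        for _ in range(iters):
            lap = (np.roll(u, 1, 0) + np.roll(u, -1, 0) +
                   np.roll(u, 1, 1) + np.roll(u, -1, 1) - 4 * u)
            u[mask] += dt * lap[mask]  # update only the unknown (occluded) region
        return u

    # Example: a horizontal gradient with a square hole punched out of the middle.
    img = np.tile(np.linspace(0, 1, 32), (32, 1))
    hole = np.zeros_like(img, dtype=bool)
    hole[12:20, 12:20] = True
    img[hole] = 0.0
    print(np.round(diffuse_fill(img, hole)[16, 10:22], 2))  # gradient is restored across the hole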

THE MODELING PROBLEM

Even young children know which visual cues to use in order to segment an image: sometimes shape, sometimes color, sometimes texture.

Problem 10: How can a computer know which visual cues are important in each instance?

Problem 11: Use differential geometry to match features in 3-d images, such as brain scans, just as it is used to match 2-d images today.

Other problems were posed in the discussion period.

Problem 12: Segment images by the distance between the object and the camera in order to be able to peel off subimages to isolate deeper layers.

Problem 13: Compress images subject to constraints that preserve certain critical features such as coastlines.

Many problems arise from projective geometry:

Problem 14: Study orthogonal projection on non-flat manifolds, since perspective geometry leads to cones.

Others arise from possible mechanistic connections between computer vision and biological vision.

Problem 15: Tomaso Poggio (MIT) asked: In biological vision, is an object represented as a series of views or as a three-dimensional entity? How close should computer vision algorithms be to biology?

Problem 16: Demetri Terzopoulos (University of Toronto) noted the need for physical models: structural models with elastic component parts that capture the function of a complicated deformable object such as the human face.

SUMMARY

Vision is an example of a successful two-way street between mathematics and computer science: interesting mathematics has been used by vision researchers to build practical tools and commercially important innovations such as the JPEG standard for image storage. In the other direction, the work of some vision researchers on anisotropic diffusion led mathematicians to study the differential equation model that those researchers developed. Stanley Osher (UCLA) presented another example of this rich interaction. A paper by Alvarez, Guichard, Lions, and Morel, which axiomatized morphology-based image processing, led to the analysis of the motion of level sets by affine mean curvature. Also, L. Rudin's Caltech Ph.D. thesis in Computer Science introduced shock wave theory and BV spaces to image restoration.

The mathematical foundations of computer vision are well recognized. Geometry provides the basis for projections and for the level set method, and topology aids in the understanding of deformable surfaces. Probability and statistics are used for estimation under uncertainty. Analysis and partial differential equations contribute tools such as wavelet analysis, diffusion models, and the evolution of vector fields, while the understanding of properties such as reflectance and radiosity comes from mathematical physics. Tools from harmonic analysis, for example, have been basic to the development of the JPEG standard for storing images. Algebraic geometry has been useful in curve recognition. Computational mathematics is essential in developing fast algorithms, while many problems are best understood in terms of graph theory.

Despite the richness of past interactions, there is potential for much more collaboration. Jitendra Malik noted that the connection with theoretical computer science and combinatorics is not yet fully exploited; to work in vision, we need to jump from continuous formulations to discrete ones and back, giving even more opportunity for interplay.


Copyright 2000, Dianne P. O'Leary oleary@cs.umd.edu
I'm grateful for the comments of several participants that improved a draft of this document, but all remaining errors are my fault alone.
Last update: 06-23-00