Abhijit S. Ogale

Senior Engineer, Computer Vision and Graphics
Google Inc.

Research

·         Low and intermediate-level vision

·         Visual motion analysis

·         Human action understanding

·         Camera Networks

·         Video stabilization and tracking


Low and intermediate-level vision   


Summary: Low and intermediate-level vision comprises several problems, such as the computation of stereo disparity, optical flow, depth, shape, occlusions, and 3D motion, along with various segmentation problems based on modalities such as depth, motion, texture and color. These problems depend on each other in a chicken-and-egg fashion. For ease of formulation and solution, they are often treated independently, which leads to sub-optimal and sometimes even incorrect solutions. As part of my doctoral work, I have shown that problems such as image correspondence, segmentation and shape are inseparable and must be solved together. This work has led to new compositional stereo and optical flow algorithms, which succeed in cases where other approaches fail.


Some results:

Stereo and optical flow results for two pairs of images. Occlusions are shown in white. Color-coded disparities are shown for the stereo pair, while X and Y components of the optical flow are shown separately for the latter pair.

[Figure: stereo and optical flow results]


In the figure below, the top row shows a stereo pair with a horizontally slanted blue object, while the bottom row shows a pair with a vertically slanted blue object. Using a standard algorithm such as graph cuts on these pairs (third column of the figure) yields fronto-parallel flat surfaces as depth solutions for the blue objects, even though we know that these objects are slanted. Our approach (see reference 1, CVPR'04 below), which simultaneously estimates disparity, slant, and the segmentation, gives correct results (rightmost column of the figure).
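The gap between a fronto-parallel model and a slanted one can be illustrated with a minimal sketch. It assumes a rectified stereo pair, where the disparity of a planar surface varies linearly along a scanline (d = a*x + c), and compares a constant fit with a linear fit on synthetic data. The function names and the data are hypothetical, and this is not the CVPR'04 algorithm, which also estimates the segmentation.

```python
def fit_constant(ds):
    """Fronto-parallel model: a single disparity for the whole segment."""
    c = sum(ds) / len(ds)
    sse = sum((d - c) ** 2 for d in ds)
    return c, sse


def fit_slanted(xs, ds):
    """Slanted model d = a*x + c, fitted by ordinary least squares."""
    n = len(xs)
    mx = sum(xs) / n
    md = sum(ds) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxd = sum((x - mx) * (d - md) for x, d in zip(xs, ds))
    a = sxd / sxx
    c = md - a * mx
    sse = sum((d - (a * x + c)) ** 2 for x, d in zip(xs, ds))
    return a, c, sse


# Synthetic scanline across a horizontally slanted surface (made-up values).
xs = list(range(10))
ds = [5.0 + 0.5 * x for x in xs]

const_c, const_sse = fit_constant(ds)
a, c, slanted_sse = fit_slanted(xs, ds)
print("constant fit error:", const_sse)
print("slanted fit error: ", slanted_sse)
```

The constant model leaves a large residual on this data while the linear model fits it exactly, which is the kind of error a fronto-parallel assumption introduces on slanted objects.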


In the figure below, stereo results for four pairs of images are shown. In each pair, there is a contrast mismatch between the images. In some cases, the contrast is spatially varying. Disparities are color coded, and occlusions are shown in white. Contrast-invariant matching is achieved by using information in spatial frequency channels for local matching.

[Figure: stereo results for image pairs with contrast mismatch]
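To illustrate why a contrast mismatch defeats raw intensity comparison, here is a minimal sketch using normalized cross-correlation (NCC). NCC is a standard stand-in chosen for brevity; it is not the spatial frequency channel method described above. The patch values and function names are hypothetical.

```python
import math


def sad(p, q):
    """Sum of absolute differences: sensitive to gain and offset changes."""
    return sum(abs(a - b) for a, b in zip(p, q))


def ncc(p, q):
    """Normalized cross-correlation: unchanged by a positive gain and an offset."""
    n = len(p)
    mp, mq = sum(p) / n, sum(q) / n
    dp = [v - mp for v in p]
    dq = [v - mq for v in q]
    denom = math.sqrt(sum(v * v for v in dp) * sum(v * v for v in dq))
    if denom == 0:
        return 0.0
    return sum(a * b for a, b in zip(dp, dq)) / denom


left = [10.0, 40.0, 20.0, 80.0, 30.0]
# The same patch seen with much lower contrast and a brightness offset.
right_same = [0.2 * v + 100.0 for v in left]
# A different patch whose raw intensities happen to be closer to `left`.
right_other = [30.0, 20.0, 40.0, 50.0, 45.0]

print("SAD prefers the wrong patch:", sad(left, right_other) < sad(left, right_same))
print("NCC prefers the right patch:", ncc(left, right_same) > ncc(left, right_other))
```

Computing the score over small local windows, as here, is what lets a measure of this kind tolerate contrast that varies across the image.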


Visual motion analysis

Summary: Visual motion analysis includes problems such as camera motion estimation, 3D structure from video, and the detection of independently moving objects in a video. In my doctoral work, I classify independently moving objects into three groups, including a previously unknown group of moving objects which is found using occlusions. I have created algorithms which automatically discover ordinal depth relations in a video using occlusions to find new moving objects. These techniques can be used as building blocks for applications such as semi-autonomous navigation (e.g., driver assistance technology for cars), autonomous navigation (e.g., unmanned ground vehicles), video compression, surveillance and graphics.


Some results:

There are three classes of independently moving objects: (a) Class 1: those which are detected using motion alone; (b) Class 2: those which are detected using motion and occlusions by looking for ordinal depth violations; and (c) Class 3: those which are detected by comparing depth from motion with depth from another source (such as stereo). Toy examples of these three classes are shown below. In each case, the red object moves independently, and the arrows indicate the optical flow. Dashed regions indicate regions which will soon be occluded due to the movement.

 

 

The first column shows a situation in which the background objects (which do not move independently) are translating horizontally, while the red object is moving vertically. In this scenario, motion-based clustering approaches will be successful, and such Class 1 objects can be detected using motion alone. The second column shows a situation in which the background objects are translating horizontally to the right, and the red object also moves towards the right. In this scenario, motion clustering will fail, and we also need the occlusions to find such objects. The occlusions tell us that the red object is behind the black object. However, if we compute depth from motion, since the motion is predominantly a translation, the result would indicate that the red object is in front of the black object (since the red object moves faster). This conflict signals Class 2 moving objects. The third column shows a situation similar to the second column, except that the black object which was in front of the red object has been removed. The ordinal depth conflict in the earlier case is no longer present, and we must employ cardinal comparisons between structure from motion and structure from another source (such as stereo) to identify Class 3 moving objects.
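The decision logic for the three toy cases can be sketched as follows. This is a simplified illustration under stated assumptions, not the algorithm used in this work: the camera motion is a pure lateral translation, so image speed is inversely proportional to depth; the scale factor relating speed to depth is assumed known; and all names and thresholds are hypothetical.

```python
def depth_from_translation(flow_speed, scale=1.0):
    """Under pure lateral translation, depth = scale / image speed (scale assumed known)."""
    return scale / flow_speed


def classify(direction_matches_background, obj_speed,
             occluder_speed=None, stereo_depth=None, scale=1.0, rel_tol=0.2):
    """Return 1, 2 or 3 for the class of moving object, or 0 if no conflict is found.

    occluder_speed: image speed of a region that occlusions show to be in front
                    of the object (None if no such region exists).
    stereo_depth:   depth of the object from a second source (None if unavailable).
    """
    # Class 1: the flow direction disagrees with the background motion.
    if not direction_matches_background:
        return 1

    motion_depth = depth_from_translation(obj_speed, scale)

    # Class 2: occlusions say the object is behind the occluder,
    # but depth from motion says it is closer (ordinal conflict).
    if occluder_speed is not None:
        occluder_depth = depth_from_translation(occluder_speed, scale)
        if motion_depth < occluder_depth:
            return 2

    # Class 3: depth from motion disagrees with depth from stereo (cardinal conflict).
    if stereo_depth is not None:
        if abs(motion_depth - stereo_depth) > rel_tol * stereo_depth:
            return 3

    return 0


print(classify(False, obj_speed=2.0))                      # first column
print(classify(True, obj_speed=3.0, occluder_speed=2.0))   # second column
print(classify(True, obj_speed=3.0, stereo_depth=1.0))     # third column
```

The three calls correspond to the three columns of the toy figure and return 1, 2 and 3 respectively.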

Finding ordinal depth using occlusions: Given two frames from a video, occlusions are points in one frame which have no corresponding point in the other frame. However, merely knowing the occluded regions is not sufficient to deduce ordinal depth. We also need to know 'who occluded what' as opposed to merely knowing 'what was occluded'. Thus, we have to group occluded regions with their neighboring visible regions. Then, if we find (say) that region R1 disappeared under region R2, then we can say that R1 is behind R2.
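The grouping step can be sketched on a single scanline. In this minimal example, each run of occluded pixels is assigned to whichever visible neighbor it resembles most in intensity, and the other neighbor is taken to be the occluder. Intensity similarity is only a hypothetical stand-in for the grouping cue, and the function name and data are made up for illustration.

```python
def ordinal_relations(labels, intensity):
    """Return a set of (behind, in_front) region pairs for one scanline.

    labels[i] is the region id of pixel i, or None if the pixel is occluded
    (has no correspondence in the other frame). intensity[i] is its gray value.
    """
    relations = set()
    n = len(labels)
    i = 0
    while i < n:
        if labels[i] is not None:
            i += 1
            continue
        # Find the end of this occluded run, which covers pixels [i, j).
        j = i
        while j < n and labels[j] is None:
            j += 1
        # A relation can only be deduced if both neighbors are visible and differ.
        if i > 0 and j < n and labels[i - 1] != labels[j]:
            left, right = labels[i - 1], labels[j]
            run_mean = sum(intensity[i:j]) / (j - i)
            d_left = abs(run_mean - intensity[i - 1])
            d_right = abs(run_mean - intensity[j])
            if d_left < d_right:
                relations.add((left, right))   # run belongs to left, hidden by right
            elif d_right < d_left:
                relations.add((right, left))   # run belongs to right, hidden by left
        i = j
    return relations


labels = ["bg", "bg", None, None, "tree", "tree"]
intensity = [200, 198, 201, 199, 50, 52]
print(ordinal_relations(labels, intensity))
```

Here the occluded pixels look like the background, so the background is the region that disappeared and the tree is in front of it.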

The figure below demonstrates the idea behind finding ordinal depth by filling in occlusions found in the optical flow. The top portion (a) shows three frames of a video sequence. The yellow region which is visible in Frame 1 and Frame 2 disappears behind the tree (i.e., becomes occluded) in Frame 3. The next row (b) shows the reverse optical flow (frame 2 to 1) and the forward flow (frame 2 to 3). Only the x-components are shown. Occlusions are colored white. In (c), occlusions in the forward flow u23 are filled using the segmentation from the reverse flow u21. After filling, we can find ordinal depth relations as shown in (d), where the tree (marked in green) is found to be in front of the region on its left. The advantage of this technique is that it uses purely optical flow information, and is directly applicable even in the case where independently moving objects are present.

[Figure: finding ordinal depth by filling occlusions in the optical flow]

 Motion segmentation results: Here are some examples where each type of object is detected in real video sequences. (Note that for class 3, stereo images are used to compare depth from motion with depth from stereo; that is why the left and right stereo images are shown for the middle frame in the case of class 3.)


Human action understanding

Summary: Human activity, like human speech, requires mechanisms which can be used for both recognition and generation. The relationship runs even deeper, since human speech is mostly used to describe actions. Hence, it makes sense to examine whether the computational models for speech can also be applied to the problem of recognizing and generating human actions. Adopting this viewpoint, my current research seeks to model human activity using grammars. Loosely speaking, the alphabet of this language consists of body poses (which include motion data), the words can be thought of as actions (such as jump, kneel), while sentences describe activity. Sequences of simple actions can be parsed to discover more abstract descriptions of activity. We use training videos featuring many actors and multiple viewpoints, with each actor performing a given set of basic actions. The figure below shows a dataset with 8 views, 10 actors, and several sample keyposes.

[Figure: dataset with 8 views, 10 actors, and sample keyposes]

Key poses are detected by extracting keyframes, which are extreme pose or movement configurations. These are found using the optical flow. Here is an example with the sit and stand actions. For more examples, click here.

Optical flow: The video below shows the action (left), the X-component of the optical flow (middle) and the Y-component (right). Blue indicates negative flow and red indicates positive flow. Click the image below to play the video.

[Video: sit and stand actions with optical flow components]

[Figure: keyframes for the sit action]
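One simple way to pick out extreme configurations from optical flow is sketched below. It assumes, purely for illustration, that a keyframe shows up as a strict local minimum or maximum of the mean flow magnitude over time; the actual criterion used to extract keyframes is not specified here, and the function names and data are hypothetical.

```python
def mean_flow_magnitude(flow):
    """flow: list of (u, v) flow vectors for one frame."""
    return sum((u * u + v * v) ** 0.5 for u, v in flow) / len(flow)


def local_extrema(signal):
    """Indices where the signal is a strict local minimum or maximum."""
    keys = []
    for i in range(1, len(signal) - 1):
        prev, cur, nxt = signal[i - 1], signal[i], signal[i + 1]
        if (cur > prev and cur > nxt) or (cur < prev and cur < nxt):
            keys.append(i)
    return keys


# Synthetic sequence: every frame has four identical flow vectors (made-up values).
speeds = [0.1, 0.8, 1.5, 0.9, 0.2, 0.7, 1.4, 0.6, 0.1]
frames = [[(s, 0.0)] * 4 for s in speeds]

energy = [mean_flow_magnitude(f) for f in frames]
keyframes = local_extrema(energy)
print(keyframes)  # [2, 4, 6]
```

Frames 2 and 6 are peaks of movement, while frame 4 is a pause between two movements, which is where an extreme pose would be expected.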

This training data is used to create a model (a probabilistic context-free grammar), which is then used for recognizing (parsing) actions and viewpoints within a new video. The figure below shows an example of viewpoint and 3D pose recognition using this system. The input video (in the leftmost column) shows a person walking in a circle, then picking up something from the floor. On the right side, each row shows the identified multi-viewpoint 3D pose from the database, and the orange cells denote the identified view.

The figure below shows the most probable parse tree returned by the system for a novel sequence involving four actions performed in sequence (walk, turn, kick, kneel).
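How a most probable parse is obtained from a probabilistic context-free grammar can be sketched with a small Viterbi-style CYK parser. The grammar below is a toy in Chomsky normal form whose terminals are action names rather than body poses; its symbols and probabilities are invented for illustration and are not the grammar learned from the training data.

```python
def viterbi_parse(tokens, lexical, binary, start):
    """Return (probability, tree) of the most probable parse, or None if none exists.

    lexical: {(symbol, word): prob}          rules of the form symbol -> word
    binary:  {(parent, left, right): prob}   rules of the form parent -> left right
    """
    n = len(tokens)
    best = {}  # (i, j, symbol) -> (prob, tree) for the span tokens[i:j]

    # Spans of length 1 come from lexical rules.
    for i, tok in enumerate(tokens):
        for (sym, word), p in lexical.items():
            if word == tok and p > best.get((i, i + 1, sym), (0.0, None))[0]:
                best[(i, i + 1, sym)] = (p, (sym, tok))

    # Longer spans combine two adjacent sub-spans with a binary rule.
    for span in range(2, n + 1):
        for i in range(0, n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for (parent, left, right), p in binary.items():
                    l = best.get((i, k, left))
                    r = best.get((k, j, right))
                    if l is None or r is None:
                        continue
                    prob = p * l[0] * r[0]
                    if prob > best.get((i, j, parent), (0.0, None))[0]:
                        best[(i, j, parent)] = (prob, (parent, l[1], r[1]))

    return best.get((0, n, start))


# Hypothetical toy grammar.
lexical = {("WALK", "walk"): 1.0, ("TURN", "turn"): 1.0,
           ("KICK", "kick"): 1.0, ("KNEEL", "kneel"): 1.0}
binary = {("ACTIVITY", "MOVE", "ACT"): 1.0,
          ("MOVE", "WALK", "TURN"): 0.6, ("MOVE", "TURN", "WALK"): 0.4,
          ("ACT", "KICK", "KNEEL"): 0.5, ("ACT", "KNEEL", "KICK"): 0.5}

result = viterbi_parse(["walk", "turn", "kick", "kneel"], lexical, binary, "ACTIVITY")
print(result)
```

The returned tree nests the action symbols under the higher-level symbols of the toy grammar, analogous to the more abstract activity descriptions discussed above; a sequence the grammar cannot derive returns None.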



Camera Networks

Summary: At the University of Maryland, I have studied various camera networks, including the 64-camera system at the Keck laboratory. I have also designed and constructed a complete mobile camera network, which involved hardware setup, video capture software creation in Linux, and construction of the camera synchronization network. This system can continuously capture synchronized digital video at a resolution of 1024x768 (gray) at 15 fps from nine cameras for a duration of 30 minutes using only a single PC (Dell Precision 620, 866 MHz PIII processor and four SCSI 15K rpm drives). Here is a photo of the system being used outdoors.

[Photo: the mobile camera network in use outdoors]

This system was used for several purposes. The photo shows the cameras in an omnidirectional Argus Eye configuration, with cameras mounted on a wooden octahedron, which is carried around by a person. The papers related to the Argus Eye configuration are given below.
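The capture figures quoted above imply a substantial sustained data rate, which a short calculation makes concrete. The sketch assumes 8-bit grayscale pixels, uncompressed storage, and an even split of the load across the four drives; none of these details is stated above.

```python
WIDTH, HEIGHT = 1024, 768
BYTES_PER_PIXEL = 1        # assumption: 8-bit grayscale
FPS = 15
CAMERAS = 9
DURATION_S = 30 * 60       # 30 minutes
DRIVES = 4                 # assumption: load split evenly across the drives

bytes_per_frame = WIDTH * HEIGHT * BYTES_PER_PIXEL
rate_per_camera = bytes_per_frame * FPS          # bytes per second, one camera
total_rate = rate_per_camera * CAMERAS           # bytes per second, all cameras
total_bytes = total_rate * DURATION_S            # bytes for the full recording
rate_per_drive = total_rate / DRIVES

print(f"aggregate rate: {total_rate / 1e6:.1f} MB/s")      # 106.2 MB/s
print(f"per drive:      {rate_per_drive / 1e6:.1f} MB/s")  # 26.5 MB/s
print(f"30 minutes:     {total_bytes / 1e9:.1f} GB")       # 191.1 GB
```

Under these assumptions the single PC has to sustain roughly 106 MB/s of writes, about 191 GB over a full 30-minute recording.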



Video stabilization and tracking

Summary: Software for video stabilization and for moving object detection and tracking, developed for the Video Verification of Identity (VIVID) project.

Some stabilization results: The clip shows the original input video on top, and the stabilized video at the bottom. The stabilization is reset every sixty frames.
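The periodic reset can be sketched as simple bookkeeping. This minimal example assumes the inter-frame motion is a pure 2D translation that has already been estimated; the actual motion model of the stabilization software is not described here, and the function name is hypothetical. Each frame is shifted back by the motion accumulated since the most recent reference frame, and a new reference is chosen every sixty frames.

```python
RESET_INTERVAL = 60


def stabilizing_offsets(interframe_motion, reset_interval=RESET_INTERVAL):
    """Return the (dx, dy) correction to apply to each frame.

    interframe_motion[t] is the estimated motion of frame t relative to
    frame t-1. The entry for a reference frame (t = 0, 60, 120, ...) is ignored.
    """
    offsets = []
    cx = cy = 0.0
    for t, (dx, dy) in enumerate(interframe_motion):
        if t % reset_interval == 0:
            cx = cy = 0.0          # frame t becomes the new reference
        else:
            cx += dx               # accumulate motion since the reference
            cy += dy
        offsets.append((-cx, -cy))  # shift the frame back onto the reference
    return offsets


# Synthetic camera drift of one pixel per frame to the right (made-up values).
motion = [(1.0, 0.0)] * 130
offsets = stabilizing_offsets(motion)
print(offsets[59], offsets[60], offsets[61])
```

In this example the correction grows to 59 pixels at frame 59, drops to zero at frame 60 when the reference is reset, and starts growing again from frame 61.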



Here is another example where the same technique is applied to a video with plenty of depth variation, and large independent object motion. The video shows the stabilized sequence, and the inset on the top-left shows the original video.





Last updated: Feb 2008  
This page will soon move to a new location.