To index video on the basis of action, we need to develop appropriate representations. Representations of human action appear to be quite intricate, and they are related to language. Both action and language have a recognitive and a generative component: we can generate and recognize actions just as we can produce and understand language. If we observe a video scene in which a person enters a room, recognizing this action requires knowing the meaning of "enter" and the concept of a "room". On the other hand, many actions can be recognized by identifying movements of different parts of the body, such as raising one's arm, turning, sitting, or walking. We call such actions visual verbs: actions performed by one person and defined by movements of parts of the body. Our effort therefore started with work on visual verbs. Additionally, we have begun to explore the visual "adverbs" and "adjectives" that characterize a given action (e.g., walking), representing the manner in which it is performed and/or the individual performing it.

The Anchors of Action

Through experiments with Prof. Nakayama of Harvard University, we have discovered that a human observer looking at action video needs only a few frames of the video for recognition. Humans appear to have a remarkable capacity for recognizing action, even if the video is highly distorted. See, for example, video 1 showing human action, and videos 2, 3, 4, and 5 below, where many different distortions have been applied, and judge for yourself how recognition degrades.

Motivated by this, we developed a computational mechanism for selecting key frames or poses, i.e., video frames that collectively describe the action (that is, if you see only these poses in succession, you recognize the action). In our current implementation we use optic flow measurements (the average flow of the silhouette) and select the poses at the maxima and minima of the flow values. Intuitively, these poses correspond to changes in the dynamics. The partition becomes optimal when the poses are such that the same dynamics govern the transition from one pose to the next. Another idea we are currently experimenting with for keyframe selection is the use of phase space, which we can obtain from motion capture data. The current technique provides excellent results. The following pictures show the selection of keyframes for a number of actions.
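As a rough illustration of the flow-extrema idea described above, the following Python sketch picks the frames at which a per-frame average-flow signal has a local maximum or minimum. It is a minimal sketch under stated assumptions, not our implementation: it assumes the average flow of the silhouette has already been computed for every frame, and the function name select_keyframes, the min_gap parameter, and the toy flow values are all hypothetical.

```python
from typing import List, Sequence


def select_keyframes(avg_flow: Sequence[float], min_gap: int = 1) -> List[int]:
    """Return indices of frames where the average silhouette flow has a
    strict local maximum or minimum.

    Hypothetical helper for illustration only. Flat plateaus in the
    signal are ignored in this simplified version.
    """
    keyframes: List[int] = []
    for i in range(1, len(avg_flow) - 1):
        prev_diff = avg_flow[i] - avg_flow[i - 1]
        next_diff = avg_flow[i + 1] - avg_flow[i]
        # A sign change in the discrete derivative marks an extremum.
        is_extremum = prev_diff * next_diff < 0
        # min_gap (an assumption) keeps selected frames from crowding together.
        if is_extremum and (not keyframes or i - keyframes[-1] >= min_gap):
            keyframes.append(i)
    return keyframes


# Toy flow profile (made-up numbers): speeds up, slows down, speeds up again.
flow = [0.0, 0.5, 1.0, 0.6, 0.2, 0.7, 1.2, 0.9]
keyframe_indices = select_keyframes(flow)
print(keyframe_indices)  # [2, 4, 6]
```

On real video the flow signal would normally be smoothed before looking for extrema, since noise produces many spurious sign changes.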

Thus, an action becomes the appropriate sequence of silhouettes. If we could develop a model for a number of actions, then, given a video depicting an action, we would first extract the key frame silhouettes or poses from the video as explained above, and then match them to the model in order to recognize the action.

In principle, this is what this work does, taking into account the grammatical structure of actions, as the following figures show.

We would like to describe the actions with probabilistic grammars whose terminals are the silhouettes themselves, or parts of them. Our solution resembles speech recognition, with silhouettes in place of phonemes, so grammars or hidden Markov models (HMMs) are natural candidates.
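To make the speech-recognition analogy concrete, here is a small Python sketch of standard Viterbi decoding over a toy HMM whose hidden states are poses. The text above does not say which decoding algorithm is used, so Viterbi is an assumption chosen for illustration. The state names, the observation symbols ("upright", "crouched", "reaching"), and all probabilities are invented.

```python
import math
from typing import Dict, List, Tuple


def _log(p: float) -> float:
    """Logarithm that maps zero probability to -inf instead of raising."""
    return math.log(p) if p > 0 else float("-inf")


def viterbi(
    obs: List[str],
    states: List[str],
    start_p: Dict[str, float],
    trans_p: Dict[str, Dict[str, float]],
    emit_p: Dict[str, Dict[str, float]],
) -> Tuple[List[str], float]:
    """Return the most likely state path and its log-probability."""
    best = [{s: _log(start_p[s]) + _log(emit_p[s][obs[0]]) for s in states}]
    back: List[Dict[str, str]] = [{}]
    for t in range(1, len(obs)):
        best.append({})
        back.append({})
        for s in states:
            # Best predecessor of state s at time t.
            prev, score = max(
                ((r, best[t - 1][r] + _log(trans_p[r][s])) for r in states),
                key=lambda pair: pair[1],
            )
            best[t][s] = score + _log(emit_p[s][obs[t]])
            back[t][s] = prev
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path, best[-1][last]


# Toy left-to-right model of a "pick up" action (all numbers are made up).
states = ["stand", "bentknees", "pickup"]
start_p = {"stand": 1.0, "bentknees": 0.0, "pickup": 0.0}
trans_p = {
    "stand": {"stand": 0.5, "bentknees": 0.5, "pickup": 0.0},
    "bentknees": {"stand": 0.0, "bentknees": 0.5, "pickup": 0.5},
    "pickup": {"stand": 0.0, "bentknees": 0.0, "pickup": 1.0},
}
emit_p = {
    "stand": {"upright": 0.8, "crouched": 0.1, "reaching": 0.1},
    "bentknees": {"upright": 0.1, "crouched": 0.8, "reaching": 0.1},
    "pickup": {"upright": 0.1, "crouched": 0.1, "reaching": 0.8},
}
observations = ["upright", "crouched", "reaching"]
path, log_prob = viterbi(observations, states, start_p, trans_p, emit_p)
print(path)  # ['stand', 'bentknees', 'pickup']
```

In this toy setting each observation symbol stands for a keyframe silhouette that has already been quantized. A real system would score silhouettes against the model states directly.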

Next we describe the learning stage, or model development stage.

The Learning Stage

In our multiview laboratory, we observed 11 different actions performed by 10 people from 8 different views. At this stage we took care to minimize image processing problems by making the background white. In any case, this is a stage that has to precede everything else. The following two videos and figure show examples:

The goal then becomes to build an HMM encoding these actions. We extract silhouettes and identify common silhouettes. Each state of the HMM corresponds to a pose and a viewpoint. We then build transitions, set up probabilities, and introduce silent states. For each pose and view, we average the silhouettes from different actors. The following figures show a part of the HMM and some of the averaged silhouettes (hidden states of the HMM).
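The two model-building steps just described, averaging silhouettes across actors and setting up transition probabilities, can be sketched in Python as follows. This is a simplified illustration under assumptions, not our pipeline: silhouettes are tiny binary grids, transition probabilities are estimated by simple counting over labeled keyframe sequences, and silent states are left out. All names and data are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Silhouette = List[List[float]]  # 1.0 = body pixel, 0.0 = background
State = Tuple[str, int]         # (pose name, view index)


def average_silhouettes(samples: List[Silhouette]) -> Silhouette:
    """Pixel-wise mean of same-sized silhouettes from different actors.

    The result is a "fuzzy" pose with values between 0 and 1.
    """
    rows, cols = len(samples[0]), len(samples[0][0])
    n = len(samples)
    return [
        [sum(s[r][c] for s in samples) / n for c in range(cols)]
        for r in range(rows)
    ]


def estimate_transitions(
    sequences: List[List[State]],
) -> Dict[State, Dict[State, float]]:
    """Estimate transition probabilities by counting consecutive states."""
    counts: Dict[State, Dict[State, int]] = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for current, following in zip(seq, seq[1:]):
            counts[current][following] += 1
    probs: Dict[State, Dict[State, float]] = {}
    for current, followers in counts.items():
        total = sum(followers.values())
        probs[current] = {nxt: c / total for nxt, c in followers.items()}
    return probs


# Three actors in the same pose and view (made-up 2x2 silhouettes).
actor_silhouettes = [
    [[0.0, 1.0], [0.0, 1.0]],
    [[0.0, 1.0], [1.0, 1.0]],
    [[0.0, 1.0], [0.0, 1.0]],
]
fuzzy_pose = average_silhouettes(actor_silhouettes)

# Two labeled keyframe sequences seen from view 0 (made-up labels).
training_sequences = [
    [("stand", 0), ("bentknees", 0), ("stand", 0)],
    [("stand", 0), ("stand", 0), ("bentknees", 0)],
]
transitions = estimate_transitions(training_sequences)
print(transitions[("stand", 0)])
```

Pixels where the actors disagree end up with intermediate values in the averaged silhouette, which is one way to read the term "fuzzy poses".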

The Approach

Sample Pickup Video

Sample Walk Video

Input Keyframe Silhouettes for Different Actors (walk, bentknees, stand)

Part of the HMM

Fuzzy Poses: Hidden States

Video 1

Video 2

Video 3

Video 4

Video 5

The Grammars of Human Behavior

PIs: Yiannis Aloimonos & Ken Nakayama

A project funded by the National Science Foundation (HSD)