tures are searched using a 3D complete model. After
image segmentation, the silhouette of the detected im-
age is compared to the virtual silhouette generated by
the model in some predefined postures. The posture
of the model which has the best match, according to
a projection histogram procedure, is chosen to be the
right one. (Bregler and Malik, 1998) use a complete
3D model approach that works either with a single
camera, in which case the human motion is constrained
to a single direction, or with multiple cameras.
In both cases, the initial position of the joints must be
specified by the user. Approaches that fit 3D models
to monocular images can better deal with different
view points, but have problems with occlusions
due to unpredictable variations of the person figures.
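The projection histogram matching mentioned in this paragraph can be illustrated with a minimal sketch. It assumes binary silhouettes stored as equally sized lists of 0/1 rows and an L1 distance between row and column sums; the function names, the toy templates and the distance are illustrative assumptions, not the procedure of the cited works.

```python
def projection_histograms(mask):
    """Return (row sums, column sums) of a binary silhouette mask."""
    rows = [sum(row) for row in mask]
    cols = [sum(col) for col in zip(*mask)]
    return rows, cols


def histogram_distance(mask_a, mask_b):
    """L1 distance between the projection histograms of two masks.

    Both masks are assumed to have the same size (hypothetical choice).
    """
    rows_a, cols_a = projection_histograms(mask_a)
    rows_b, cols_b = projection_histograms(mask_b)
    row_diff = sum(abs(a - b) for a, b in zip(rows_a, rows_b))
    col_diff = sum(abs(a - b) for a, b in zip(cols_a, cols_b))
    return row_diff + col_diff


def best_posture(observed, templates):
    """Pick the template posture whose histograms are closest to the observed ones."""
    return min(templates, key=lambda name: histogram_distance(observed, templates[name]))


# Toy 4x4 silhouettes (hypothetical): a tall figure and a flat one.
TEMPLATES = {
    "standing": [[0, 1, 0, 0], [0, 1, 0, 0], [0, 1, 0, 0], [0, 1, 0, 0]],
    "lying": [[0, 0, 0, 0], [0, 0, 0, 0], [1, 1, 1, 1], [0, 0, 0, 0]],
}
```

A silhouette that is mostly vertical, even with one spurious pixel, is closer to the "standing" template under this distance than to the "lying" one.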
The above works deal with posture tracking in dif-
ferent ways. Some of them (Demirdjian et al., 2003;
Bregler and Malik, 1998; Sminchisescu and Triggs,
2003) use different tracking techniques for computing
and updating the parameters of the model. These
works, however, do not focus on posture classification,
i.e., on determining specific postures. Other works
instead propose a two-step approach: first, model
matching or feature-based procedures are used to
determine a posture within a predefined set; then a
temporal filter is used to integrate these estimates over time.
For example, (Cucchiara et al., 2005b) use projection
histograms to determine postures and then a Hidden
Markov Model to track them over time. In general,
when the goal is to recognize predefined postures,
temporal filtering improves performance
with respect to frame-by-frame classification.
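The benefit of temporal filtering over frame-by-frame classification can be sketched with the forward recursion of a Hidden Markov Model. The sketch below is illustrative only: the "sticky" transition matrix, the stay probability of 0.9, the uniform prior and the per-frame likelihood scores are all assumptions, not values taken from any of the cited works.

```python
POSTURES = ["UP", "SIT", "BENT", "ON KNEE", "LAID"]


def make_transitions(stay=0.9):
    """Hypothetical sticky transition matrix: keep the posture with
    probability `stay`, otherwise move uniformly to another posture."""
    n = len(POSTURES)
    move = (1.0 - stay) / (n - 1)
    return [[stay if i == j else move for j in range(n)] for i in range(n)]


def frame_by_frame(likelihoods):
    """Classify each frame independently by its highest likelihood score."""
    n = len(POSTURES)
    return [POSTURES[max(range(n), key=lambda j: obs[j])] for obs in likelihoods]


def forward_filter(likelihoods, transitions):
    """HMM forward filtering: most probable posture at each frame,
    given all observations up to that frame.

    likelihoods[t][j] is a score proportional to P(observation_t | posture j).
    """
    n = len(POSTURES)
    belief = [1.0 / n] * n  # uniform prior (assumption)
    decoded = []
    for obs in likelihoods:
        # Prediction step: propagate the belief through the transition model.
        predicted = [sum(belief[i] * transitions[i][j] for i in range(n))
                     for j in range(n)]
        # Update step: weight by the current observation and renormalize.
        updated = [predicted[j] * obs[j] for j in range(n)]
        total = sum(updated)
        belief = [u / total for u in updated]
        decoded.append(POSTURES[max(range(n), key=lambda j: belief[j])])
    return decoded
```

With a sequence of frames that favor UP and a single noisy frame that slightly favors SIT, independent classification flips to SIT on that frame, while the filtered estimate stays on UP because one weak observation does not outweigh the accumulated belief.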
Finally, multi-camera settings have been used for
tracking human body movements: (Demirdjian et al.,
2003) use stereo vision and a 3D model of the upper
human body for real-time 3D tracking of head, torso
and arms, while in (Grammalidis et al., 2001) the
parameters of an MPEG-4 3D model are estimated using
the depth image of the person being analyzed.
However, posture classification has not been
explicitly addressed in these works.
The approach to human posture tracking and
classification presented here is based on stereo vision
segmentation. Real-time people tracking through stereo
vision (see for example (Beymer and Konolige, 1999;
Bahadori et al., 2005; Iocchi and Bolles, 2005))
has been successfully used for segmenting scenes in
which people move in the environment, and is able to
provide not only information about the appearance of
a person (e.g., colors) but also 3D information for each
pixel belonging to the person.
In practice, a stereo vision based people tracker
provides, for each frame, a set of data in the form
XYZ-RGB containing a 2½D model and color
information of the person being tracked. Moreover,
correspondences of these data over time are also available;
therefore, when multiple people are in the scene, we
have a set of XYZ-RGB data over time for each
person. This kind of segmentation can of course be
affected by errors, but our experience, as reported in this
paper, is that its output is good enough to support
an effective posture classification technique
such as the one described here. Moreover, the use of
stereo-based tracking provides a high degree of robustness
to illumination changes, shadows and reflections,
thus making the system applicable in a wider range of
situations.
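One possible in-memory layout for the per-person XYZ-RGB data described above is sketched below. The class names, field names and the `(frame_index, person_id, points)` tuple layout are hypothetical; the tracker's actual output format is described in Section 2 and is not reproduced here.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class XYZRGBPoint:
    """One segmented pixel: 3D position plus color (units are an assumption)."""
    x: float
    y: float
    z: float
    r: int
    g: int
    b: int


@dataclass
class PersonTrack:
    """All XYZ-RGB data of one tracked person, indexed by frame number."""
    person_id: int
    frames: Dict[int, List[XYZRGBPoint]] = field(default_factory=dict)

    def add(self, frame_index: int, points: Iterable[XYZRGBPoint]) -> None:
        # Append the points observed for this person in the given frame.
        self.frames.setdefault(frame_index, []).extend(points)


# Hypothetical tracker output record: (frame_index, person_id, points).
Detection = Tuple[int, int, List[XYZRGBPoint]]


def group_by_person(detections: Iterable[Detection]) -> Dict[int, PersonTrack]:
    """Use the tracker's person identifiers to build one time series per person."""
    tracks: Dict[int, PersonTrack] = {}
    for frame_index, person_id, points in detections:
        track = tracks.setdefault(person_id, PersonTrack(person_id))
        track.add(frame_index, points)
    return tracks
```

Grouping by person identifier mirrors the statement that correspondences over time are available, so each person in the scene gets an independent sequence of XYZ-RGB sets.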
The contribution of this paper is to describe a
method for posture tracking and classification given
a set of data in the form XYZ-RGB, corresponding to
the output of a stereo vision based people tracker. The
presented method uses a novel 3D model of the human
body, performs model matching through a variant of
the ICP algorithm, tracks the model parameters over
time, and then uses a Hidden Markov Model to model
posture transitions and to classify among a set of main
human postures: UP, SIT, BENT, ON KNEE, LAID.
The resulting system is able to reliably track human
postures, overcoming some of the difficulties in pos-
ture recognition, and in particular presenting higher
robustness to partial occlusions and to different view
points. Moreover, the system does not require any
off-line training phase: it uses only the first frames (about
10) in which the person is tracked to automatically
learn parameters that are then used for model matching.
During these training frames we only require that
the person is in the standing position (with any
orientation) and that his/her head is not occluded.
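For readers unfamiliar with ICP, the basic alternation that any ICP variant builds on can be sketched as follows. This is deliberately a simplified, translation-only version of the standard algorithm with brute-force nearest-neighbor search; it is not the variant used in this paper, and the function names and parameters are illustrative assumptions.

```python
import math


def nearest(point, candidates):
    """Brute-force nearest neighbor of `point` among `candidates`."""
    return min(candidates, key=lambda q: math.dist(point, q))


def icp_translation(source, target, iterations=20, tol=1e-12):
    """Translation-only ICP: estimate t such that source + t aligns with target.

    Alternates between (1) matching each moved source point to its nearest
    target point and (2) updating t with the mean residual, which is the
    least-squares optimal translation for fixed correspondences.
    """
    t = (0.0, 0.0, 0.0)
    for _ in range(iterations):
        moved = [(x + t[0], y + t[1], z + t[2]) for x, y, z in source]
        matches = [nearest(p, target) for p in moved]
        n = len(moved)
        delta = tuple(
            sum(m[k] - p[k] for p, m in zip(moved, matches)) / n
            for k in range(3)
        )
        t = tuple(t[k] + delta[k] for k in range(3))
        # Stop when the update becomes negligible.
        if sum(d * d for d in delta) < tol:
            break
    return t
```

A full ICP also estimates a rotation at each step, and practical variants differ in how correspondences are selected, weighted or rejected.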
The evaluation of the method has been performed
on the actual output of a stereo vision based peo-
ple tracker, thus validating in practice the chosen ap-
proach. Results show the feasibility of the approach
and its robustness to partial occlusions and different
view points.
The paper is organized as follows. Section 2 de-
scribes the data processed by a stereo vision based
people tracker that are used as input for the method
described here. Section 3 presents a discussion about
the choice of the model that has been used for rep-
resenting human postures, while Section 4 describes
the tracking of the principal points and the computation
of the parameters of the model. Section 5
presents the classification method and Section 6
reports the experimental evaluation of the method.
Conclusions and future work close the paper.
2 IMAGE SEGMENTATION AND
PEOPLE TRACKING
The method presented in this paper takes as input a
sequence of data in the form XYZ-RGB that are rela-
tive to a person tracked in the scene. A stereo vision
VISAPP 2006 - MOTION, TRACKING AND STEREO VISION