based fusion can be treated as a pattern classification
problem in which the scores obtained with the
individual classifiers are seen as input patterns to be
labelled as 'accepted' or 'rejected'. Given linearly
separable two-class training data, the aim is to find
an optimal hyperplane that splits the input data into
the two classes 1 and -1 (the target values that
correspond to the 'accepted' and 'rejected' labels,
respectively) while maximizing the distance from the
hyperplane to the nearest data points of each class.
The optimal hyperplane is then constructed in the
feature space, creating a non-linear boundary in the
input space.
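For illustration, a minimal sketch of this kind of score-level SVM fusion in Python with scikit-learn is given below; the RBF kernel, its parameters and the toy score values are assumptions chosen for the example, not the configuration used in this paper.

    import numpy as np
    from sklearn.svm import SVC

    # Each training pattern is the vector of matcher scores for one trial,
    # labelled +1 ('accepted', genuine) or -1 ('rejected', impostor).
    # These values are placeholders, not data from the paper.
    X_train = np.array([[0.9, 0.8], [0.85, 0.7], [0.2, 0.3], [0.1, 0.4]])
    y_train = np.array([1, 1, -1, -1])

    # A kernel maps the scores to a feature space in which the optimal
    # separating hyperplane yields a non-linear boundary in score space.
    clf = SVC(kernel='rbf', C=1.0, gamma='scale')
    clf.fit(X_train, y_train)

    # The signed distance to the hyperplane can serve as the fused score.
    fused = clf.decision_function(np.array([[0.7, 0.6]]))
    print('accepted' if fused[0] > 0 else 'rejected')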
5 RECOGNITION EXPERIMENTS
Section 5.1 presents some preliminary experiments
on face and speech multimodal identification using
matcher weighting fusion. Section 5.2 describes the
prosody-, vocal tract spectrum- and face-based
recognition systems used in our fusion experiments.
Section 5.3 reports the experimental results obtained
with the SVM and matcher weighting fusion
methods, combined according to three different
fusion strategies.
5.1 Preliminary Experiments
This section presents the audio, video and
multimodal person identification experiments carried
out in the CLEAR'06 Evaluation Campaign
(http://www.clear-evaluation.org). A set of
audiovisual recordings of seminars has been used,
consisting of short video sequences and matching
far-field audio recordings.
For the acoustic speaker identification, 20
Frequency Filtering parameters were generated with
a frame size of 30 ms and a shift of 10 ms, and the 20
corresponding delta and acceleration coefficients
were included. Gaussian Mixture Models (GMM)
with diagonal covariance matrices were used.
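As a rough sketch of this kind of GMM-based identification (feature extraction is omitted, the function names are illustrative, and the mixture size shown here is an assumption, as the paper does not state it for this system):

    import numpy as np
    from sklearn.mixture import GaussianMixture

    def train_speaker_model(frames, n_components=32):
        # One diagonal-covariance GMM per enrolled speaker, trained on
        # frame-level Frequency Filtering feature vectors with their
        # delta and acceleration coefficients. n_components is assumed.
        gmm = GaussianMixture(n_components=n_components,
                              covariance_type='diag')
        gmm.fit(frames)
        return gmm

    def identify(test_frames, speaker_models):
        # Choose the speaker whose model gives the highest average
        # per-frame log-likelihood on the test segment.
        scores = {spk: gmm.score(test_frames)
                  for spk, gmm in speaker_models.items()}
        return max(scores, key=scores.get)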
For the visual identification, a projection-based
technique was developed, which combines the
information of several images to perform the
recognition (Luque, Morros et al. 2006). Models for
all the users were created using segments of 15
seconds. The XM2VTS database was used as
training data for estimating the projection matrix.
For each test segment, the face images of the same
user were gathered into a group; then, for each
group, the system compared the images with the
person model.
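The exact projection technique is that of (Luque, Morros et al. 2006); the following is only a generic sketch of the projection-and-compare idea, with PCA as a stand-in projection and all function names hypothetical.

    import numpy as np
    from sklearn.decomposition import PCA

    def fit_projection(train_images, n_components=50):
        # Estimate the projection matrix on external data (XM2VTS images
        # in the paper); each image is flattened to a row vector.
        # PCA is an assumption here, not the published projection.
        pca = PCA(n_components=n_components)
        pca.fit(train_images)          # shape: (n_images, n_pixels)
        return pca

    def identify_group(group_images, person_models, pca):
        # Project all face images of one test group, combine them (here
        # by averaging, an assumption) and compare the result against
        # each enrolled person model.
        g = pca.transform(group_images).mean(axis=0)
        dists = {p: np.linalg.norm(g - m) for p, m in person_models.items()}
        return min(dists, key=dists.get)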
Segments of different durations (1 and 5
seconds), corresponding to 26 identities, have been
used for testing. Table 1 shows the correct
identification rates obtained with each single
modality (audio and video) and with the fusion. The
identification results are clearly improved when the
multimodal fusion technique is used.
Table 1: Correct identification rates (%) for the audio and
video systems and for multimodal fusion.

    duration (s)   Speech   Video   Fusion
    1               75.0     20.2     76.8
    5               89.3     21.4     92.0
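For reference, a minimal sketch of matcher weighting fusion in its usual form, with each matcher's weight inversely proportional to its equal error rate; the EER and score values below are placeholders, not results from these experiments.

    import numpy as np

    def matcher_weighting_fusion(scores, eers):
        # Weight each matcher inversely to its equal error rate,
        # normalise the weights to sum to one, and fuse by weighted sum.
        w = 1.0 / np.asarray(eers, dtype=float)
        w /= w.sum()
        return float(np.dot(w, np.asarray(scores)))

    # Placeholder normalised scores and EERs for the two modalities.
    print(matcher_weighting_fusion(scores=[0.82, 0.35], eers=[0.05, 0.25]))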
5.2 Experimental Setup
A chimerical database has been created by pairing
the speakers of the Switchboard-I speech database
(Godfrey, Holliman et al. 1990) with the faces of the
XM2VTS video and speech database (Lüttin, Maître
et al. 1998) of the University of Surrey. The
Switchboard-I database has been used for the
speaker recognition experiments. It is a collection of
2430 two-sided telephone conversations among 543
speakers (302 male, 241 female) from all areas of
the United States; each conversation thus contains
two conversation sides. For both speaker recognition
systems, each speaker model was trained with 8
conversation sides and tested according to NIST's
2001 Extended Data task.
Speech scores have been obtained with two
different systems: a voice-spectrum-based speaker
recognition system and a prosody-based recognition
system. The spectrum-based system was the same
GMM system used in the preliminary experiments,
and the UBM was a 32-component Gaussian mixture
model trained with 116 conversations.
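A sketch of the GMM-UBM verification score this setup implies is given below; the log-likelihood ratio form is the standard choice for such systems, and the function names are ours.

    from sklearn.mixture import GaussianMixture

    def train_ubm(background_frames, n_components=32):
        # Universal background model: a diagonal-covariance GMM trained
        # on frames pooled from the background conversations.
        ubm = GaussianMixture(n_components=n_components,
                              covariance_type='diag')
        ubm.fit(background_frames)
        return ubm

    def llr_score(test_frames, speaker_gmm, ubm):
        # Average per-frame log-likelihood ratio between the claimed
        # speaker's model and the UBM; higher values favour acceptance.
        return speaker_gmm.score(test_frames) - ubm.score(test_frames)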
In the prosody-based recognition system, a
9-dimensional prosodic feature vector was extracted
for each conversation side, and the mean and
standard deviation were computed for each
individual feature. The system was tested with one
conversation side, computing the distance between
the test feature vector and the k feature vectors of
the claimed speaker using the k-Nearest Neighbour
method with k=3 and the symmetrized
Kullback-Leibler divergence.
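Assuming each conversation side is summarised by the per-feature means and standard deviations and interpreted as a diagonal Gaussian, this scoring step could look as follows; averaging the k smallest divergences is our assumption about how the k-NN distances are combined.

    import numpy as np

    def sym_kl(mu_p, sd_p, mu_q, sd_q):
        # Symmetrized Kullback-Leibler divergence between two diagonal
        # Gaussians, summed over the 9 prosodic features.
        d2 = (mu_p - mu_q) ** 2
        kl_pq = np.log(sd_q / sd_p) + (sd_p**2 + d2) / (2 * sd_q**2) - 0.5
        kl_qp = np.log(sd_p / sd_q) + (sd_q**2 + d2) / (2 * sd_p**2) - 0.5
        return float(np.sum(kl_pq + kl_qp))

    def knn_score(test_stats, claimed_speaker_stats, k=3):
        # Distance of the test side to the claimed speaker: mean of the
        # k = 3 smallest divergences to that speaker's training sides
        # (the combination rule is assumed, not stated in the paper).
        mu_t, sd_t = test_stats
        d = sorted(sym_kl(mu_t, sd_t, mu, sd)
                   for mu, sd in claimed_speaker_stats)
        return float(np.mean(d[:k]))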
The XM2VTS database was used for the face
recognition experiments. It is a multimodal database
consisting of face images, video sequences and
speech recordings of 295 subjects. Only the face
images (four frontal face images per subject) were
used in our experiments. In order to evaluate
verification algorithms on the database, the