2.1 Articulatory Phonetics
Articulatory phonetics considers the anatomical detail of the production of speech
sounds. This requires describing speech sounds in terms of the position of the vocal
organs. For this purpose, it is convenient to divide speech sounds into vowels and
consonants. Consonants are relatively easy to define in terms of the shape and position
of the vocal organs, but vowels are less well defined; this may be explained by the fact
that the tongue typically never touches another organ when a vowel is produced [8]. In
terms of articulation, the shape of the mouth remains roughly constant while a vowel is
spoken, whereas it changes during a consonant: the vowel is stationary, while the
consonant is non-stationary.
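This stationary/non-stationary distinction can be made concrete with a short numerical sketch (an illustration, not part of the paper's method): a fixed-frequency tone stands in for a vowel-like sound and a frequency sweep for a consonant-like transition, and the spread of the spectral centroid across short frames separates the two.

```python
import numpy as np

def framewise_centroid_std(x, fs, frame=256, hop=128):
    """Std. dev. of the spectral centroid across short frames.

    A roughly stationary signal keeps a nearly constant centroid
    from frame to frame; a non-stationary one does not.
    """
    freqs = np.fft.rfftfreq(frame, 1.0 / fs)
    centroids = []
    for start in range(0, len(x) - frame + 1, hop):
        mag = np.abs(np.fft.rfft(x[start:start + frame] * np.hanning(frame)))
        centroids.append(np.sum(freqs * mag) / (np.sum(mag) + 1e-12))
    return float(np.std(centroids))

fs = 8000
t = np.arange(0, 0.5, 1.0 / fs)
vowel_like = np.sin(2 * np.pi * 300 * t)                   # fixed 300 Hz tone
consonant_like = np.sin(2 * np.pi * (300 + 2000 * t) * t)  # sweeping chirp

# The steady tone varies far less across frames than the sweep.
print(framewise_centroid_std(vowel_like, fs)
      < framewise_centroid_std(consonant_like, fs))  # prints True
```

The names and signal parameters here are illustrative choices only; any short-time spectral statistic would show the same contrast.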
2.2 Face Movement Related to Speech
The face can communicate a variety of information, including subjective emotion, com-
municative intent, and cognitive appraisal. The facial musculature is a three-dimensional
assembly of small, pseudo-independently controlled muscular slips performing a variety
of complex orofacial functions such as speech, mastication, swallowing and the mediation
of emotion [7]. Speech is usually parameterized in terms of phonemes. A phoneme
corresponds to a particular configuration of the mouth during sound emission, and is
associated with specific acoustic properties. These phonemes in turn control the lower-level
parameters for the actual deformations. The shape of the mouth and lips required to
utter a phoneme is achieved by the controlled contraction of the facial muscles, driven
by activity from the nervous system [8].
The surface electromyogram (SEMG) is a non-invasive recording of muscle activity.
It is recorded with electrodes attached to the skin close to the muscle under study.
SEMG is a gross indicator of muscle activity and is used to identify the force of muscle
contraction and the associated movement and posture [1]. Using an SEMG-based system,
Chan et al. [2] demonstrated the presence of speech information in facial myoelectric
signals, and Kumar et al. [3] demonstrated the use of SEMG to identify unspoken sounds
under controlled conditions. A number of challenges are associated with classifying
muscle activity with respect to the associated movement and posture, such as sensitivity
to electrode location, inter-user variation, and sensitivity of the system to intrinsic
factors such as skin conductance and to external factors such as temperature and
electrode condition. Veldhuizen et al. [5] studied the variation of facial EMG over a
single day and showed that facial SEMG activity decreased during the workday and
increased again in the evening.
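Since SEMG is used as a gross indicator of contraction strength, a common first-step measure is a moving-window RMS of the raw signal. The sketch below (with synthetic data and illustrative parameters, not the recording setup of the cited studies) shows a simulated contraction burst standing out from baseline noise.

```python
import numpy as np

def moving_rms(emg, fs, win_ms=50):
    """Moving-window RMS, a common gross measure of SEMG amplitude."""
    win = max(1, int(fs * win_ms / 1000))
    power = np.convolve(emg ** 2, np.ones(win) / win, mode="same")
    return np.sqrt(power)

# Synthetic example: 1 s of low-level noise with a burst of "contraction".
rng = np.random.default_rng(0)
fs = 1000
emg = 0.05 * rng.standard_normal(fs)
emg[400:600] += 0.5 * rng.standard_normal(200)  # simulated muscle burst

rms = moving_rms(emg, fs)
# The RMS inside the burst is several times the baseline level.
print(rms[450:550].mean() > 3 * rms[:300].mean())  # prints True
```

A 50 ms window is a typical trade-off for SEMG amplitude estimation: long enough to smooth the stochastic interference pattern, short enough to track onsets.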
One difficulty with speech identification using facial movement and shape is the
temporal variation when the user is speaking complex, time-varying sounds. Because
the speed of speaking and the length of each sound vary within and between subjects,
it is difficult to determine a suitable analysis window, and when the properties of the
signal are time-varying, identifying suitable features for classification becomes less
robust. Further difficulties arise from the need for segmentation and for identifying the
start and end of movement when the movement is complex. While each of these
challenges is important, as a first step, this paper has considered the use of vowel based verbal