A SIMPLE AND COMPUTATIONALLY EFFICIENT ALGORITHM
FOR REAL-TIME BLIND SOURCE SEPARATION OF SPEECH
MIXTURES
Tarig Ballal, Nedelko Grbic and Abbas Mohammed
Department of Signal Processing, Blekinge Institute of Technology, 372 25 Ronneby, Sweden
Keywords: BSS, blind source separation, speech enhancement, speech analysis, speech synthesis.
Abstract: In this paper we exploit the amplitude diversity provided by two sensors to achieve blind separation of two
speech sources. We propose a simple and highly computationally efficient method for separating sources
that are W-disjoint orthogonal (W-DO), that is, sources whose time-frequency representations are disjoint
sets. The Degenerate Unmixing and Estimation Technique (DUET), a powerful and efficient method that
exploits the W-disjoint orthogonality property, requires extensive computations for maximum likelihood pa-
rameter learning. Our proposed method avoids all the computations required for parameter estimation by
assuming that the sources are "cross high-low diverse (CH-LD)", an assumption that is explained later and
that can be satisfied by exploiting the sensor placements/directions. With this assumption and the W-disjoint
orthogonality property, two binary time-frequency masks that can extract the original sources from one of the
two mixtures can be constructed directly from the amplitude ratios of the time-frequency points of the two
mixtures. The method works very well when tested with both artificial and real mixtures. Its performance is
comparable to that of DUET, while requiring only 2% of the computations of the DUET method. Moreover,
it is free of the convergence problems that lead to poor signal-to-interference ratios (SIR) in the early parts
of the signals. As with all binary masking approaches, the method suffers from artifacts that appear in the
output signals.
1 INTRODUCTION
Blind source separation (BSS) consists of recovering
unobserved signals, or "sources", from several observed
mixtures (Cardoso, 1998). Several algorithms based on
different source assumptions have been proposed. Some
common assumptions are that the sources are statistically
independent (Bell et al., 1995), statistically orthogonal
(Weinstein et al., 1993), non-stationary (Parra et al., 2000),
or generated by finite-dimensional model spaces
(Broman et al., 1999).
The Degenerate Unmixing and Estimation Technique
(DUET) algorithm (Jourjine et al., 2000; Rickard et al.,
2001; Yilmaz et al., 2004) and other proposed methods
(Bofill et al., 2000) exploit the approximate W-disjoint
orthogonality property of speech signals to perform source
separation. Two signals are said to be W-disjoint
orthogonal (W-DO) when their time-frequency
representations are disjoint sets (Jourjine et al., 2000;
Rickard et al., 2001). DUET uses an online algorithm to
perform a gradient search for the mixing parameters while
simultaneously constructing binary time-frequency masks
that are used to partition one of the mixtures and recover
the original source signals. DUET has proved powerful for
speech source separation, and it has been shown to be more
computationally efficient than other existing methods
(Rickard et al., 2001). However, the observation that
separating W-DO sources requires only classifying the
time-frequency points of the mixtures has motivated us to
look for a simpler approach. In other words, for W-DO
sources the separation problem is as simple as classifying
each time-frequency point of a mixture as belonging to one
source or the other.
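To make this precise, the W-DO property and the masking-based recovery it enables can be stated compactly, following the cited DUET papers; the instantaneous mixing model and the mask symbol below are simplifications used here only for illustration:

\[
\hat{S}_1(\tau,\omega)\,\hat{S}_2(\tau,\omega) = 0, \qquad \forall\,(\tau,\omega),
\]

so that each time-frequency point of a mixture \(\hat{X}_1(\tau,\omega) = a_1\hat{S}_1(\tau,\omega) + a_2\hat{S}_2(\tau,\omega)\) carries energy from at most one source, and source \(j\) can be recovered (up to a scaling) as

\[
\hat{S}_j(\tau,\omega) \propto M_j(\tau,\omega)\,\hat{X}_1(\tau,\omega),
\]

where the binary mask \(M_j\) equals 1 on the time-frequency points assigned to source \(j\) and 0 elsewhere.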
In this paper we propose a simple and highly
computationally efficient approach to achieve the above-
mentioned classification. For this purpose we introduce an
additional assumption, namely that the sources are "cross
high-low diverse (CH-LD)". In a system with two sensors,
two sources are said to be CH-LD if the two sources are
not both close to the same sensor. A source is close to a
sensor if its energy received at that sensor is higher than
its energy received at the other sensor.
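As a concrete illustration, the following minimal Python sketch classifies each time-frequency point by comparing the magnitudes of the two mixture STFTs (i.e., by thresholding their amplitude ratio at 1), builds the two binary masks, and applies them to one of the mixtures. The function name, the STFT settings, and the choice of SciPy are our own assumptions for illustration, not the authors' implementation:

```python
import numpy as np
from scipy.signal import stft, istft

def chld_separate(x1, x2, fs, nperseg=512):
    """Hypothetical sketch: separate two W-DO, CH-LD sources
    from two mixtures by amplitude-ratio binary masking."""
    # STFTs of the two sensor signals.
    _, _, X1 = stft(x1, fs=fs, nperseg=nperseg)
    _, _, X2 = stft(x2, fs=fs, nperseg=nperseg)

    # Under the CH-LD assumption, a time-frequency point whose
    # amplitude is larger at sensor 1 is attributed to the source
    # close to sensor 1, and vice versa; no parameters are learned.
    mask1 = np.abs(X1) > np.abs(X2)
    mask2 = ~mask1

    # Partition one of the mixtures (here, sensor 1) with the
    # binary masks and transform back to the time domain.
    _, s1_hat = istft(mask1 * X1, fs=fs, nperseg=nperseg)
    _, s2_hat = istft(mask2 * X1, fs=fs, nperseg=nperseg)
    return s1_hat, s2_hat
```

Because the masks come directly from a single magnitude comparison per time-frequency point, no mixing parameters need to be estimated, which is what removes the parameter-learning computations and convergence issues associated with DUET.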