To train our proposed system, we use the clean speech data included in the Edinburgh Datasets (Botinhao, 2017), which comprise audio recordings of 28 English speakers (14 men and 14 women) sampled at 48 kHz. For noise audio, we used a subset of the noise environments available in the DEMAND datasets (Thiemann et al., 2013); these environments were not included in the test set. The DEMAND datasets include noise
recordings  corresponding  to  six  distinct  acoustic 
scenes (Domestic, Nature, Office, Public, Street and 
Transportation), which are further subdivided into multiple more specific noise sources (Thiemann et al.,
2013). Note that while we used clean speech and noise included in the Edinburgh Datasets, the samples used for training the systems are not the noisy samples found in the noisy speech subset of the Edinburgh Datasets, but rather samples mixed using the method described in (Valin, 2018).
Model generalization is achieved through data augmentation. Since cepstral mean normalization is not applied, the speech and noise signals are filtered independently for each training example through a second-order filter as
$$H(z) = \frac{1 + r_1 z^{-1} + r_2 z^{-2}}{1 + r_3 z^{-1} + r_4 z^{-2}} \qquad (6)$$
where each of $r_1, \ldots, r_4$ follows a uniform distribution in the $[-3/8, 3/8]$ range. We vary the level of the final mixed signal to achieve robustness to the level of the input noisy speech signal. The amount of noise attenuation is also limited, to obtain a better trade-off between the level of noise removal and speech distortion.
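As an illustration, the random second-order filtering of (6) can be sketched as follows. This is a minimal sketch under our reading of the equation; the function and variable names are our own and not taken from any released code.

```python
import numpy as np
from scipy.signal import lfilter

def random_second_order_filter(signal, rng):
    """Filter a signal with H(z) = (1 + r1 z^-1 + r2 z^-2) / (1 + r3 z^-1 + r4 z^-2),
    where r1..r4 are drawn uniformly from [-3/8, 3/8]. With coefficients in this
    range the denominator is always stable."""
    r1, r2, r3, r4 = rng.uniform(-3 / 8, 3 / 8, size=4)
    b = [1.0, r1, r2]  # numerator (zeros)
    a = [1.0, r3, r4]  # denominator (poles)
    return lfilter(b, a, signal)

rng = np.random.default_rng(0)
speech = rng.standard_normal(48000)  # stand-in for one second of 48 kHz speech
noise = rng.standard_normal(48000)   # stand-in for a noise excerpt

# speech and noise are filtered independently for each training example
augmented_speech = random_second_order_filter(speech, rng)
augmented_noise = random_second_order_filter(noise, rng)
```

Drawing fresh coefficients per example gives each training pair a slightly different spectral tilt, which is what makes the augmentation substitute for cepstral mean normalization.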
The test set used is the one provided in the Edinburgh Datasets, which has been specifically created for SE applications and consists of wide-band (48 kHz) clean and noisy speech audio tracks. The noisy speech in the set covers four different SNR levels (2.5 dB, 7.5 dB, 12.5 dB, 17.5 dB). The clean speech tracks included in the set are recordings of two English-language speakers, one male and one female. The noise recordings used in the mixing of the noisy speech tracks were selected from the DEMAND database. More specifically, the noise profiles found in the test set are:
• Office: noise from an office with keyboard typing and mouse clicking
• Living: noise inside a living room
• Cafe: noise from a cafe at a public square
• Bus: noise from a public bus
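Mixing clean speech with noise at a fixed SNR, as used to construct such noisy tracks, can be sketched as below. This is a generic illustration, not the exact script used to build the dataset; the function name is our own.

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Scale the noise so that the clean/noise power ratio equals snr_db,
    then add it to the clean signal."""
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    # gain that brings the noise to the power implied by the target SNR
    gain = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return clean + gain * noise

rng = np.random.default_rng(1)
clean = rng.standard_normal(48000)  # stand-in for a clean speech track
noise = rng.standard_normal(48000)  # stand-in for a DEMAND noise excerpt

# the four SNR levels present in the test set
noisy_tracks = {snr: mix_at_snr(clean, noise, snr) for snr in (2.5, 7.5, 12.5, 17.5)}
```

Scaling the noise rather than the speech keeps the reference clean signal unchanged, which simplifies computing objective metrics such as PESQ and STOI against it.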
The selection of evaluation metrics is of great importance for the consistent evaluation of a system. To evaluate our system, we used a metric that focuses on the intelligibility of the voice signal (STOI), which ranges from 0 to 1, and a metric that focuses on sound quality (PESQ), which ranges from -0.5 to 4.5, with higher values corresponding to better quality.
4  EXPERIMENTAL RESULTS 
AND DISCUSSIONS 
We present our results as a comparison between the reference RNNoise system and the proposed method that makes use of the LTSD feature. As seen in Figure 4, when the two methods are compared with the PESQ quality metric, the proposed method performs better in most acoustic scenarios and at all SNR levels, especially at lower SNRs. Our proposed method outperforms the RNNoise algorithm by 0.12 MOS points on average. Similarly, the STOI intelligibility measure, depicted in Figure 5, indicates that the proposed method also achieves better intelligibility.
We observe a noticeable improvement in performance over the RNNoise method. Having also compared several spectrograms produced by both methods, we observed that the proposed method does indeed remove more noise components in general. In light of these results, further research is required in future work to reduce speech distortion and increase the level of noise removal for different application scenarios.
Firstly, we expect that adding more hidden layers could indeed be beneficial for our proposed system. Given that we provide the system with more, and more diverse, input information, the RNN might be able to make better use of the proposed features with additional hidden layers.
Secondly, after studying samples processed by our extended system, we speculate that the system could benefit from changing how aggressively noise suppression is applied. This can be achieved by fine-tuning the value of the γ parameter in the loss function (3), keeping in mind that smaller γ values lead to more aggressive suppression. According to our experiments, setting γ = 1/2 gives an optimal balance.
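To illustrate the role of γ, the sketch below assumes a gain-domain loss of the RNNoise form $(g^\gamma - \hat{g}^\gamma)^2$; since the loss in (3) is not reproduced here, that form is an assumption, and the names are ours.

```python
import numpy as np

GAMMA = 0.5  # gamma = 1/2, the value found to give a good balance

def gain_loss(g_true, g_est, gamma=GAMMA):
    """Per-band gain loss (g^gamma - g_hat^gamma)^2, averaged over bands.
    Raising gains to a power gamma < 1 magnifies errors at small gains,
    so smaller gamma pushes the model toward more aggressive suppression."""
    return np.mean((g_true ** gamma - g_est ** gamma) ** 2)

# underestimating a small gain costs more with gamma = 1/2 than with gamma = 1
g_true = np.array([0.01, 0.5, 1.0])
g_est = np.array([0.0, 0.5, 1.0])
loss_half = gain_loss(g_true, g_est, gamma=0.5)
loss_one = gain_loss(g_true, g_est, gamma=1.0)
```

The example shows why tuning γ trades noise removal against speech distortion: compressing the gain scale weights the low-gain (noise-dominated) bands more heavily in the loss.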
Finally, we believe that further research can be done on the performance of our proposed systems as the training datasets increase in size and diversity.