to the low number of Dutch samples. It is clear that 30 samples are not enough to build a model, let alone a model sufficient for cross-lingual purposes; such a small set may lead to overfitting and poor generalization.
Training the model with the Hungarian set results in comparable evaluation metrics on the development and the test sets, indicating good generalization ability. About 300 samples (the Hungarian corpus used here) appear to be sufficient to train a model for cross-lingual use. The goal here was not to maximize performance for a single language, but to carry out a preliminary study of cross-lingual detection possibilities using the dataset. A comprehensive study of the Hungarian sample set has been carried out by Tulics et al. (2019).
Features calculated on phonemes also seem to affect cross-lingual performance. These features can increase performance, as was seen with the models built on Hungarian samples. Segmentation was done automatically by forced-alignment ASR. Naturally, this automatic method may introduce errors, but it appears that a performance increase is possible even with such a fully automatic pipeline.
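To make the phoneme-level feature extraction concrete, the following minimal Python sketch computes one feature vector per aligned phoneme segment. It assumes librosa is available and that the aligner returns (start, end, label) tuples in seconds; the frame-averaged MFCCs are illustrative placeholders only, not the acoustic feature set used in this work.

import numpy as np
import librosa

def phoneme_level_features(wav_path, segments, target_label="E"):
    # Load the waveform at its native sampling rate.
    y, sr = librosa.load(wav_path, sr=None)
    feats = []
    for start, end, label in segments:          # segment boundaries in seconds
        if label != target_label:
            continue
        chunk = y[int(start * sr):int(end * sr)]
        if len(chunk) == 0:
            continue
        # Illustrative per-segment features: MFCCs averaged over frames.
        mfcc = librosa.feature.mfcc(y=chunk, sr=sr, n_mfcc=13)
        feats.append(mfcc.mean(axis=1))
    return np.vstack(feats) if feats else np.empty((0, 13))

In practice the per-segment vectors would be aggregated per speaker and appended to the utterance-level features before classification.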
Regression results show the same tendency as classification: a higher severity level increases the ability of the features to separate dysphonic from healthy speech.
Two languages are considered here, Dutch and Hungarian, mainly because speech samples with the same linguistic content were available. However, we acknowledge that this somewhat limits the cross-lingual generalization ability due to the spectral similarities of the two languages. As future research, it would be worthwhile to extend the study to languages with larger differences.
5  CONCLUSIONS 
In the present work, cross-lingual experiments on dysphonic voice detection and dysphonia severity level estimation are carried out. The results show that this is possible using the datasets presented. Various acoustic features are calculated on the entire speech samples and at the phoneme level.
It was found that cross-lingual detection of dysphonic speech is indeed possible with acceptable generalization ability, and that features calculated on phoneme-level parts of speech can improve the results. Support vector machines and support vector regression are used as classification and regression methods. Feature selection and model training are performed with 10-fold cross-validation on the dataset of one (source) language, and the resulting models are evaluated on the other (target) language. On the cross-lingual classification test sets, the highest F1-scores are 0.86 and 0.81 for the feature sets with the vowel /E/ included and excluded, respectively; the corresponding highest Pearson correlations are 0.72 and 0.65. In the future, cross-lingual experiments are planned using more language-independent feature extraction techniques and extended datasets.
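As an illustration of the evaluation protocol summarized above, the following minimal Python sketch trains the classifier and regressor on the source-language data with 10-fold cross-validation and evaluates them on the target language. It assumes scikit-learn and SciPy in place of LIBSVM and RapidMiner, hypothetical feature matrices X_src, y_src, sev_src (source) and X_tgt, y_tgt, sev_tgt (target), and binary healthy/dysphonic labels; the feature-selection settings and hyperparameter grid are illustrative, not the exact configuration used here.

import numpy as np
from scipy.stats import pearsonr
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC, SVR
from sklearn.metrics import f1_score

def cross_lingual_classification(X_src, y_src, X_tgt, y_tgt):
    # Feature selection and model selection run inside 10-fold CV
    # on the source language only.
    pipe = Pipeline([
        ("scale", StandardScaler()),
        ("select", SelectKBest(f_classif)),
        ("svm", SVC(kernel="rbf")),
    ])
    grid = {"select__k": [10, 20, 30],          # illustrative values
            "svm__C": [1, 10, 100]}
    clf = GridSearchCV(pipe, grid, cv=10, scoring="f1")
    clf.fit(X_src, y_src)
    # Evaluate the selected model on the unseen target language.
    return f1_score(y_tgt, clf.predict(X_tgt))

def cross_lingual_regression(X_src, sev_src, X_tgt, sev_tgt):
    # SVR trained on source-language severity labels.
    reg = Pipeline([("scale", StandardScaler()),
                    ("svr", SVR(kernel="rbf", C=10))])
    reg.fit(X_src, sev_src)
    # Pearson correlation between predicted and reference severity
    # on the target language.
    r, _ = pearsonr(sev_tgt, reg.predict(X_tgt))
    return r

Keeping feature selection inside the source-language cross-validation ensures that no information from the target language leaks into model selection, which is essential for a fair cross-lingual evaluation.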
ACKNOWLEDGEMENTS 
Project no. K128568 has been implemented with the 
support  provided  from  the  National  Research, 
Development  and  Innovation  Fund  of  Hungary, 
financed  under  the  K_18  funding  scheme.  The 
research  was  partly  funded  by  the  CELSA 
(CELSA/18/027)  project  titled:  “Models  of 
Pathological  Speech  for  Diagnosis  and  Speech 
Recognition”. 
REFERENCES 
Ali, Z., Talha, M., & Alsulaiman,  M.  (2017). A practical 
approach: Design and  implementation of  a  healthcare 
software  for  screening  of  dysphonic  patients.  IEEE 
Access, 5, 5844–5857. 
Al-Nasheri, A., Muhammad, G., Alsulaiman, M., Ali, Z., Malki, K. H., Mesallam, T. A., & Ibrahim, M. F. (2017). Voice pathology detection and classification using auto-correlation and entropy features in different frequency regions. IEEE Access, 6, 6961–6974.
Boone, D. R., McFarlane, S. C., Von Berg, S. L., & Zraick, 
R. I. (2005). The voice and voice therapy. 
Chang, C.-C., & Lin, C.-J. (2011). LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2(3), 27:1–27:27.
Cohen, S. M., Kim, J., Roy, N., Asche, C., & Courey, M. 
(2012). Prevalence and causes of dysphonia in a large 
treatment-seeking  population.  The  Laryngoscope, 
122(2), 343–348. 
Johns, M. M., Sataloff, R. T., Merati, A. L., & Rosen, C. A. 
(2010).  Shortfalls  of  the  American  Academy  of 
Otolaryngology–Head  and  Neck  Surgery’s  Clinical 
practice  guideline:  Hoarseness  (dysphonia). 
Otolaryngology-Head and Neck Surgery, 143(2), 175–
177. 
Jungermann, F. (2009). Information extraction with RapidMiner. Proceedings of the GSCL Symposium 'Sprachtechnologie und eHumanities', 50–61.
Kiss,  G.,  Sztahó,  D.,  &  Vicsi,  K.  (2013).  Language 
independent  automatic  speech  segmentation  into 
phoneme-like units on the base of acoustic distinctive 
features.  2013  IEEE  4th  International  Conference  on