to the low number of Dutch samples. It is clear that 30 samples are not enough to build a model, let alone one suitable for cross-lingual purposes; such a small training set may lead to overfitting and poor generalization.
Training the model with the Hungarian set results in comparable evaluation metrics on the development and test sets, indicating good generalization ability. About 300 samples (the size of the Hungarian corpus used) appear to be sufficient to train a model for cross-lingual use. The goal here was not to maximize the performance for one language, but to carry out a preliminary study of cross-lingual detection possibilities using the dataset. A comprehensive study on the Hungarian sample set has been carried out by Tulics et al. (2019).
Features calculated on phonemes also appear to affect cross-lingual performance. These features can increase performance, as was seen with the models built on Hungarian samples. Segmentation was done automatically by forced-alignment ASR. Naturally, this automatic method may introduce errors, but the results suggest that a performance increase is possible even with such a fully automatic pipeline.
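As an illustration only, the sketch below shows how frame-level acoustic measures could be pooled over the phoneme segments returned by a forced aligner; the interval format, the feature representation, and the restriction to the vowel /E/ are assumptions made for the example, not the exact implementation used in this work.

# Hypothetical sketch: pooling frame-level features over forced-alignment
# phoneme segments. Data layout and phoneme labels are assumptions.
import numpy as np

def pool_phoneme_features(frames, times, intervals, target_phonemes=("E",)):
    """frames: (n_frames, n_feats) frame-level acoustic features,
    times: frame timestamps in seconds, aligned with `frames`,
    intervals: list of (phoneme_label, start_s, end_s) from the forced aligner."""
    pooled = {}
    for phoneme, start, end in intervals:
        if phoneme not in target_phonemes:
            continue
        mask = (times >= start) & (times < end)
        if mask.any():
            # mean over the frames belonging to this segment
            pooled.setdefault(phoneme, []).append(frames[mask].mean(axis=0))
    # average over all occurrences of each target phoneme in the recording
    return {p: np.mean(segs, axis=0) for p, segs in pooled.items()}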
Regression results show the same tendency as classification: a higher level of severity increases the ability of the features to separate dysphonic voices.
Two languages are considered here, Dutch and Hungarian, mainly because speech samples with the same linguistic content were available. However, we acknowledge that this somewhat limits the cross-lingual generalization ability due to the spectral similarities of the two languages. As future research, it would be worthwhile to extend the study to languages with larger differences.
5 CONCLUSIONS
In the present work, cross-lingual experiments on dysphonic voice detection and dysphonia severity level estimation are carried out. The results show that this is possible using the datasets presented. Various acoustic features are calculated on the entire speech samples and at the phoneme level.
It was found that cross-lingual detection of dysphonic speech is indeed possible with acceptable generalization ability, and that features calculated on phoneme-level parts of speech can improve the results. Support vector machines and support vector regression are used as the classification and regression methods. Feature selection and model training are done on the dataset of one (source) language using 10-fold cross-validation, and the model is then evaluated on the other (target) language (a schematic sketch of this protocol is given below). On the cross-lingual test sets, the highest classification F1-scores are 0.86 and 0.81 for feature sets with the vowel /E/ included and excluded, respectively, and the highest Pearson correlations for severity estimation are 0.72 and 0.65 for the same two feature sets. In the future, cross-lingual experiments are planned using more language-independent feature extraction techniques and extended datasets.
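For illustration, the following sketch (using scikit-learn, whose SVC is built on LIBSVM, cf. Chang and Lin, 2011) shows one possible implementation of the cross-lingual protocol: feature selection and tuning with 10-fold cross-validation on the source language, followed by evaluation on the target language. The feature selector, hyper-parameter grid, and scoring are illustrative assumptions, not the exact settings of this study.

# Minimal sketch of the cross-lingual evaluation protocol (assumed details):
# tune and select features with 10-fold CV on the source language only,
# then test the resulting model on the target language.
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def cross_lingual_f1(X_src, y_src, X_tgt, y_tgt):
    """Train on the source-language data, report F1 on the target language."""
    pipe = Pipeline([
        ("scale", StandardScaler()),         # normalise acoustic features
        ("select", SelectKBest(f_classif)),  # simple filter-based feature selection
        ("svm", SVC(kernel="rbf")),          # SVM classifier (LIBSVM backend)
    ])
    grid = {
        "select__k": [10, 20, 40],           # illustrative feature-set sizes
        "svm__C": [0.1, 1, 10],
        "svm__gamma": ["scale", 0.01, 0.1],
    }
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    search = GridSearchCV(pipe, grid, scoring="f1", cv=cv)
    search.fit(X_src, y_src)                 # 10-fold CV on the source language only
    return f1_score(y_tgt, search.predict(X_tgt))  # evaluate on the target language

The same scheme carries over to severity level estimation by replacing the SVC classifier with support vector regression and the F1-score with the Pearson correlation.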
ACKNOWLEDGEMENTS
Project no. K128568 has been implemented with the
support provided from the National Research,
Development and Innovation Fund of Hungary,
financed under the K_18 funding scheme. The
research was partly funded by the CELSA
(CELSA/18/027) project titled: “Models of
Pathological Speech for Diagnosis and Speech
Recognition”.
REFERENCES
Ali, Z., Talha, M., & Alsulaiman, M. (2017). A practical
approach: Design and implementation of a healthcare
software for screening of dysphonic patients. IEEE
Access, 5, 5844–5857.
Al-Nasheri, A., Muhammad, G., Alsulaiman, M., Ali, Z.,
Malki, K. H., Mesallam, T. A., & Ibrahim, M. F. (2017).
Voice pathology detection and classification using
auto-correlation and entropy features in different
frequency regions. IEEE Access, 6, 6961–6974.
Boone, D. R., McFarlane, S. C., Von Berg, S. L., & Zraick,
R. I. (2005). The voice and voice therapy.
Chang, C.-C., & Lin, C.-J. (2011). LIBSVM: A library for
support vector machines. ACM Transactions on
Intelligent Systems and Technology, 2(3), 27:1-27:27.
Cohen, S. M., Kim, J., Roy, N., Asche, C., & Courey, M.
(2012). Prevalence and causes of dysphonia in a large
treatment-seeking population. The Laryngoscope,
122(2), 343–348.
Johns, M. M., Sataloff, R. T., Merati, A. L., & Rosen, C. A.
(2010). Shortfalls of the American Academy of
Otolaryngology–Head and Neck Surgery’s Clinical
practice guideline: Hoarseness (dysphonia).
Otolaryngology-Head and Neck Surgery, 143(2), 175–
177.
Jungermann, F. (2009). Information extraction with RapidMiner. Proceedings of the GSCL Symposium 'Sprachtechnologie und eHumanities', 50–61.
Kiss, G., Sztahó, D., & Vicsi, K. (2013). Language
independent automatic speech segmentation into
phoneme-like units on the base of acoustic distinctive
features. 2013 IEEE 4th International Conference on