6 CONCLUSION
In this study, we introduced a disease thesaurus as a seed vector
for semantic representation learning using a CAC construction
method. We showed that, with 264 selected disease-name feature
words, disease name estimation achieved an F1-score of 72.4, about
10 points higher than with the general-purpose word semantic vector
dictionary, using a faster linear SVM. We also showed that semantic
representation learning of progress summaries in electronic medical
records could provide higher-level concepts of disease names as a
basis for disease name estimation, with an accuracy of 70%.
Estimation of the higher-level concepts failed in cases where those
concepts were not among the feature words, because feature-word
selection was limited to disease names of five or fewer characters.
Adding these correct disease names to the feature words could
significantly improve accuracy.
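The effect of a length limit on feature-word selection can be illustrated with a minimal sketch. It assumes the selection rule simply keeps names of at most five characters; the toy thesaurus, the placeholder names in it, and the helper functions `select_feature_words` and `missing_parents` are hypothetical and are not taken from the experiments.

```python
# Hypothetical toy thesaurus: disease name -> higher-level concept.
# All names are placeholders, not entries from the actual disease thesaurus.
TOY_THESAURUS = {
    "dis_a": "grp_1",
    "dis_b": "grp_1",
    "dis_c": "group_long_2",
}


def select_feature_words(names, max_len=5):
    """Keep only names whose character length is at most max_len (assumed rule)."""
    return {name for name in names if len(name) <= max_len}


def missing_parents(thesaurus, feature_words):
    """Return the higher-level concepts that are absent from the feature words."""
    return sorted({parent for parent in thesaurus.values()
                   if parent not in feature_words})


# Candidate names are every disease name and every higher-level concept.
all_names = set(TOY_THESAURUS) | set(TOY_THESAURUS.values())

# Length-limited selection drops the long higher-level concept.
features = select_feature_words(all_names)
missing = missing_parents(TOY_THESAURUS, features)

# Adding the dropped concepts back makes every parent available as a feature.
augmented = features | set(missing)
```

In this toy setting, the one higher-level concept that exceeds the limit is dropped during selection, so it can never be offered as a basis for estimation until it is added back explicitly.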
Comparative experiments on disease name estimation using doc2vec
showed that, although distributed representation learning can be
adapted to a given corpus, estimation accuracy degrades
significantly when the learned model comes from data with a
substantially different distribution. Although the proposed method
was able to solve this problem, the F1-score needs to be improved
further for practical use. In the future, we plan to integrate our
method with a BERT/transfer model (Kawazoe et al., 2021) trained on
a large number of Japanese medical texts to improve the accuracy of
estimating interpretable disease names to a practical level.
ACKNOWLEDGEMENTS
This work was supported by JSPS KAKENHI Grant
Number 20K11833. This study was approved by the
Ethical Review Committee of the Fukui University of
Technology and the Toyama University Hospital.
REFERENCES
Faruqui, M., Dodge, J., Jauhar, S. K., Dyer, C., Hovy, E.,
and Smith, N. A. (2015). Retrofitting word vectors
to semantic lexicons. In Proc. of NAACL-HLT, pages
1606–1615.
Keshi, I., Ikeuchi, H., and Kuromusha, K. (1996). Associa-
tive image retrieval using knowledge in encyclopedia
text. Systems and Computers in Japan, 27(12):53–62.
Keshi, I., Suzuki, Y., Yoshino, K., and Nakamura, S.
(2017). Semantically readable distributed represen-
tation learning for social media mining. In Proc. of
IEEE/WIC/ACM International Conference on Web In-
telligence (WI), pages 716–722.
Keshi, I., Suzuki, Y., Yoshino, K., and Nakamura, S.
(2018). Semantically readable distributed representation
learning and its expandability using a word semantic
vector dictionary. IEICE Transactions on Information
and Systems, E101-D(4):1066–1078.
Le, Q. V. and Mikolov, T. (2014). Distributed representa-
tions of sentences and documents. In Proc. of ICML,
pages 1188–1196.
Lemaître, G., Nogueira, F., and Aridas, C. K. (2017).
Imbalanced-learn: A python toolbox to tackle the
curse of imbalanced datasets in machine learning.
Journal of Machine Learning Research, 18(17):1–5.
Luo, H., Liu, Z., Luan, H., and Sun, M. (2015). Online
Learning of Interpretable Word Embeddings. In Proc.
of EMNLP, pages 1687–1692.
Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013a).
Efficient estimation of word representations in vector
space. CoRR, abs/1301.3781.
Mikolov, T., Sutskever, I., Chen, K., Corrado, G., and
Dean, J. (2013b). Distributed representations of
words and phrases and their compositionality. CoRR,
abs/1310.4546.
Mikolov, T., Yih, W., and Zweig, G. (2013c). Linguistic
regularities in continuous space word representations.
In Proc. of NAACL, pages 746–751.
Sun, F., Guo, J., Lan, Y., Xu, J., and Cheng, X. (2016).
Sparse Word Embeddings Using ℓ1 Regularized Online
Learning. In Proc. of IJCAI, pages 2915–2921.
Suzuki, T., Doi, S., et al. (2019).
Development of discharge summary audit support
application by text mining. In Proc. of Japan Association
for Medical Informatics, volume 39, pages 667–668.
Tsujioka, K., Keshi, I., Nakagawa, H., and Hayashi, A.
(2022). Research on a method for constructing a
Japanese version of computer assisted coding us-
ing natural language processing. Health Information
Management, 34(1):56–64.
Kawazoe, Y., Shibata, D., Shinohara, E., Aramaki, E., and Ohe, K.
(2021). A clinical specific BERT developed using a huge
Japanese clinical text corpus. PLoS One, 16(11):e0259763.
KDIR 2022 - 14th International Conference on Knowledge Discovery and Information Retrieval