Vocabulary Modifications for Domain-adaptive Pretraining of Clinical Language Models

Anastasios Lamproudis, Aron Henriksson, Hercules Dalianis

2022

Abstract

Research has shown that using generic language models – specifically, BERT models – in specialized domains may be sub-optimal due to domain differences in language use and vocabulary. There are several techniques for developing domain-specific language models that leverage the use of existing generic language models, including continued and domain-adaptive pretraining with in-domain data. Here, we investigate a strategy based on using a domain-specific vocabulary, while leveraging a generic language model for initialization. The results demonstrate that domain-adaptive pretraining, in combination with a domain-specific vocabulary – as opposed to a general-domain vocabulary – yields improvements on two downstream clinical NLP tasks for Swedish. The results highlight the value of domain-adaptive pretraining when developing specialized language models and indicate that it is beneficial to adapt the vocabulary of the language model to the target domain prior to continued, domain-adaptive pretraining of a generic language model.
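The vocabulary-adaptation strategy described above requires initializing embeddings for new, domain-specific tokens from the generic model. The paper's exact procedure is not given here; the sketch below illustrates one common heuristic under that assumption: each new token's vector is the mean of the generic model's vectors for the subword pieces it decomposes into, while tokens shared between the two vocabularies keep their original vectors. All names and the toy embedding table are hypothetical.

```python
def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def init_new_embeddings(old_embeddings, new_vocab, old_tokenize):
    """Build an embedding table for a domain-specific vocabulary.

    old_embeddings: dict mapping generic-vocab tokens to vectors
    new_vocab: list of tokens in the domain-specific vocabulary
    old_tokenize: function splitting a token into generic-vocab pieces
    """
    new_embeddings = {}
    for token in new_vocab:
        if token in old_embeddings:
            # Token shared between vocabularies: copy its vector.
            new_embeddings[token] = old_embeddings[token]
        else:
            # New domain term: average the generic subword vectors.
            pieces = old_tokenize(token)
            new_embeddings[token] = mean_vector(
                [old_embeddings[p] for p in pieces]
            )
    return new_embeddings

# Toy usage with a hypothetical two-dimensional embedding table.
old = {"card": [1.0, 0.0], "##io": [0.0, 1.0], "pain": [2.0, 2.0]}
new = init_new_embeddings(
    old,
    new_vocab=["pain", "cardio"],
    old_tokenize=lambda t: ["card", "##io"],  # stand-in WordPiece split
)
```

After this initialization, the model can undergo continued, domain-adaptive pretraining on in-domain (here, clinical Swedish) text, so the new embeddings are refined rather than learned from scratch.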

Paper Citation


in Harvard Style

Lamproudis A., Henriksson A. and Dalianis H. (2022). Vocabulary Modifications for Domain-adaptive Pretraining of Clinical Language Models. In Proceedings of the 15th International Joint Conference on Biomedical Engineering Systems and Technologies (BIOSTEC 2022) - Volume 5: HEALTHINF; ISBN 978-989-758-552-4, SciTePress, pages 180-188. DOI: 10.5220/0010893800003123


in Bibtex Style

@conference{healthinf22,
author={Anastasios Lamproudis and Aron Henriksson and Hercules Dalianis},
title={Vocabulary Modifications for Domain-adaptive Pretraining of Clinical Language Models},
booktitle={Proceedings of the 15th International Joint Conference on Biomedical Engineering Systems and Technologies (BIOSTEC 2022) - Volume 5: HEALTHINF},
year={2022},
pages={180-188},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0010893800003123},
isbn={978-989-758-552-4},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 15th International Joint Conference on Biomedical Engineering Systems and Technologies (BIOSTEC 2022) - Volume 5: HEALTHINF
TI - Vocabulary Modifications for Domain-adaptive Pretraining of Clinical Language Models
SN - 978-989-758-552-4
AU - Lamproudis A.
AU - Henriksson A.
AU - Dalianis H.
PY - 2022
SP - 180
EP - 188
DO - 10.5220/0010893800003123
PB - SciTePress