EntityBERT: Entity-centric Masking Strategy for Model Pretraining for the Clinical Domain
Affiliation: University of Arizona
Citation: Lin, C., Miller, T., Dligach, D., Bethard, S., & Savova, G. (2021, June). EntityBERT: Entity-centric Masking Strategy for Model Pretraining for the Clinical Domain. In Proceedings of the 20th Workshop on Biomedical Language Processing (pp. 191-201).
Rights: Copyright © 2021 Association for Computational Linguistics. Licensed under a Creative Commons Attribution 4.0 International License.
Collection Information: This item from the UA Faculty Publications collection is made available by the University of Arizona with support from the University of Arizona Libraries. If you have questions, please contact us at firstname.lastname@example.org.
Abstract: Transformer-based neural language models have led to breakthroughs for a variety of natural language processing (NLP) tasks. However, most models are pretrained on general-domain data. We propose a methodology to produce a model focused on the clinical domain: continued pretraining of a model with a broad representation of biomedical terminology (PubMedBERT) on a clinical corpus, along with a novel entity-centric masking strategy to infuse domain knowledge into the learning process. We show that such a model achieves superior results on clinical extraction tasks by comparing our entity-centric masking strategy with classic random masking on three clinical NLP tasks: cross-domain negation detection (Wu et al., 2014), document time relation (DocTimeRel) classification (Lin et al., 2020b), and temporal relation extraction (Wright-Bettner et al., 2020). We also evaluate our models on the PubMedQA (Jin et al., 2019) dataset to measure the models' performance on a non-entity-centric task in the biomedical domain. The language addressed in this work is English. © 2021 Association for Computational Linguistics
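The core idea of entity-centric masking, as the abstract describes it, is to mask whole clinically relevant entity spans rather than tokens chosen uniformly at random. The following minimal Python sketch illustrates that contrast; it is not the authors' implementation, and the span format, `mask_prob` parameter, and example entities are assumptions for illustration only.

```python
import random

MASK = "[MASK]"

def entity_centric_mask(tokens, entity_spans, mask_prob=0.5, seed=0):
    """Illustrative sketch: mask entire entity spans (start, end-exclusive
    token indices) with probability mask_prob, instead of masking
    individual tokens at random as in classic BERT pretraining.

    The span format and probability are assumptions, not the paper's
    actual hyperparameters.
    """
    rng = random.Random(seed)
    masked = list(tokens)
    for start, end in entity_spans:
        if rng.random() < mask_prob:
            # Mask every token in the entity span as a unit.
            for i in range(start, end):
                masked[i] = MASK
    return masked

# Hypothetical clinical sentence with hand-labeled entity spans:
# "denies" (a negation cue) and "chest pain" (an event mention).
tokens = ["patient", "denies", "chest", "pain", "on", "admission"]
spans = [(1, 2), (2, 4)]
print(entity_centric_mask(tokens, spans, mask_prob=1.0))
# → ['patient', '[MASK]', '[MASK]', '[MASK]', 'on', 'admission']
```

The design point is that masking a multi-token entity as one unit forces the model to predict domain terms from surrounding clinical context, rather than completing an entity from its own partially visible tokens.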
Note: Open access journal
Version: Final published version