Investigations into the value of labeled and unlabeled data in biomedical entity recognition and word sense disambiguation

Date created: 
2021-03-31
Identifier: 
etd21291
Keywords: 
Biomedical named entity recognition
Word sense disambiguation
Word sense induction
Abstract: 

Human annotations, especially in highly technical domains, are expensive and time consuming togather, and can also be erroneous. As a result, we never have sufficiently accurate data to train andevaluate supervised methods. In this thesis, we address this problem by taking a semi-supervised approach to biomedical namedentity recognition (NER), and by proposing an inventory-independent evaluation framework for supervised and unsupervised word sense disambiguation. Our contributions are as follows: We introduce a novel graph-based semi-supervised approach to named entity recognition(NER) and exploit pre-trained contextualized word embeddings in several biomedical NER tasks. We propose a new evaluation framework for word sense disambiguation that permits a fair comparison between supervised methods trained on different sense inventories as well as unsupervised methods without a fixed sense inventory.

Document type: 
Thesis
Rights: 
This thesis may be printed or downloaded for non-commercial research and scholarly purposes. Copyright remains with the author.
File(s): 
Supervisor(s): 
Anoop Sarkar
Department: 
Applied Sciences: School of Computing Science
Thesis type: 
(Thesis) Ph.D.
Statistics: