Recognizing named entities in biomedical texts

Date created: 
Natural Language Processing (Computer Science)
Text Processing (Computer Science)
Supervised Learning (Machine Learning)
Information Retrieval
Computational Linguistics
Biomedical Named Entity Recognition
Support Vector Machine
Conditional Random Field
Annotated Corpus
Biomedical Text
Chinese Text

Named entities (NEs) in biomedical text are objects of interest to biomedical researchers, such as proteins and genes. Identifying them accurately is important for Biomedical Natural Language Processing (BioNLP). Focusing on biomedical named entity recognition (BioNER), this thesis presents novel results on five topics. First, we study whether corpus-based statistical learning methods, currently dominant in BioNER, would achieve close-to-human performance if trained on larger corpora. We find that a significantly larger corpus is required to achieve performance significantly higher than the state of the art obtained on the GENIA corpus. This finding suggests that enlarging the training corpus alone is unlikely to close the gap to human performance. Second, we address the issue of nested NEs and propose a level-by-level method that learns a separate NER model for each level of nesting. We show that this method works well for both nested and non-nested NEs. Third, we propose a method that builds NEs on top of base NP chunks, and examine its benefits and problems. Our experiments show that this method, though currently inferior to statistical word-based approaches, has the potential to outperform them, provided that domain-specific rules can be designed to determine NE boundaries from NP chunks. Fourth, we present a method for BioNER in the absence of annotated corpora. It uses an NE dictionary to label sentences, and then uses these partially labeled sentences to iteratively train an SVM model in a semi-supervised manner. Our experiments validate the effectiveness of the method. Finally, we explore BioNER in Chinese text, an area that has not been studied in previous work. We train a character-based CRF model on a small set of manually annotated Chinese biomedical abstracts, and examine the features usable for the model.
Our evaluation suggests that corpus-based statistical learning approaches hold promise for this task. All the proposed methods are novel and are applicable beyond the NE types and languages considered here, and beyond the BioNER task itself.
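
To make the fourth topic more concrete, the following is a minimal sketch of the dictionary-labeling step only, under stated assumptions. It is not the thesis's implementation: the function name `label_with_dictionary`, the toy dictionary, the greedy longest-match strategy, the BIO tag scheme, and the use of "?" for tokens the dictionary does not cover are all illustrative choices. The iterative SVM training that would consume these partially labeled sentences is not shown.

```python
from typing import Dict, List, Tuple


def label_with_dictionary(
    tokens: List[str],
    dictionary: Dict[Tuple[str, ...], str],
    max_len: int = 5,
) -> List[str]:
    """Label tokens with BIO tags by greedy longest-match dictionary lookup.

    Tokens not covered by any dictionary entry are tagged "?" (unknown)
    rather than "O": a missing dictionary entry does not prove that a
    token is outside an entity, so the labeling is only partial.
    This treatment of unmatched tokens is an assumption of this sketch.
    """
    labels = ["?"] * len(tokens)
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest candidate span first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            key = tuple(t.lower() for t in tokens[i:i + n])
            if key in dictionary:
                ne_type = dictionary[key]
                labels[i] = "B-" + ne_type
                for j in range(i + 1, i + n):
                    labels[j] = "I-" + ne_type
                i += n
                matched = True
                break
        if not matched:
            i += 1
    return labels


# Hypothetical toy dictionary: lowercased token tuples mapped to an NE type.
toy_dictionary = {
    ("interleukin", "2"): "protein",
    ("nf-kappa", "b"): "protein",
}

tokens = "Interleukin 2 activates NF-kappa B in T cells".split()
labels = label_with_dictionary(tokens, toy_dictionary)
for token, label in zip(tokens, labels):
    print(f"{token}\t{label}")
```

In a semi-supervised setup of the kind the abstract describes, the tokens tagged "?" would be the ones a model is asked to label in later training rounds, while dictionary matches serve as the initial seeds.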

The author has placed restrictions on the PDF copy of this thesis. The PDF can be neither printed nor copied. If you would like the SFU Library to attempt to contact the author for permission to print a copy, please email your request to
Document type: 
Copyright remains with the author
Senior supervisor: 
School of Computing Science - Simon Fraser University
Thesis type: 
Thesis (Ph.D.)