Gu, Baohua

Resource type

Thesis

Thesis type

(Thesis) Ph.D.

Date created

2008

Authors/Contributors

Author: Gu, Baohua

Abstract

Named Entities (NEs) in biomedical text refer to objects that are of interest to biomedical researchers, such as proteins and genes. Accurately identifying them is important for Biomedical Natural Language Processing (BioNLP). Focusing on biomedical named entity recognition (BioNER), this thesis presents a number of novel results on the following topics in this area. First, we study whether corpus-based statistical learning methods, currently dominant in BioNER, would achieve close-to-human performance by using larger corpora for training. We find that a significantly larger corpus is required to achieve a performance significantly higher than the state of the art obtained on the GENIA corpus. This finding suggests that the hypothesis is not warranted. Second, we address the issue of nested NEs and propose a level-by-level method that learns a separate NER model for each level of the nesting. We show that this method works well for both nested and non-nested NEs. Third, we propose a method that builds NEs on top of base NP chunks, and examine the associated benefits as well as problems. Our experiments show that this method, though inferior to statistical word-based approaches, has the potential to outperform them, provided that domain-specific rules can be designed to determine NE boundaries based on NP chunks. Fourth, we present a method for performing BioNER in the absence of annotated corpora. It uses an NE dictionary to label sentences, and then uses these partially labeled sentences to iteratively train an SVM model in the manner of semi-supervised learning. Our experiments validate the effectiveness of the method. Finally, we explore BioNER in Chinese text, an area that has not been studied by previous work. We train a character-based CRF model on a small set of manually annotated Chinese biomedical abstracts. We also examine the features usable for the model. Our evaluation suggests that corpus-based statistical learning approaches hold promise for this particular task. All the proposed methods are novel and have applicability beyond the NE types and the languages considered here, and beyond the BioNER task itself.

Copyright statement

Copyright is held by the author.

Permissions

The author has not granted permission for the file to be printed nor for the text to be copied and pasted. If you would like a printable copy of this thesis, please contact summit-permissions@sfu.ca.

Scholarly level

Graduate student (PhD)

Language

English

Member of collection

Computing Science Theses

Download file	Size
etd4053.pdf	5.27 MB

Recognizing named entities in biomedical texts
