Author: Shi, Zhongmin
In this thesis, we study the extraction of biomedical relations, specifically the extraction of bacterial protein subcellular localizations (BPLs), from abstracts of biomedical scientific articles. A BPL indicates where a protein is located in the bacterium. Knowing a protein's BPL provides a valuable clue to its biological function and helps to identify suitable drug, vaccine and diagnostic targets. The work is motivated by our collaboration with researchers in molecular biology, with the goal of automatically extracting BPLs from text to expand their BPL database. Our research on BPL extraction focuses on two learning perspectives: generative and discriminative learning. We propose a three-tier system that integrates a generative model, a discriminative model and a graph-based model to extract BPLs from MEDLINE abstracts. The generative model integrates syntactic features and domain-specific semantic features on the parse tree of a sentence. The model is capable of identifying biomedical named entities and relations simultaneously from a large set of noisy data and exhibits a significant improvement in overall performance over a supervised alternative. We also introduce a discriminative model that applies rich syntactic features from parse trees to extract relations from single sentences. A hybrid pipelined system that integrates the generative and discriminative models shows a further improvement over the generative model alone. Finally, we implement a graph model, Biomedical Relation Networks (BRNs), to identify global and hidden relations across multiple sentences and documents. Based on the binary predictions of the generative and discriminative models, a BRN integrates ontological and functional relations in a directed weighted cyclic graph, and is capable of extracting BPLs, distinguishing them from other relations, and detecting inconsistent predictions.
The study is new to the biomedical natural language processing community in terms of the specific molecular biology task and the capture of the ternary relation among bacterium, protein and location. Our key contributions also lie in learning from noisy data, integrating syntactic and semantic features to extract named entities and relations simultaneously, and establishing an annotated BPL corpus that will benefit relation extraction research.
Copyright is held by the author.
The author has not granted permission for the file to be printed or for the text to be copied and pasted. If you would like a printable copy of this thesis, please contact email@example.com.