Experimental comparison of discriminative learning approaches for Chinese word segmentation

Author: 
Date created: 
2008
Keywords: 
Chinese language -- Data processing
Text processing (Computer science)
Machine learning
Computational linguistics
Word segmentation
Natural language processing
Abstract: 

Most natural language processing tasks assume that the input is tokenized into individual words. In languages such as Chinese, however, word boundaries are not marked in the written form. This thesis explores the use of machine learning to segment Chinese sentences into word tokens. We conduct a detailed experimental comparison of several methods for word segmentation, building two Chinese word segmentation systems and evaluating them on standard data sets. The state of the art in this area uses character-level features, with the best segmentation found by conditional random fields (CRFs). Our first system combines different CRF models and dictionary-based matching by majority voting, and it outperforms the individual methods. Our second system introduces novel global features for word segmentation, with feature weights trained using the averaged perceptron algorithm. Adding global features significantly improves performance over character-level CRF models.
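
The first system described above combines several segmenters by per-character majority voting. The following is a minimal Python sketch of that idea, assuming a BMES character-tagging scheme (B = begin, M = middle, E = end, S = single-character word); the function names, tie-breaking rule, and example data are illustrative assumptions, not the thesis's actual implementation.

    from collections import Counter

    def vote_segmentation(sentence, taggings):
        # `taggings` holds one BMES tag sequence per segmenter, each the
        # same length as `sentence`. The BMES scheme and arbitrary
        # tie-breaking are assumptions made for this sketch.
        voted = []
        for i in range(len(sentence)):
            tags_at_i = Counter(seq[i] for seq in taggings)
            voted.append(tags_at_i.most_common(1)[0][0])  # majority tag

        # Rebuild words from the voted tags: a word ends after E or S.
        words, current = [], ""
        for ch, tag in zip(sentence, voted):
            current += ch
            if tag in ("E", "S"):
                words.append(current)
                current = ""
        if current:  # flush any trailing partial word
            words.append(current)
        return words

    # Hypothetical outputs from three segmenters for a 4-character input.
    sentence = "我爱北京"
    taggings = [
        ["S", "S", "B", "E"],
        ["S", "S", "B", "E"],
        ["S", "B", "E", "S"],
    ]
    print(vote_segmentation(sentence, taggings))  # ['我', '爱', '北京']

In this sketch, voting is done independently at each character position, so the combined tag sequence need not match any single segmenter's output; the dictionary-based matcher would simply contribute one more tag sequence to `taggings`.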

Description: 
The author has placed restrictions on the PDF copy of this thesis. The PDF is not printable nor copyable. If you would like the SFU Library to attempt to contact the author to get permission to print a copy, please email your request to summit-permissions@sfu.ca.
Language: 
English
Document type: 
Thesis
Rights: 
Copyright remains with the author
File(s): 
Senior supervisor: 
A
Department: 
School of Computing Science - Simon Fraser University
Thesis type: 
(Computing Science) Thesis (M.Sc.)