Unsupervised morphological segmentation for statistical machine translation

Date created: 
Natural language processing
Statistical machine translation
Morphology generation
Segmentation-based translation

Statistical Machine Translation (SMT) techniques often assume that the word is the basic unit of analysis. These techniques work well when producing output in languages like English, which has simple morphology and hence few word forms, but tend to perform poorly on languages like Finnish, whose very complex morphological systems yield a large vocabulary. This thesis examines various methods of augmenting SMT models with morphological information to improve the quality of translation into morphologically rich languages, and compares them on an English-Finnish translation task. We investigate the three main methods for integrating morphological awareness into SMT systems: factored models, segmented translation, and morphology generation models. We incorporate previously proposed unsupervised morphological segmentation methods into the translation model and combine this segmentation-based system with a Conditional Random Field morphology prediction model. We find that the morphology-aware models yield significantly more fluent translation output than a baseline word-based model.

Document type: 
Copyright remains with the author. The author granted permission for the file to be printed and for the text to be copied and pasted.
Senior supervisor: 
Anoop Sarkar
Applied Science: School of Computing Science
Thesis type: 
((Computing Science) Thesis) M.Sc.