Leveraging diverse sources in statistical machine translation

Author: 
Date created: 
2013-07-29
Identifier: 
etd8018
Keywords: 
Statistical machine translation
SMT
Machine translation
Ensemble decoding
Stacking
Diverse sources
Domain adaptation
Triangulation
Model combination
Mixture operation
Ensemble learning
Graph propagation
Oov
Out-of-vocabulary translation mining
Graph propagation for SMT
Lexicon induction
Abstract: 

Statistical machine translation (SMT) is often faced with the problem of having insufficient training data for many language pairs. We propose several approaches to leveraging other available sources in SMT systems to enhance the quality of translation. Particularly, we propose approaches suitable in these four scenarios: 1. when an additional parallel corpus is available; 2. when parallel corpora between the source language and a third language and between that language and the target language are available; 3. when an abundant source-language monolingual corpus is available; 4. when no additional resource is available. In the heart of these solutions lie two novel approaches: ensemble decoding and a graph propagation approach for paraphrasing out-of-vocabulary words. Ensemble decoding combines a number of translation systems dynamically at the decoding step. Our experimental results show that ensemble decoding outperforms various strong baselines including mixture models, the current state-of-the-art for domain adaptation in machine translation. We extend ensemble decoding to do triangulation on-the-fly when there exist parallel corpora between the source language and one or multiple pivot languages and between those and the target language. These triangulated systems are dynamically combined together and possibly to a direct source-target system. Experiments in 12 different language pairs show significant improvements over the baselines in terms of BLEU scores. Ensemble decoding can also be used to apply stacking to statistical machine translation. Stacking is an ensemble learning approach that enhances the bias of the models. We show that stacking can consistently and significantly improve over the conventional SMT systems in two different language pairs and three different training sizes. In addition to ensemble decoding, we propose a novel approach to mining translations for OOV words using a monolingual corpus on the source-side language. We induce a lexicon by constructing a graph on the source language phrases and employ a graph propagation technique in order to find translations for those phrases. Experimental results in two different settings show that our graph propagation method significantly improves performance over two strong baselines under intrinsic and extrinsic evaluation metrics.

Document type: 
Thesis
Rights: 
Copyright remains with the author. The author granted permission for the file to be printed and for the text to be copied and pasted.
File(s): 
Senior supervisor: 
Anoop Sarkar
Fred Popowich
Department: 
Applied Sciences: School of Computing Science
Thesis type: 
(Thesis) Ph.D.
Statistics: