Paraphrases for Statistical Machine Translation

Date created: 
2015-09-25
Identifier: 
etd9252
Keywords: 
Natural Language Processing
Statistical Machine Translation
Paraphrase Database
Resource poor languages
Morphologically complex languages
Graph-based semi-supervised method
Multilingual resources
PPDB
Out-of-vocabulary
OOV
Abstract: 

Statistical Machine Translation (SMT) is the task of automatically translating between two natural languages (a source language and a target language) using bilingual corpora. To accomplish this, machine learning models try to capture human translation patterns in a bilingual corpus. An open challenge for SMT is finding translations for phrases that are missing from the training data (out-of-vocabulary phrases). We propose using paraphrases to provide translations for out-of-vocabulary (OOV) phrases. We compare two major approaches to automatically extracting paraphrases from corpora: distributional profiles (DP) and bilingual pivoting. The multilingual Paraphrase Database (PPDB) is a freely available resource of paraphrases in multiple languages, created automatically using bilingual pivoting. We show that a graph propagation approach that uses PPDB paraphrases can improve overall translation quality. We provide an extensive comparison with previous work and show that our PPDB-based method improves the BLEU score by up to 1.79 percentage points. We show that our approach improves on the state of the art in three settings: with a limited amount of parallel training data; with a domain shift between training and test data; and with a morphologically complex source language.
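
To make the graph propagation idea in the abstract concrete, the following is a minimal sketch, not the thesis's actual algorithm. It assumes a graph whose nodes are source phrases and whose weighted edges are paraphrase scores (such as those a resource like PPDB could supply). Phrases seen in training carry a translation distribution, and each OOV phrase receives the weighted average of its neighbours' distributions. The function name `propagate`, the data layout, and the toy phrases and scores are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def propagate(edges, seeds, iterations=10):
    """Spread translation distributions from seed phrases to their paraphrases.

    edges: {(phrase_a, phrase_b): weight}, treated as undirected
           (hypothetical paraphrase scores).
    seeds: {phrase: {translation: probability}} for phrases seen in training.
    Returns {phrase: {translation: probability}} for every phrase reached.
    """
    # Build a symmetric adjacency map from the weighted edge dictionary.
    neighbours = defaultdict(dict)
    for (a, b), weight in edges.items():
        neighbours[a][b] = weight
        neighbours[b][a] = weight

    # Seed phrases keep their distributions fixed throughout.
    labels = {phrase: dict(dist) for phrase, dist in seeds.items()}

    for _ in range(iterations):
        updated = dict(labels)
        for node, nbrs in neighbours.items():
            if node in seeds:
                continue
            # Weighted sum of the neighbours' current distributions.
            scores = defaultdict(float)
            for nbr, weight in nbrs.items():
                for translation, prob in labels.get(nbr, {}).items():
                    scores[translation] += weight * prob
            total = sum(scores.values())
            if total > 0:
                # Renormalise so the result is a probability distribution.
                updated[node] = {t: s / total for t, s in scores.items()}
        labels = updated
    return labels


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    toy_edges = {("procrastinate", "postpone"): 0.8,
                 ("procrastinate", "delay"): 0.2}
    toy_seeds = {"postpone": {"aplazar": 1.0},
                 "delay": {"retrasar": 0.5, "aplazar": 0.5}}
    print(propagate(toy_edges, toy_seeds)["procrastinate"])
```

In this sketch the OOV phrase "procrastinate" inherits candidate translations from its two paraphrases, weighted toward the stronger paraphrase link. A real system would additionally have to score and filter these candidates before adding them to the translation model.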

Document type: 
Thesis
Rights: 
This thesis may be printed or downloaded for non-commercial research and scholarly purposes. Copyright remains with the author.
Senior supervisor: 
Anoop Sarkar
Department: 
Applied Sciences: School of Computing Science
Thesis type: 
(Thesis) M.Sc.