Haffari, Gholam Reza

Resource type

Thesis

Thesis type

(Thesis) Ph.D.

Date created

2009

Authors/Contributors

Author: Haffari, Gholam Reza

Abstract

Statistical Machine Translation (SMT) models learn how to translate by examining a bilingual parallel corpus containing sentences aligned with their human-produced translations. However, high-quality translation output depends on the availability of massive amounts of parallel text in the source and target languages. Many languages are considered low-density, either because the population speaking the language is not very large or because, even when millions of people speak it, insufficient online resources are available in that language. This thesis covers machine learning approaches for statistical machine translation in such situations, where the amount of available bilingual data is limited. The problem of learning from insufficient labeled training data has been addressed in the machine learning community under two general frameworks: (i) semi-supervised learning and (ii) active learning. The complex nature of the machine translation task poses severe challenges to most of the algorithms developed in the machine learning community for these two learning scenarios. In this thesis, I develop semi-supervised learning and active learning algorithms to deal with the shortage of bilingual training data for the statistical machine translation task. This dissertation provides two approaches to this problem, unified in what is called the bootstrapping framework.

Copyright statement

Copyright is held by the author.

Scholarly level

Graduate student (PhD)

Language

English

Member of collection

Computing Science Theses

Download file	Size
ETD4896.pdf	1.35 MB

Machine learning approaches for dealing with limited bilingual data in statistical machine translation
