Sabharwal, Jasneet Singh

Resource type

Thesis

Thesis type

(Thesis) M.Sc.

Date created

2016-05-19

Authors/Contributors

Author (aut): Sabharwal, Jasneet Singh

Abstract

Bilingual language models (Bi-LMs) are language models estimated over both the source and target words of a parallel corpus. When translating a source sentence into a target language, the decoder in a phrase-based machine translation system breaks the source sentence into phrases and then translates each phrase into the target language. While decoding each phrase, the decoder has very little information about source words outside the phrase currently under consideration. Bi-LMs have been used to provide more information about those source words. Bi-LMs are estimated by first creating bitoken sequences from a parallel corpus and the word alignments between the source and target words in that corpus. Creating bitoken sequences expands the vocabulary considerably, and this large vocabulary increases the sparsity of the language models, which hurts Bi-LM performance. In previous work, bitokens were created after first replacing each word in the parallel corpus with either its part-of-speech tag or its word class, obtained with the Brown clustering algorithm. Both of these approaches only take into account words that are direct translations of each other, because they depend only on the word alignments between the source word and target word in each bitoken. In this thesis, we propose the use of bilingual word embeddings as a first step to reduce the vocabulary of the bitokens. Bilingual word embeddings are low-dimensional representations of words trained on a parallel corpus of aligned sentences in two languages. Using bilingual word embeddings to build Bi-LMs for machine translation is significantly better than the previous state of the art that uses Bi-LMs, with an increase of 1.4 BLEU points in our experiments.
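
To make the bitoken construction described in the abstract concrete, the following is a minimal sketch and not the thesis implementation. It assumes one common convention: one bitoken per target word, formed by joining that target word with the source words aligned to it. The function name `make_bitokens`, the `|` and `_` separators, the `NULL` placeholder for unaligned target words, and the class labels such as `C7` are all hypothetical choices made for illustration. The optional class dictionaries stand in for any vocabulary-reduction step (part-of-speech tags, Brown clusters, or classes derived from bilingual word embeddings).

```python
from collections import defaultdict


def make_bitokens(source, target, alignment, src_classes=None, tgt_classes=None):
    """Build one bitoken per target word (illustrative convention only).

    source, target: lists of word strings for one sentence pair.
    alignment: iterable of (source_index, target_index) pairs.
    src_classes, tgt_classes: optional dicts mapping a word to a coarser
        label; words missing from a dict are kept as-is.
    """
    # Group the aligned source positions by target position.
    aligned = defaultdict(list)
    for s_idx, t_idx in alignment:
        aligned[t_idx].append(s_idx)

    def label(word, classes):
        # Replace the word by its class label when a mapping is supplied.
        return classes.get(word, word) if classes else word

    bitokens = []
    for t_idx, t_word in enumerate(target):
        src_side = "_".join(
            label(source[s_idx], src_classes) for s_idx in sorted(aligned[t_idx])
        )
        # Unaligned target words get a placeholder source side (assumption).
        bitokens.append(f"{label(t_word, tgt_classes)}|{src_side or 'NULL'}")
    return bitokens


# Toy sentence pair with a one-to-one alignment (made-up example data).
source = ["das", "Haus", "ist", "klein"]
target = ["the", "house", "is", "small"]
alignment = [(0, 0), (1, 1), (2, 2), (3, 3)]

# Word-level bitokens: every distinct word pair becomes its own vocabulary item.
print(make_bitokens(source, target, alignment))

# Class-level bitokens: hypothetical class labels shrink the bitoken vocabulary.
src_classes = {"Haus": "C7", "klein": "C3"}
tgt_classes = {"house": "C7", "small": "C3"}
print(make_bitokens(source, target, alignment, src_classes, tgt_classes))
```

With word-level bitokens, the bitoken vocabulary can grow toward the number of distinct aligned word pairs, which is the sparsity problem the abstract describes. Mapping words to a smaller set of classes before joining them keeps the bitoken vocabulary bounded by the number of class pairs.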

Keywords

Identifier

etd9617

Copyright statement

Copyright is held by the author.

Permissions

This thesis may be printed or downloaded for non-commercial research and scholarly purposes.

Scholarly level

Graduate student (Masters)

Supervisor or Senior Supervisor

Thesis advisor (ths): Sarkar, Anoop

Thesis advisor (ths): Popowich, Fred

Member of collection

Computing Science Theses

Download file	Size
etd9617_JSabharwal.pdf	1.28 MB

Bilingual Language Models using Word Embeddings for Machine Translation
