Bilingual Language Models using Word Embeddings for Machine Translation

Date created: 
Word Embeddings
Bilingual Word Embeddings
Bilingual Language Models
Language Models
Machine Translation

Bilingual language models (Bi-LMs) are language models estimated over both the source and target words of a parallel corpus. When translating a source sentence into a target language, the decoder in a phrase-based machine translation system breaks the source sentence into phrases and translates each phrase into the target language. While decoding a phrase, the decoder has very little information about source words outside the phrase currently under consideration. Bi-LMs have been used to provide more information about these source words. Bi-LMs are estimated by first creating bitoken sequences from a parallel corpus and the word alignments between the source and target words in that corpus. Creating bitoken sequences expands the vocabulary considerably, which increases the sparsity of the language models and hurts Bi-LM performance. Previous work addressed this by first replacing each word in the parallel corpus with either its part-of-speech tag or its word class obtained with the Brown clustering algorithm, and then creating the bitokens. Both of these approaches take into account only words that are direct translations of each other, because they depend only on the word alignments between the source and target words in the bitokens. In this thesis, we propose the use of bilingual word embeddings as a first step to reduce the vocabulary of the bitokens. Bilingual word embeddings are low-dimensional representations of words trained on a parallel corpus of aligned sentences in two languages. In our experiments, Bi-LMs built using bilingual word embeddings are significantly better than the previous state-of-the-art Bi-LM approach for machine translation, with an increase of 1.4 BLEU points.
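
The bitoken construction described in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation: the function name `build_bitokens`, the bitoken format (a target label joined to the labels of its aligned source words), the `NULL` placeholder for unaligned target words, and the toy class dictionaries are all assumptions made for illustration. The class dictionaries stand in for any word-to-label mapping, such as part-of-speech tags, Brown clusters, or clusters derived from bilingual word embeddings.

```python
from collections import defaultdict


def build_bitokens(source, target, alignment, src_class=None, tgt_class=None):
    """Return one bitoken per target word (hypothetical format).

    alignment: iterable of (source_index, target_index) pairs.
    src_class / tgt_class: optional dicts mapping a word to a coarser label;
    words missing from a dict are kept as they are.
    """
    src_class = src_class or {}
    tgt_class = tgt_class or {}

    # Collect the source positions aligned to each target position.
    aligned = defaultdict(list)
    for s_idx, t_idx in alignment:
        aligned[t_idx].append(s_idx)

    bitokens = []
    for t_idx, t_word in enumerate(target):
        src_words = [source[i] for i in sorted(aligned[t_idx])]
        # Unaligned target words get a NULL source side (an assumption).
        src_labels = [src_class.get(w, w) for w in src_words] or ["NULL"]
        t_label = tgt_class.get(t_word, t_word)
        bitokens.append(t_label + "_" + "+".join(src_labels))
    return bitokens


# Toy sentence pair with a one-to-one word alignment.
source = ["das", "ist", "ein", "haus"]
target = ["this", "is", "a", "house"]
alignment = [(0, 0), (1, 1), (2, 2), (3, 3)]

# Word-level bitokens: every distinct word pair is a new vocabulary item.
word_bitokens = build_bitokens(source, target, alignment)

# Class-level bitokens: toy class labels (made up for this example).
src_class = {"das": "C1", "ein": "C1", "ist": "C2", "haus": "C3"}
tgt_class = {"this": "K1", "a": "K1", "is": "K2", "house": "K3"}
class_bitokens = build_bitokens(source, target, alignment, src_class, tgt_class)

print(word_bitokens)
print(class_bitokens)
print(len(set(word_bitokens)), len(set(class_bitokens)))
```

In this toy case the four distinct word-level bitokens collapse to three class-level bitokens, which is the kind of vocabulary reduction the abstract refers to. A language model would then be estimated over such bitoken sequences.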

Document type: 
This thesis may be printed or downloaded for non-commercial research and scholarly purposes. Copyright remains with the author.
Senior supervisor: 
Anoop Sarkar
Fred Popowich
Applied Sciences: School of Computing Science
Thesis type: (Thesis) M.Sc.