Bilingual language models (Bi-LMs) are language models estimated over both the source and target words of a parallel corpus. When translating a source sentence, the decoder in a phrase-based machine translation system breaks the sentence into phrases and translates each phrase into the target language. While decoding a phrase, the decoder has very little information about source words outside the phrase currently under consideration; Bi-LMs have been used to supply this wider source-side context. Bi-LMs are estimated by first creating bitoken sequences from a parallel corpus and the word alignments between its source and target words. Creating bitoken sequences expands the vocabulary considerably, and Bi-LMs suffer from this large vocabulary, which in turn increases the sparsity of the language models. In previous work, bitokens were created after first replacing each word in the parallel corpus either by its part-of-speech tag or by its word class obtained with the Brown clustering algorithm. Both approaches consider only words that are direct translations of each other, since they depend solely on the word alignments between the source and target words of the bitokens. In this thesis, we propose bilingual word embeddings as a first step toward reducing the bitoken vocabulary. Bilingual word embeddings are low-dimensional representations of words trained on a parallel corpus of aligned sentences in two languages. In our experiments, building Bi-LMs for machine translation from bilingual word embeddings improves on the previous state-of-the-art Bi-LM approach by 1.4 BLEU points.
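As a minimal sketch of the bitoken construction step described above, the following code pairs each target word with the source word(s) aligned to it. The pairing convention, the `NULL` token for unaligned target words, and the function name are illustrative assumptions, not the exact scheme used in the thesis:

```python
def make_bitokens(src_words, tgt_words, alignments):
    """Build a bitoken sequence from one aligned sentence pair.

    alignments: set of (src_index, tgt_index) pairs from a word aligner.
    Each target word is joined with its aligned source word(s);
    unaligned target words are paired with a NULL source token
    (one common convention, assumed here for illustration).
    """
    bitokens = []
    for j, tgt in enumerate(tgt_words):
        # Collect all source words aligned to target position j,
        # in source order.
        srcs = [src_words[i] for i, jj in sorted(alignments) if jj == j]
        src_side = "_".join(srcs) if srcs else "NULL"
        bitokens.append(f"{src_side}|{tgt}")
    return bitokens

src = ["das", "Haus", "ist", "klein"]
tgt = ["the", "house", "is", "small"]
align = {(0, 0), (1, 1), (2, 2), (3, 3)}
print(make_bitokens(src, tgt, align))
# ['das|the', 'Haus|house', 'ist|is', 'klein|small']
```

Because every distinct source-target pairing becomes its own token, the bitoken vocabulary grows roughly with the product of the observed pairings, which is the sparsity problem the thesis addresses by first mapping words to cluster labels or embedding-based classes.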
Copyright is held by the author.
This thesis may be printed or downloaded for non-commercial research and scholarly purposes.