Approaches to handle scarce resources for Bengali statistical machine translation

Resource type
Thesis type
(Thesis) Ph.D.
Date created
2010
Authors/Contributors
Author: Roy, Maxim
Abstract
Machine translation (MT) is a hard problem because of the highly complex, irregular and diverse nature of natural language. MT refers to computerized systems that use software to translate text from one natural language into another, with or without human assistance. It is impossible to accurately model all the linguistic rules and relationships that shape the translation process, so MT has to make decisions based on incomplete data. A principled way to handle this is to use statistical methods to make optimal decisions given incomplete data. Statistical machine translation (SMT) uses a probabilistic framework to automatically translate text from one language to another. Using the co-occurrence counts of words and phrases in bilingual parallel corpora, where sentences are aligned with their translations, SMT learns the translations of words and phrases. We apply SMT techniques to translation between Bengali and English. SMT systems require a significant amount of bilingual data for a language pair to achieve good translation accuracy. However, Bengali is a low-density language, and such resources are not available for it. In this thesis, we therefore investigate language-independent and language-dependent techniques that can help improve the translation accuracy of Bengali SMT systems. We explore a transliteration module and modules for handling prepositions and Bengali compound words in the context of Bengali-to-English SMT. Further, we look into semi-supervised and active learning techniques in Bengali SMT to deal with scarce resources. Because Bengali and English have different word orders, we also propose different syntactic phrase reordering techniques for Bengali SMT. Finally, we contribute to Bengali SMT by creating a new test set and lexicon and by developing Bengali text processing tools such as a tokenizer, sentence segmenter, and morphological analyzer.
Overall, the main objective of this thesis is to contribute to Bengali language processing, provide a general foundation for conducting research in Bengali SMT, and improve the quality of Bengali SMT.
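As a rough illustration of the co-occurrence idea the abstract describes, the Python sketch below counts how often source and target words appear in the same aligned sentence pair and turns those counts into relative-frequency translation scores. It is not the thesis's method: the toy romanized sentence pairs, the function names `cooccurrence_counts` and `translation_table`, and the simple normalization are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus of sentence-aligned pairs (romanized, illustrative only;
# not taken from the thesis or its data).
PARALLEL = [
    ("ami bhat khai", "i eat rice"),
    ("tumi bhat khao", "you eat rice"),
    ("ami boi pori", "i read books"),
]


def cooccurrence_counts(pairs):
    """Count, for each source word, the target words seen in the same sentence pair."""
    counts = defaultdict(Counter)
    for src, tgt in pairs:
        # Use sets so each word pair is counted at most once per sentence pair.
        for f in set(src.split()):
            for e in set(tgt.split()):
                counts[f][e] += 1
    return counts


def translation_table(counts):
    """Normalize the counts per source word into relative frequencies."""
    table = {}
    for f, target_counts in counts.items():
        total = sum(target_counts.values())
        table[f] = {e: n / total for e, n in target_counts.items()}
    return table


counts = cooccurrence_counts(PARALLEL)
table = translation_table(counts)

# Print the highest-scoring target word for each source word.
for f in sorted(table):
    best = max(table[f], key=table[f].get)
    print(f, "->", best, round(table[f][best], 3))
```

With so little data, raw co-occurrence leaves many words ambiguous (here "bhat" co-occurs equally often with "eat" and "rice"), which hints at why scarce parallel data is a problem. Practical SMT systems typically refine such counts with iterative word-alignment models rather than using them directly.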
Document
Copyright statement
Copyright is held by the author.
Permissions
The author has not granted permission for the file to be printed nor for the text to be copied and pasted. If you would like a printable copy of this thesis, please contact summit-permissions@sfu.ca.
Scholarly level
Language
English
Member of collection
Download file: etd5938.pdf
Size: 947.95 KB
