Machine translation for non-English named entity recognition

Date created
2012-08-15
Authors/Contributors
Abstract
Parallel corpora, often exploited for machine translation, have recently been used for monolingual purposes. Annotation projection is a technique that borrows annotations from resource-rich languages for resource-scarce languages, using parallel corpora and word alignment to transfer the annotations. It was introduced as an alternative to the tedious and time-consuming task of building hand-annotated corpora for new languages. The technique is especially effective for semantic annotations such as named entities, since they are less affected by translation. In this work we test the applicability of annotation projection to named entity recognition (NER) through two paradigms: one generates new German data and annotates it using English annotated data; the other adds new annotations to existing German text and uses them as training features. We combine machine translation with annotation projection, which not only removes the restriction to parallel corpora and broadens the methodology, but also allows the use of monolingual hand-annotated corpora, relieving the bottleneck of English-side annotation quality. We develop four training corpora by applying the two paradigms to two different corpora: parallel and singular. We train an NER model on each corpus and compare its quality with a baseline. The results show that the projected annotations can be noisy and inconsistent. Using them as target annotations therefore reduces corpus and model quality, whereas using them as features alongside the original annotations significantly improves quality.
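The projection step described in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: the function name `project_annotations`, the simple PER/LOC/O tag set, the example sentence pair, and the hand-written word alignment are all assumptions made for illustration. In practice the alignment would come from an automatic word aligner and would be noisy, which is consistent with the inconsistency the abstract reports.

```python
def project_annotations(source_tags, alignment, target_length, outside="O"):
    """Copy each source token's entity tag to the target tokens aligned with it.

    source_tags   -- one tag per source (English) token
    alignment     -- (source_index, target_index) pairs; hypothetical here,
                     normally produced by an automatic word aligner
    target_length -- number of tokens in the target (German) sentence
    """
    # Target tokens without an aligned entity token stay outside any entity.
    target_tags = [outside] * target_length
    for src_idx, tgt_idx in alignment:
        tag = source_tags[src_idx]
        if tag != outside:
            target_tags[tgt_idx] = tag
    return target_tags


# Illustrative sentence pair (assumed, not taken from the thesis corpora).
english = ["Yesterday", "Angela", "Merkel", "visited", "Paris"]
english_tags = ["O", "PER", "PER", "O", "LOC"]
german = ["Gestern", "besuchte", "Angela", "Merkel", "Paris"]

# Hand-written alignment: English token index -> German token index.
alignment = [(0, 0), (1, 2), (2, 3), (3, 1), (4, 4)]

german_tags = project_annotations(english_tags, alignment, len(german))
for token, tag in zip(german, german_tags):
    print(token, tag)
```

Under the first paradigm the projected tags would serve as the German training labels; under the second they would be attached to already annotated German text as an extra feature column.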
Document
Identifier
etd7355
Copyright statement
Copyright is held by the author.
Permissions
The author granted permission for the file to be printed and for the text to be copied and pasted.
Scholarly level
Member of collection
Attachment Size
etd7355_MSaghaei.pdf 992.68 KB