Training Data Annotation for Segmentation Classification in Simultaneous Translation

Date created: 
2016-05-09
Identifier: 
etd9597
Keywords: 
Simultaneous Translation
Machine Translation
Segmentation
Abstract: 

Segmenting the incoming speech stream and translating the segments incrementally is a commonly used technique for reducing latency in spoken language translation. Previous work by Oda et al. (2014) [1] explored creating training data for segmentation by finding segments that maximize translation quality subject to a user-defined bound on segment length. In this work, we provide a new algorithm that uses Pareto-optimality to find good segment boundaries that balance the trade-off between latency and translation quality. We compare against the state-of-the-art greedy algorithm from Oda et al. (2014). Our experimental results show that we can improve latency by up to 12% without harming the BLEU score for the same average segment length. Another benefit is that, for any segment size, Pareto-optimal segments optimize both latency and translation quality.
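
The notion of Pareto-optimality used in the abstract can be illustrated with a minimal sketch. This is not the thesis's algorithm; it only shows, under assumptions, how non-dominated points are selected when each candidate segmentation is summarized by a (latency, quality) pair. The function name `pareto_front` and the numeric values are hypothetical and chosen purely for illustration.

```python
from typing import List, Tuple

# Hypothetical representation: each candidate segmentation is summarized as
# (latency, quality). Lower latency is better; higher quality is better.
Candidate = Tuple[float, float]


def pareto_front(candidates: List[Candidate]) -> List[Candidate]:
    """Return the candidates that no other candidate dominates."""
    # Sort by latency ascending; break ties by quality descending so the
    # best point at each latency value is seen first.
    ordered = sorted(set(candidates), key=lambda c: (c[0], -c[1]))
    front: List[Candidate] = []
    best_quality = float("-inf")
    for latency, quality in ordered:
        # Keep a point only if it beats every lower-latency point on quality.
        if quality > best_quality:
            front.append((latency, quality))
            best_quality = quality
    return front


if __name__ == "__main__":
    # Made-up (latency, quality) pairs, for illustration only.
    points = [(3.0, 0.30), (2.0, 0.25), (2.5, 0.24), (4.0, 0.30), (1.0, 0.10)]
    print(pareto_front(points))
```

In this sketch, a candidate is dropped whenever another candidate has latency that is no worse and quality that is no worse, with at least one of the two strictly better. The points that remain trace the trade-off curve between latency and translation quality.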

Document type: 
Thesis
Rights: 
This thesis may be printed or downloaded for non-commercial research and scholarly purposes. Copyright remains with the author.
Senior supervisor: 
Anoop Sarkar
Department: 
Applied Sciences: School of Computing Science
Thesis type: 
(Thesis) M.Sc.