SCALE 2015

Speech-to-text translation for low-resource languages

Automated translation of human speech has been a long-term research goal: President Clinton mentioned it in his 2000 State of the Union address, and more recently, Skype and Google announced prototypes in January 2015. Spoken language differs significantly from the high-quality, grammatical inputs typically provided to machine translation (MT) systems; difficulties include pauses, disfluencies, corrections, interruptions, repetitions, and restarts. Speech-to-text translation (STTT) thus suffers from problems intrinsic to speech recognition, difficulties inherent in translation, and problems that arise from the composition of the two.

This workshop investigated methods to develop and improve an integrated STTT pipeline. More than one language pair was studied, both to avoid language-specific solutions and to explore the effect of differently resourced languages (for example, lower-resource languages with higher speech recognition error rates). Multiple approaches to the problem were considered, and candidate research topics included domain adaptation for training MT systems on speech recognition output; supplementing translation training data; leveraging automatic speech recognition (ASR) lattices; and evaluating downstream metrics that model different usage scenarios. We expected these efforts to produce important baselines and a comprehensive quantitative and qualitative understanding of the problems of low-resource speech-to-text translation.


Chinese Entity Discovery and Linking

Names written in Chinese tend to be hard to identify, highly ambiguous, and difficult to map to their corresponding English names. SCALE 2015 aimed to advance the state of the art in discovering and linking entity mentions across multiple documents, and in tying those mentions to entries in a knowledge base. Entity discovery and linking (EDL) systems use a variety of analytics (such as named entity recognition and relation identification) to identify the referent of a given entity mention. SCALE 2015 focused on EDL over Chinese text, as described in the NIST Text Analysis Conference (TAC) Chinese Entity Discovery and Linking track. The goal of the summer effort was to improve the accuracy of the individual underlying analytics, thereby increasing performance on the overall TAC EDL task. Given interest from SCALE participants, the SCALE 2015 system could be submitted as an entry in the upcoming NIST TAC Tri-lingual EDL evaluation.

Human Language Technology Center of Excellence