SCALE 2023

Translation of Conversational Speech

June 12th to August 11th

**********************************************************************************************************

Speech translation has been a holy grail of human language technology for several decades; it was specifically called out by Bill Clinton in the 2000 State of the Union Address: “Soon researchers will bring us devices that can translate foreign languages as fast as you can talk.” Recent advances in deep learning have finally produced significant improvements in this technology, but research has concentrated on two scenarios: (a) translation of public speeches in formal settings (e.g., TED Talks, parliamentary speeches) and (b) real-time interpretation.

SCALE’23 will focus on translation of informal conversational speech. In contrast to prepared addresses, interpersonal conversations are challenging to translate because of differences in acoustics, disfluencies, speaking style, vocabulary, and dialectal variation. Whereas real-time interpretation makes minimizing decoding latency a key objective, we plan to focus on improving translation accuracy.

Our goals at the workshop are to understand how current state-of-the-art technologies perform on translation of conversational speech, to identify the open areas of research unique to this scenario, and to advance research in this field. Research areas that we will explore include:

  • End-to-end architectures for neural speech translation (e.g., using toolkits like ESPnet) when 100+ hours of foreign speech with transcriptions and English translations are available. We have such data for some languages, such as the Fisher Spanish translations and the IWSLT 2022 Tunisian Arabic shared task. (A toy model of this kind is sketched after this list.)
  • Adapting cascaded systems, which perform speech recognition and machine translation as separate steps, to improve the resulting translations. This matters in the common case where training data exists for the component technologies but falls short of what a successful end-to-end approach needs. (See the cascade sketch below.)
  • Leveraging pretrained multimodal and multilingual representations (e.g., wav2vec 2.0, data2vec, mBART); a feature-extraction sketch follows the list.
  • Data augmentation techniques, including synthetically generated audio, synthetically generated translations, and semi-supervised methods (a speed-perturbation sketch follows the list).
  • Translation into English text as the primary target; however, research on multilingual / multisource models is also of interest.
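
As a concrete illustration of the end-to-end research area, here is a deliberately minimal PyTorch model: a convolutional subsampler over filterbank frames feeding a Transformer that decodes English tokens directly, with no intermediate source-language transcript. This is a toy sketch of the architecture class explored in toolkits like ESPnet, not ESPnet's implementation; all sizes and names are illustrative assumptions.

    import torch
    import torch.nn as nn

    class ToySpeechTranslator(nn.Module):
        """Toy end-to-end speech translation: audio features in, English tokens out."""

        def __init__(self, n_mels=80, d_model=256, vocab_size=8000):
            super().__init__()
            # Convolutions subsample ~4x in time, since speech frames far
            # outnumber target-text tokens.
            self.subsample = nn.Sequential(
                nn.Conv1d(n_mels, d_model, 5, stride=2, padding=2), nn.ReLU(),
                nn.Conv1d(d_model, d_model, 5, stride=2, padding=2), nn.ReLU(),
            )
            self.transformer = nn.Transformer(
                d_model=d_model, num_encoder_layers=4,
                num_decoder_layers=4, batch_first=True,
            )
            self.embed = nn.Embedding(vocab_size, d_model)
            self.out = nn.Linear(d_model, vocab_size)

        def forward(self, mels, tokens):
            # mels: (batch, frames, n_mels); tokens: (batch, length)
            src = self.subsample(mels.transpose(1, 2)).transpose(1, 2)
            mask = self.transformer.generate_square_subsequent_mask(tokens.size(1))
            hidden = self.transformer(src, self.embed(tokens), tgt_mask=mask)
            return self.out(hidden)  # per-position English-vocabulary logits

With on the order of 100 hours of paired audio and translations, a model like this is typically trained with cross-entropy loss against reference English token sequences.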
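
For the cascaded approach, the sketch below chains two off-the-shelf components with Hugging Face transformers pipelines. The checkpoint names (a Whisper ASR model and a Spanish-to-English MT model) are illustrative assumptions, and a real cascade would add steps such as punctuation restoration and disfluency removal between recognition and translation.

    from transformers import pipeline

    # Step 1: recognize source-language (here, Spanish) speech.
    asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")

    # Step 2: translate the ASR hypothesis into English.
    mt = pipeline("translation", model="Helsinki-NLP/opus-mt-es-en")

    def cascade_translate(audio_path):
        # ASR errors propagate into the MT step, the central weakness
        # that adapting cascaded systems tries to mitigate.
        transcript = asr(audio_path)["text"]
        return mt(transcript)[0]["translation_text"]

    print(cascade_translate("call_segment.wav"))  # hypothetical audio file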
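
One common way to leverage a pretrained representation is to extract frozen wav2vec 2.0 features for a downstream translation model to consume, as sketched below; the checkpoint and file names are assumptions.

    import torch
    import torchaudio
    from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

    extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
    encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")

    waveform, sr = torchaudio.load("call_segment.wav")  # hypothetical file
    # wav2vec 2.0 expects 16 kHz mono input.
    mono = torchaudio.functional.resample(waveform, sr, 16_000).mean(dim=0)

    inputs = extractor(mono.numpy(), sampling_rate=16_000, return_tensors="pt")
    with torch.no_grad():
        features = encoder(**inputs).last_hidden_state  # (1, frames, 768)

Fine-tuning the pretrained encoder jointly with the translation model usually outperforms frozen features, at higher compute cost.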
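
On the audio side of data augmentation, a standard trick from speech recognition recipes is speed perturbation: resample the waveform and relabel its sampling rate, which lengthens or shortens utterances (shifting pitch with them) and cheaply multiplies the training audio. The factors below follow common practice; the file name is an assumption.

    import torchaudio
    import torchaudio.functional as F

    waveform, sr = torchaudio.load("call_segment.wav")  # hypothetical file
    augmented = []
    for factor in (0.9, 1.0, 1.1):
        # Resampling to sr/factor samples per second and then treating the
        # result as `sr` Hz scales playback speed by `factor`.
        perturbed = F.resample(waveform, orig_freq=sr, new_freq=int(sr / factor))
        augmented.append((perturbed, sr))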

Human Language Technology Center of Excellence