SCALE 2020

Speech Workshop:

End-to-End Automatic Speech Recognition

June 8th-August 7th, 2020 (9 weeks), Baltimore, Maryland


Task Description

Over the past decade, the application of deep learning to automatic speech recognition (ASR) has resulted in significantly more accurate and robust systems. Most top-performing systems depend on a combination of deep learning and traditional techniques (DNN+HMM), but these systems are highly complex and specialized. New research approaches utilizing end-to-end ASR models (e.g., DeepSpeech, RNN-Transducer, Jasper, and Espresso-Attention) provide promising performance with simplified training and deployment pipelines. However, moving technology from limited research scenarios to more realistic application settings often presents new challenges. Therefore, the goal of this workshop is to improve end-to-end ASR performance on informal conversational speech in multiple languages and acoustic environments. This will require work on specific sub-topics including:

End-to-End ASR Improvement

A key goal is to incorporate the latest external research techniques to improve performance for non-English languages with relatively high training resources (100-1500 hours). This topic includes investigation of performance issues in each language, and development of improved algorithms, architectures, and recipes to increase accuracy.

Training Simplification

Customized approaches per language make it difficult to retrain ASR models for new domains and languages. Another research area is to develop simpler, unified recipes that allow retraining of systems with a straightforward, standardized pipeline that can match the performance of custom solutions.

Multilingual Models

Training a single multilingual model provides an even more attractive solution to the retraining problem. This may also allow a graceful solution when multiple languages are used in an audio file, or in the presence of code-switching.

Text Search

A major application of ASR is the ability to perform text searches on audio documents. Probabilistic lattices output by DNN+HMM systems provide significant benefits when performing keyword and phrase search, but current end-to-end research systems do not produce such lattices. This research area will develop methods to improve performance of text search with end-to-end systems, including approaches such as lattice structures, N-best transcripts, query expansion, and named entity recognition.
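
As a rough illustration of how N-best transcripts can stand in for a lattice during keyword search, the sketch below converts hypothesis log-scores into normalized posteriors and sums the posteriors of the hypotheses that contain the query word. The function name, the input layout (a list of transcript and log-score pairs), and the example data are assumptions made for illustration; they are not taken from any particular system.

```python
import math
from typing import List, Tuple


def keyword_posterior(nbest: List[Tuple[str, float]], keyword: str) -> float:
    """Estimate the probability that `keyword` was spoken, from an N-best list.

    `nbest` holds (transcript, log_score) pairs (hypothetical layout).
    Scores are normalized with a softmax over the list, so the result is
    relative to the hypotheses present, not a calibrated probability.
    """
    if not nbest:
        return 0.0
    # Subtract the maximum log-score before exponentiating for stability.
    max_score = max(score for _, score in nbest)
    weights = [math.exp(score - max_score) for _, score in nbest]
    total = sum(weights)
    # Sum the weight of every hypothesis that contains the keyword as a word.
    hit = sum(
        weight
        for (text, _), weight in zip(nbest, weights)
        if keyword in text.split()
    )
    return hit / total


# Toy N-best list with equal scores (made-up data).
example_nbest = [
    ("call baltimore now", -1.0),
    ("call baltimore no", -1.0),
    ("tall ball more now", -1.0),
]
baltimore_score = keyword_posterior(example_nbest, "baltimore")
```

A keyword that appears only in lower-ranked hypotheses still receives a non-zero score, which is the main advantage over searching the single best transcript.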

Language Models

For flexible applications, it is important that external language models can be applied to end-to-end ASR systems. This area will focus on training and incorporating language models, both traditional N-gram and neural, as well as lexicon updates, and their effect on word error rate and text search performance.
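
One common way to apply an external language model to end-to-end output is to rescore an N-best list with a weighted sum of the ASR score and the language model score. The sketch below is a minimal version of that idea under stated assumptions: the function names, the interpolation weight, and the toy unigram model are all hypothetical and serve only to show the shape of the computation.

```python
from typing import Callable, Dict, List, Tuple


def rescore_nbest(
    nbest: List[Tuple[str, float]],
    lm_logprob: Callable[[str], float],
    lm_weight: float = 0.5,
) -> List[Tuple[str, float]]:
    """Re-rank (transcript, asr_log_score) pairs with an external LM.

    Combined score = asr_log_score + lm_weight * lm_logprob(transcript).
    Returns the list sorted from best to worst combined score.
    """
    rescored = [
        (text, asr_score + lm_weight * lm_logprob(text))
        for text, asr_score in nbest
    ]
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)


# Toy unigram LM (made-up log-probabilities); unseen words get a floor value.
TOY_UNIGRAMS: Dict[str, float] = {"recognize": -1.0, "speech": -1.0}
UNKNOWN_LOGPROB = -5.0


def toy_lm_logprob(text: str) -> float:
    return sum(TOY_UNIGRAMS.get(word, UNKNOWN_LOGPROB) for word in text.split())


# The acoustically preferred hypothesis is listed first with the higher score.
hypotheses = [("wreck a nice beach", -3.5), ("recognize speech", -4.0)]
reranked = rescore_nbest(hypotheses, toy_lm_logprob, lm_weight=0.5)
```

In this toy case the language model overturns the ASR ranking. In practice the weight would be tuned on held-out data against word error rate or search performance.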

Corpora and Metrics

Experimental validation is a key part of a SCALE workshop. This work will be evaluated on informal conversational speech from about five non-English languages. Baseline performance numbers will be generated from existing Kaldi recipes and an initial end-to-end system, and the performance metrics will be word error rate and text search performance.
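
For reference, word error rate is conventionally computed as the word-level edit distance (substitutions, deletions, and insertions) between a hypothesis and a reference transcript, divided by the number of reference words. The sketch below is a minimal implementation of that definition; it assumes whitespace tokenization and applies no text normalization, both of which a real scoring setup would need to specify.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the"): 2 errors / 6 words.
example_wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
```

Because insertions count as errors, the value can exceed 1.0 when the hypothesis is much longer than the reference.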


For Additional Information: 

Contact us at [email protected]


Human Language Technology Center of Excellence