SCALE 2018

Problem (1 of 2):

Resilient Machine Translation for New Domains

June 11 – August 10

Machine translation (MT) is an important application in our
increasingly connected world. Rapid advances in artificial
intelligence have led to dramatic improvements in data-driven MT
systems (i.e., statistical and neural MT). When an adequate quantity of
training data is available, systems can now produce translations of
sufficient accuracy to be usable in many practical scenarios.
However, the performance of any data-driven system critically depends
on whether the characteristics of its training data match the
characteristics of the data the system encounters in everyday use.

During the summer, we plan to investigate translation in three
scenarios where domain adaptation is needed: the informal language
common in social media; technical documents such as patents and
scientific articles; and noisy text output by optical character
recognition (OCR) systems. Each of these domains poses important
vocabulary adaptation challenges: scientific documents contain
unknown technical jargon, while social media and OCR results contain
misspellings or variants of known vocabulary. Techniques we will
explore for increasing robustness include monolingual word
embeddings, adaptive data filtering, bilingual dictionary mining,
and models of lexical variation. We will also compare the performance
of statistical and neural systems on conditions that vary the amount
of available data, the degree of domain mismatch, and approaches to

Problem (2 of 2):

Multilingual Text Search in Image Collections

June 11 – August 10

Recent advances in computer vision have led to novel architectures
based on deep learning that can dramatically improve performance of
optical character recognition (OCR). In SCALE’18 we will investigate
keyword and key-phrase search in multilingual image corpora. Key
avenues of exploration include core aspects of the OCR pipeline (text
localization, script/language identification, and decoding) and
retrieval techniques, as well as topics such as generation and use of
synthetic data and domain adaptation in low-resource conditions.

Our first goal is to improve accuracy in each of the core OCR
components. Second, we will demonstrate and evaluate the full OCR
processing pipeline in several languages; the primary extrinsic
evaluation will be based on keyword search, but other metrics will be
investigated as well. Additionally, we will explore a combined OCR and
machine translation pipeline.



Human Language Technology Center of Excellence