Machine translation (MT) is an important application in our
increasingly connected world. Rapid advances in artificial
intelligence have led to dramatic improvements in data-driven MT
systems (i.e. statistical and neural MT). When an adequate quantity of
training data is available, systems can now produce translations of
sufficient accuracy to be usable in many practical scenarios.
However, the performance of any data-driven system critically depends
on whether the characteristics of its training data match the
characteristics of the data the system encounters in everyday use.
During the summer of 2018, we investigated translation in three
scenarios where domain adaptation is needed: the informal language
common in social media; technical documents such as patents and
scientific articles; and noisy text output by optical character
recognition (OCR) systems. Each of these domains poses important
vocabulary adaptation challenges — scientific documents contain
unknown technical jargon, while social media and OCR results contain
misspellings or variants of known vocabulary. Among the techniques
for increasing robustness, we explored monolingual word embeddings,
adaptive data filtering, bilingual dictionary mining, and models of
lexical variation. We also compared the performance
of statistical and neural systems on conditions that vary the amount
of available data, the degree of domain mismatch, and approaches to
adaptation.
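One of these techniques, bilingual dictionary mining, can be illustrated with a small sketch. Assuming the source- and target-language word embeddings have already been mapped into a shared space (the mapping step itself is omitted), candidate translation pairs are mined by nearest-neighbor search under cosine similarity. The vocabulary and vector values below are toy illustrations, not data or code from the workshop.

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sqrt(sum(a * a for a in u))
    nv = sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def mine_dictionary(src_emb, tgt_emb, threshold=0.9):
    """Pair each source word with its nearest target word by cosine
    similarity, keeping only pairs above a confidence threshold."""
    pairs = []
    for s_word, s_vec in src_emb.items():
        best_word, best_sim = None, -1.0
        for t_word, t_vec in tgt_emb.items():
            sim = cosine(s_vec, t_vec)
            if sim > best_sim:
                best_word, best_sim = t_word, sim
        if best_sim >= threshold:
            pairs.append((s_word, best_word, round(best_sim, 3)))
    return pairs

# Toy embeddings, assumed already projected into a shared space.
src = {"chat": [0.9, 0.1, 0.0], "chien": [0.1, 0.9, 0.1]}
tgt = {"cat": [0.88, 0.12, 0.02], "dog": [0.12, 0.91, 0.08],
       "tree": [0.0, 0.1, 0.95]}

print(mine_dictionary(src, tgt))
```

A real system would learn the cross-lingual mapping from a seed dictionary and use approximate nearest-neighbor search over full vocabularies; the threshold trades dictionary coverage against precision.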
Recent advances in computer vision have led to novel architectures
based on deep learning that can dramatically improve performance of
optical character recognition (OCR). In SCALE’18 we investigated
keyword and key-phrase search in multilingual image corpora. Key
avenues of exploration included core aspects of the OCR pipeline (text
localization, script/language identification, and decoding) and
retrieval techniques, as well as topics such as generation and use of
synthetic data and domain adaptation in low-resource conditions.
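The three core stages named above compose into a single pipeline; the following sketch shows that structure only, with stubs standing in for the real localization, identification, and decoding models. All names and the "image" format are illustrative assumptions, not the workshop's actual code.

```python
def localize(image):
    """Text localization: find regions likely to contain text.
    Stub: a real system would run a detection model over pixels."""
    return image["regions"]

def identify_script(region):
    """Script/language identification for one localized region.
    Stub: a keyed lookup in place of a trained classifier."""
    return region.get("script", "Latin")

def decode(region, script):
    """Decoding: transcribe the region into text, conditioned on the
    identified script. Stub: returns pre-attached text."""
    return region["text"]

def ocr_pipeline(image):
    """Run localization, script identification, and decoding in order."""
    results = []
    for region in localize(image):
        script = identify_script(region)
        results.append((script, decode(region, script)))
    return results

# Toy "image" with two pre-annotated regions (illustrative only).
page = {"regions": [
    {"text": "hello", "script": "Latin"},
    {"text": "привет", "script": "Cyrillic"},
]}
print(ocr_pipeline(page))
```

Keeping the stages behind narrow interfaces like this is what makes it practical to swap in improved components, or synthetic-data-trained models, one at a time.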
Our first goal was to improve accuracy in each of the core OCR
components. Second, we planned to demonstrate and evaluate the full
OCR processing pipeline in several languages; the primary extrinsic
evaluation was based on keyword search, but other metrics were
investigated as well. Additionally, we explored a combined OCR and
machine translation pipeline.
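Keyword search over OCR output has to tolerate character-level recognition errors. One common approach, sketched here under that assumption rather than as the workshop's implementation, is to match each query term against OCR tokens within a small edit-distance budget.

```python
def edit_distance(a, b):
    """Levenshtein distance via dynamic programming over two rows."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def keyword_search(keyword, ocr_tokens, max_dist=1):
    """Return OCR tokens within max_dist edits of the query keyword."""
    return [t for t in ocr_tokens
            if edit_distance(keyword.lower(), t.lower()) <= max_dist]

# OCR output with typical confusions (l -> 1, m -> n); illustrative text.
tokens = ["patent", "c1aim", "invention", "clain"]
print(keyword_search("claim", tokens))
```

The budget `max_dist` trades recall against precision: a larger budget recovers more corrupted hits but admits more spurious matches, which is exactly the tension an extrinsic keyword-search evaluation measures.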