SCALE 2016


Existing research in Computer Assisted Translation (CAT) has aimed to increase the efficiency of human translators in their workflows. Computer Assisted Discovery Extraction and Translation (CADET), proposed here, expands this vision to include the discovery of relevant material and the extraction of useful information for downstream consumption.  For example, a multilingual business consultant working with large collections of international data might be expected to discover which documents are of greatest relevance to a client, then find, summarize, and translate the most important pieces of information into a text document or spreadsheet compiled in English.  SCALE 2013 (“Report Linking”) focused on this sort of knowledge worker (such as someone compiling information to write a Wikipedia page); the focus there was on information extraction for finding connections (“links”) between facts expressed redundantly across multiple documents.  In this SCALE we evaluated the tasks of Discovery, Extraction, and Translation together as a joint process, and explored how automated HLT systems can measurably improve human efficiency.  That efficiency was measured in terms of how much human feedback is required to find sentences bearing interesting material (such as what might be extracted to populate a Wikipedia infobox).

Two contexts were considered: (1) High-Resource (HR) scenarios, where our focus is on allowing a user to incrementally correct and extend system capabilities; and (2) Low-Resource (LR) scenarios, with the aim of rapidly bootstrapping from little or no existing capability. Setting (2) aligns by design with the goals of the DARPA LORELEI program. To further that connection, our experiments focused primarily on Russian, as it is morphologically rich and served as an example language during Year 1 of LORELEI.


Knowledge-Rich Statistical Machine Translation

The dominant research and commercial approach to machine translation is the statistical or “data-driven” approach, where statistical systems are trained on large numbers of example translations. This paradigm has largely supplanted the historically dominant “knowledge-rich” approach, which employs human domain experts to manually craft translation rules. These latter systems continue to be used, particularly in low-resource settings (where little to no parallel data exists), for morphologically rich languages, and in specialized domains. A key difference between the approaches is in their treatment of linguistic phenomena. Knowledge-rich systems construct rich morpho-syntactic representations of the input and output languages, in contrast to data-driven systems, which eschew linguistics in favor of raw statistics over symbol mappings.
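To make the data-driven paradigm concrete, the core objects such systems learn from parallel data are phrase pairs licensed by a word alignment. The sketch below is a simplified version of the standard consistency-based extraction heuristic from phrase-based SMT (it omits the extension to unaligned boundary words that full implementations perform); the function name and data layout are illustrative, not from any particular toolkit.

```python
def extract_phrase_pairs(src, tgt, alignment, max_len=4):
    """Enumerate phrase pairs consistent with a word alignment.

    `src` and `tgt` are token lists; `alignment` is a set of
    (src_index, tgt_index) pairs. A phrase pair is kept only if no
    alignment point links a word inside one span to a word outside
    the other (the standard consistency condition).
    """
    pairs = []
    for i1 in range(len(src)):
        for i2 in range(i1, min(i1 + max_len, len(src))):
            # Target positions linked to the source span [i1, i2].
            linked = [j for (i, j) in alignment if i1 <= i <= i2]
            if not linked:
                continue  # require at least one alignment point
            j1, j2 = min(linked), max(linked)
            if j2 - j1 >= max_len:
                continue
            # Consistency: no alignment point may link the target span
            # to a source word outside [i1, i2].
            if any(j1 <= j <= j2 and not (i1 <= i <= i2)
                   for (i, j) in alignment):
                continue
            pairs.append((" ".join(src[i1:i2 + 1]),
                          " ".join(tgt[j1:j2 + 1])))
    return pairs
```

For example, with source ["das", "Haus"], target ["the", "house"], and alignment {(0, 0), (1, 1)}, the function yields the pairs ("das", "the"), ("Haus", "house"), and ("das Haus", "the house").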

This SCALE’16 task investigated an integrated hybrid approach to statistical machine translation along two lines. The first was to extract from knowledge-rich systems phrase pairs and hierarchical rules suitable for use in modern statistical translation engines. Concurrently, we looked at ways of extending the statistical system with information from linguistic analyses produced by the knowledge-rich systems. Our evaluation investigated a range of typologically diverse language pairs, domains, and resource settings; furthermore, it encompassed not only standard MT metrics such as BLEU, but also a document-based metric designed to gauge reader comprehension.
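For readers unfamiliar with the metric mentioned above, BLEU scores a candidate translation by its modified n-gram precision against a reference, scaled by a brevity penalty. The following is a minimal single-reference, sentence-level sketch of the formula; production evaluations use corpus-level BLEU with smoothing (e.g., via the sacrebleu package), not this toy version.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU with a single reference (illustrative only)."""
    precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        # Clipped n-gram matches: each candidate n-gram counts at most
        # as often as it appears in the reference.
        overlap = sum(min(c, ref[g]) for g, c in cand.items())
        total = sum(cand.values())
        if total == 0 or overlap == 0:
            return 0.0  # unsmoothed: any zero precision zeroes the score
        precisions.append(overlap / total)
    # Brevity penalty discourages overly short candidates.
    bp = (1.0 if len(candidate) >= len(reference)
          else math.exp(1 - len(reference) / len(candidate)))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)
```

A candidate identical to its reference scores 1.0; a candidate sharing no n-grams with the reference scores 0.0.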

Human Language Technology Center of Excellence