The SCALE (Summer Camp for Applied Language Exploration) workshop, hosted each summer by the center, explores topics in human language technology. The workshop brings together top researchers from academia and industrial research labs, along with undergraduate and graduate students from our summer internship program, to work together for several weeks. Topics for each year's SCALE are announced in the fall.
SCALE 2016: Two workshops
Existing research in Computer Assisted Translation (CAT) has aimed to increase the efficiency of human translators in their workflows. Computer Assisted Discovery Extraction and Translation (CADET), proposed here, will expand this vision to include the discovery of relevant material, and the extraction of useful information for downstream consumption. For example, a multilingual business consultant working with large collections of international data could be expected to discover which particular documents are of greatest relevance to a client, then find, summarize and translate the most important pieces of information into a text document or spreadsheet compiled in English. SCALE 2013 "Report Linking" focused on this sort of knowledge worker (such as someone compiling information for writing a Wikipedia page); the focus there was on information extraction for finding connections ("links") between facts expressed redundantly across multiple documents. In this SCALE we will consider the tasks of Discovery, Extraction, and Translation together as a joint process, exploring how automated HLT systems can measurably improve human efficiency. This efficiency will be measured in terms of how much human feedback is required to find sentences bearing interesting material (such as what might be extracted to populate a Wikipedia Infobox).
Two contexts will be considered: (1) High-Resource (HR) scenarios, where our focus is on allowing a user to incrementally correct and extend system capabilities; and (2) Low-Resource (LR) scenarios, with the aim to rapidly bootstrap from little or no existing capability. Setting (2) aligns by design with goals of the DARPA LORELEI program. To further that connection, our experiments will focus primarily on Russian, as it is morphologically rich and an example language during Year 1 of LORELEI.
Knowledge-Rich Statistical Machine Translation
The dominant research and commercial approach to machine translation is the statistical or "data-driven" approach, where statistical systems are trained on large numbers of example translations. This paradigm has largely supplanted the historically dominant "knowledge-rich" approach, which employs human domain experts to manually craft translation rules. These latter systems continue to be used, particularly in low-resource settings (where little to no parallel data exists), for morphologically rich languages, and in specialized domains. A key difference between the approaches is in their treatment of linguistic phenomena. Knowledge-rich systems construct rich morpho-syntactic representations of the input and output languages, in contrast to data-driven systems, which eschew linguistics in favor of raw statistics over symbol mappings.
This SCALE’16 task will focus on an integrated hybrid approach to statistical machine translation. We will pursue two directions. The first will be to extract from knowledge-rich systems phrase pairs and hierarchical rules suitable for use in modern statistical translation engines. Concurrently, we will look at ways of extending the statistical system with information from linguistic analyses produced by the knowledge-rich systems. Our evaluation will investigate a range of typologically diverse language pairs, domains, and resource settings. It will use standard MT metrics such as BLEU, as well as a document-based metric designed to gauge reader comprehension.
SCALE 2015: Two workshops
Speech-to-text translation for low-resource languages
Automated translation of human speech has been a long-term research goal; it was mentioned by President Clinton in his 2000 State of the Union address, and more recently, prototypes were announced by Skype and Google in January 2015. Spoken language differs significantly from the high-quality, grammatical inputs typically provided to Machine Translation (MT) systems; difficulties include pauses, disfluencies, corrections, interruptions, repetition, and restarts. Speech To Text Translation (STTT) thus suffers from problems intrinsic to speech recognition, difficulties inherent in translation, and problems that arise from the composition of the two.
This workshop will investigate methods to develop and improve an integrated STTT pipeline. More than one language pair will be studied to avoid language-specific solutions, and to explore the effect of differently resourced languages (i.e., languages with higher speech error rates). Multiple approaches to the problem will be considered, and candidate research topics include: domain adaptation for training MT systems on speech output; supplementing translation training data; leveraging ASR lattices; and evaluation of downstream metrics that model different usage scenarios. We expect these efforts to produce important baselines and a comprehensive quantitative and qualitative understanding of the problems of low-resource speech-to-text translation.
Chinese Entity Discovery and Linking
Names written in Chinese tend to be hard to identify, highly ambiguous, and difficult to map to the corresponding English name. SCALE 2015 will advance the state of the art in discovering and linking entity mentions across multiple documents, and tying those mentions to entries in a knowledge base. Entity discovery and linking (EDL) systems use a variety of analytics (such as named entity recognition and relation identification) to identify the referent of a given entity mention. SCALE 2015 focuses on EDL over Chinese text, as described in the NIST Text Analysis Conference (TAC) Chinese Entity Discovery and Linking track. The goal of the summer effort is to improve the accuracy of individual underlying analytics, thereby increasing performance on the overall TAC EDL task. Given interest from SCALE participants, the SCALE 2015 system could be submitted as an entry in the upcoming NIST TAC Tri-lingual EDL evaluation.
SCALE 2014: Low-Resource Content-Based Recommender Systems for Spoken Documents
Within the growing sea of digital media available on the web, ranging from YouTube videos to podcasts, automatically connecting consumers with content of interest is an increasingly important goal. Content-based recommender systems leverage an individual user’s preference ratings to model their interests and information needs in order to make future recommendations. In the context of spoken documents, speech-to-text systems are traditionally used to generate lexical features on which to base these user models. However, training high-accuracy speech-to-text systems requires large collections of manually transcribed speech audio, a prerequisite that limits the reach of recommender system technology.
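To make the idea of a content-based user model concrete, here is a minimal Python sketch. It is an illustration only, not the workshop's system: the transcripts, the ratings of +1 and -1, the whitespace tokenizer, and the function names (`term_counts`, `build_profile`, `cosine`, `recommend`) are all assumptions. It builds a user profile as a rating-weighted sum of bag-of-words counts from transcripts and ranks unseen documents by cosine similarity to that profile.

```python
import math
from collections import Counter


def term_counts(transcript):
    """Turn a transcript into bag-of-words counts (naive whitespace tokens)."""
    return Counter(transcript.lower().split())


def build_profile(rated_docs):
    """Sum term counts weighted by the user's rating (+1 liked, -1 disliked)."""
    profile = Counter()
    for transcript, rating in rated_docs:
        for term, count in term_counts(transcript).items():
            profile[term] += rating * count
    return profile


def cosine(a, b):
    """Cosine similarity between two sparse term-weight mappings."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def recommend(profile, candidates):
    """Return candidate names ordered from most to least similar to the profile."""
    scored = {name: cosine(profile, term_counts(text))
              for name, text in candidates.items()}
    return sorted(scored, key=scored.get, reverse=True)


# Hypothetical transcripts the user has already rated.
rated = [
    ("deep learning for speech recognition", 1),
    ("celebrity gossip and fashion news", -1),
]
# Hypothetical unseen spoken documents, represented by their transcripts.
candidates = {
    "podcast_a": "speech recognition with neural networks",
    "podcast_b": "fashion week celebrity interviews",
}

profile = build_profile(rated)
ranking = recommend(profile, candidates)
print(ranking)
```

Because every feature here comes from the transcript, errors in the speech-to-text output flow directly into the profile, which is why transcript quality matters for this kind of model.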
We will investigate content-based recommender systems for spoken documents specifically in low-resource settings where access to recognizer training data is limited. We will measure the downstream impact of degrading speech-to-text quality by varying not only the quantity of available training data, but also the language of application and the fidelity of the audio. In this constrained setting, alternative information sources less commonly considered in recommender systems can play an increasingly valuable role. Thus, in conjunction with transcript quality, we will evaluate the utility of speaker/language/gender recognition, acoustic event detection, entity linking, context analysis, and zero-resource linguistic discovery, as well as develop the requisite information retrieval back-end techniques to fuse these noisy and heterogeneous feature inputs. Our efforts will produce an essential benchmark for the current state of the art in content-based recommender systems and identify new recipes for maximizing their performance in low-resource settings.
SCALE 2013: Two workshops
Linking Evidence to Reports
Given a report (from a newspaper or an information analyst) and a collection of material that contains supporting evidence, can one identify and align relevant source materials to this summary document? In keeping with the structure of SCALE 2012, this project serves as a high-level end task that decomposes into a number of interesting sub-projects, in this case relevant to DARPA's DEFT program. At a coarse level, the task can be viewed as finding source documents that are relevant to a report, but the primary focus will be on the finer-grained task of linking predicates and their arguments, both within and across documents.
Robust Speaker Recognition for Real Data
Research in the field of automatic speaker recognition has made great progress over the past few years, as demonstrated by impressive performance in NIST Speaker Recognition Evaluations. However, real applications can present difficulties not captured by current efforts. In this workshop, we plan to address the following challenges:
Limited training data
Recent speaker recognition techniques often assume a large set of in-domain labeled development data from which to estimate model parameters. The focus of this problem is to develop new methods of training speaker recognition systems where labeled data is limited.
Noisy labels
A common assumption feeding many speaker recognition modeling techniques is that the labels on training and development data are 100% accurate. This challenge is to understand and develop techniques using only noisy labels for speaker recognition systems.
Large-scale speaker clustering with side-information
For some applications, speaker clustering is more interesting than per-cut speaker identification. This challenge is to develop algorithms that perform speaker clustering efficiently and that handle the related problems of cluster merging and splitting.
SCALE 2012: Vertex Nomination on Attributed Graphs
If I know of a few "interesting" people, how can human language technology and graph theory help me find other interesting people? If I know of a few people committing a crime (e.g. fraud), how can I determine who their co-conspirators are?
If I can infer basic properties of an individual, does this help? Given a set of actors deemed interesting, we aim to find other actors who are similarly interesting. We are given a collection of informal communications (written and spoken) and a corresponding communications graph.
In this graph, each vertex represents either a communication handle or a communication (e.g., an email), and each edge connects a handle to a communication in which that handle participated. Our goals are three-fold: (1) posit a set of actors that use one or more handles; (2) associate author attributes with actors, based on communication content; and (3) nominate an actor as interesting, based on other actors already labeled interesting.
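The handle-and-communication graph described above is bipartite, and a small Python sketch may help make its structure concrete. This is only an illustration under stated assumptions: the handles, the communication IDs, and the helper names `build_bipartite` and `co_participants` are hypothetical and do not come from the workshop data or tools. The sketch stores both directions of adjacency and finds the handles that share at least one communication with a given handle.

```python
from collections import defaultdict

# Hypothetical toy data: each communication ID maps to the handles that took part.
communications = {
    "email_1": ["alice@corp", "bob@corp"],
    "email_2": ["bob@corp", "admin@corp"],
    "call_1": ["alice@corp", "admin@corp", "carol@corp"],
}


def build_bipartite(communications):
    """Return adjacency sets for both sides of the handle/communication graph."""
    handle_to_comms = defaultdict(set)
    comm_to_handles = defaultdict(set)
    for comm_id, handles in communications.items():
        for handle in handles:
            # One edge per (handle, communication) participation.
            handle_to_comms[handle].add(comm_id)
            comm_to_handles[comm_id].add(handle)
    return handle_to_comms, comm_to_handles


def co_participants(handle, handle_to_comms, comm_to_handles):
    """Handles two hops away, i.e. those sharing at least one communication."""
    neighbours = set()
    for comm_id in handle_to_comms.get(handle, set()):
        neighbours |= comm_to_handles[comm_id]
    neighbours.discard(handle)
    return neighbours


handle_to_comms, comm_to_handles = build_bipartite(communications)
print(sorted(co_participants("bob@corp", handle_to_comms, comm_to_handles)))
```

Note that vertices on the handle side are handles, not actors. Deciding which handles belong to the same actor (goal 1) is a separate inference step that this sketch does not attempt.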
For an illustrative example, consider a corporate email corpus that consists of communications between actors, a few of whom are committing fraud.
Some of their fraudulent activity is captured in emails between them, along with many other innocuous emails (both between the fraudsters and between the other employees in the company).
Some accounts may be used by multiple actors, such as an administrative account used by multiple administrators. Some actors may use multiple accounts, such as an administrator who uses the administrative account as well as their individual email address. We are to assign basic properties to the actors based on their language use.
We are then given the identities of a few fraudster vertices and asked to nominate one other vertex in the graph as likely representing another actor committing fraud.
SCALE 2011: Vertex Nomination
If I know of a few "interesting" people, how can human language technology and graph theory help me find other "interesting" people? If I know of a few people committing a crime (e.g. fraud), how can I determine who their co-conspirators are?
Given a set of actors deemed "interesting", we aim to find other actors who are similarly "interesting". We are given a collection of informal communications (written and spoken) and a corresponding communications graph. In this graph, each vertex represents an actor and each edge connects a pair of actors that communicate. Attached to each edge is the set of documents in which that pair of actors communicates, providing content in context (i.e., the language of a communication in the context of who speaks to whom). In this set of documents, the members of our identified "interesting" set communicate with each other and with other actors, whose "interestingness" is unknown. Our objective is to nominate, from all candidate vertices (those with unknown "interestingness"), the one vertex that is most likely "interesting".
For an illustrative example, the email corpus of a hypothetical corporation consists of communications between actors, a few of whom are committing fraud. Some of their fraudulent activity is captured in emails between them, along with many other innocuous emails (both between the fraudsters and between the other employees in the company). We are given the identities of a few fraudster vertices and asked to nominate one other vertex in the graph as likely representing another actor committing fraud.
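As a rough illustration of the nomination task, the Python sketch below scores each candidate by the number of documents it shares with actors already known to be "interesting" and nominates the highest scorer. This is a naive, context-only baseline chosen for illustration, not the method developed at the workshop. It ignores document content entirely, and the actor names, document IDs, and the `nominate` function are hypothetical.

```python
from collections import defaultdict

# Hypothetical toy graph: each edge (an unordered pair of actors) carries
# the documents in which that pair communicates.
edges = {
    frozenset({"ann", "ben"}): ["doc1", "doc2"],
    frozenset({"ann", "cat"}): ["doc3"],
    frozenset({"ben", "cat"}): ["doc4", "doc5"],
    frozenset({"cat", "dan"}): ["doc6"],
    frozenset({"dan", "eve"}): ["doc7", "doc8", "doc9"],
}
interesting = {"ann", "ben"}  # actors already labeled "interesting"


def nominate(edges, interesting):
    """Nominate the candidate sharing the most documents with known actors."""
    scores = defaultdict(int)
    for pair, docs in edges.items():
        known = pair & interesting
        unknown = pair - interesting
        # Only edges joining one known actor to one candidate add evidence.
        if len(known) == 1 and len(unknown) == 1:
            (candidate,) = unknown
            scores[candidate] += len(docs)
    if not scores:
        return None, {}
    # Sort first so that ties are broken alphabetically and deterministically.
    best = max(sorted(scores), key=scores.get)
    return best, dict(scores)


best, scores = nominate(edges, interesting)
print(best, scores)
```

A fuller approach would also use the language of the attached documents, since the task pairs content with context. This sketch uses only who communicates with whom.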
SCALE 2010: All-Source Knowledge Base Population
While traditional extraction research has focused on lexically‐anchored facts, typically within a sentence, SCALE 2010 explored research opportunities in inferring facts from non‐explicit or latent features spanning multiple utterances, documents, or conversations. Examples include inferring the relationship between dialogue participants based on an analysis of register, discourse structure, utterance length, emotion, prosody, etc.; and identifying the sentiment of speakers or authors towards specific entities or topics. Knowledge‐Base Population (KBP) is related to other areas of HLT such as extraction and question‐answering, but is focused on the insertion of information into a knowledge base.
Much of the prior HLT work has focused primarily on vast quantities of English newswire. SCALE 2010 focused on different sources, especially informal communications, spoken as well as written. The team also explored what could be done with multiple sources, an additional language, and limited training data. To address these challenges, SCALE 2010 focused on all‐source knowledge base population. Research directions included:
- Multiple‐sources: including conversational speech and informal text genres
- Languages: English and Arabic, individually and in combination
- Linking entities into a knowledge base
- Inference of higher‐order knowledge units such as relations and sentiment
- Maintaining knowledge base viability when augmenting it over time
SCALE 2009: Semantically Informed Machine Translation / Robust Speech Technology [2 topics]
Semantically Informed Machine Translation
The team's approach was to detect High-Information-Value Elements (HIVEs) in foreign-language text and give them semantic structure, in order to inform the machine translation (MT) process and produce better translations.
Robust Speech Technology
When attempting to extract information from a speech corpus, it is desirable to employ speech processing tools, such as speech recognizers, that will facilitate the task. In many situations such tools are not available because there is no transcribed speech with which to train the recognizers. The workshop addressed this problem by advancing the technology for automatically training speech recognizers without manual transcriptions. Previous work had demonstrated that speech recognizers trained automatically without transcriptions can be useful for information extraction from speech. That work established the feasibility of training without transcriptions, but performance and robustness still needed considerable improvement. In addition, the team wanted to learn how to use small amounts (less than an hour) of manually transcribed data more effectively as it becomes available. Training followed a sequential, hierarchical approach to learning: at each stage, the team extracted information from the corpus and made it available to the next stage.