SCALE 2021

Cross Language Information Retrieval 

June 1-August 6, 2021 (10 weeks), Baltimore, Maryland


We are seeking outstanding faculty, researchers, graduate and undergraduate students for summer positions in information retrieval and natural language processing at the SCALE summer workshop at the Johns Hopkins University Human Language Technology Center of Excellence. Participants will ideally attend for all ten weeks, though exceptions will be considered for shorter duration participation. U.S. citizenship is required.

Task Description

SCALE 2021 will focus on applying neural network advances to cross-language information retrieval (CLIR), in which a user poses queries in one language and receives relevant documents in another language. The workshop will conduct research aimed at building state-of-the-art systems for ranking foreign language documents retrieved for information needs expressed in English. In addition to this primary CLIR task, three extensions will be pursued: 1) improving search by exploiting a user’s previous queries or work product, 2) diversity ranking of multilingual retrieval results to ensure the user sees all aspects of those results, and 3) retrieval of informal foreign language content in texts such as microblogs. If successful, the project will provide non-speakers of a language the ability to search over that language, and will enable discovery of relevant documents written in multiple languages with a single query.

Primary Task: Cross-Language Information Retrieval

The main task for SCALE 2021 is cross-language information retrieval that ranks documents in a high-resource language such as Spanish or Russian based on an English query. We are interested in two variants of this primary task: initial document retrieval and document re-ranking. In the former task, a large collection of documents (500,000 or orders of magnitude more) is searched to find those relevant to an English query. In the latter task, we assume that a ranked list of hundreds or thousands of documents possibly relevant to the user’s overall efforts has already been created; the purpose of the re-ranking system is to move the documents most likely to be relevant to the information need to the beginning of that list. SCALE will produce a state-of-the-art neural system for cross-language retrieval.

Extension A: Searcher-Biased Cross-Language Ranked Retrieval

The goal of this extension is to create a model of a professional searcher’s interests to improve retrieval ranking. The model might be based on previous work product, prior queries, elicited preferences, etc. For example, given a ranked list of Spanish documents, the system would re-rank the documents based both on the original scores by which they were ranked and on their relationships to a model of the user’s interests. If successful, the work done for this extension will enable multiple sources of information to be integrated into a query that the user likely could not create on their own without significant manual effort, and will produce better ranked results, thereby reducing time spent reviewing less relevant documents.
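As a rough illustration of the re-ranking example above, the sketch below linearly interpolates each document's original retrieval score with an interest score. This is only one simple way to combine the two signals, not the method the workshop will use; the function name `rerank_with_interest`, the weight `alpha`, and the assumption that interest scores already lie in [0, 1] are all hypothetical.

```python
def rerank_with_interest(results, interest_scores, alpha=0.7):
    """Re-rank (doc_id, retrieval_score) pairs using a user-interest signal.

    results: list of (doc_id, retrieval_score), in any order.
    interest_scores: dict mapping doc_id to an interest score in [0, 1]
        (hypothetical output of some model of the searcher's interests).
    alpha: weight on the original retrieval score (assumed, not tuned).
    """
    if not results:
        return []

    # Min-max normalize retrieval scores so both signals share a [0, 1] scale.
    scores = [score for _, score in results]
    low, high = min(scores), max(scores)
    span = high - low

    combined = []
    for doc_id, score in results:
        normalized = (score - low) / span if span > 0 else 1.0
        interest = interest_scores.get(doc_id, 0.0)  # unknown docs get 0
        combined.append((doc_id, alpha * normalized + (1 - alpha) * interest))

    # Highest combined score first.
    return sorted(combined, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    ranked = [("doc_a", 10.0), ("doc_b", 8.0), ("doc_c", 6.0)]
    interests = {"doc_c": 1.0}
    for doc_id, score in rerank_with_interest(ranked, interests, alpha=0.4):
        print(doc_id, round(score, 3))
```

With a low `alpha`, a document that matches the interest model strongly can move ahead of documents with higher original scores; with `alpha` near 1, the original ranking is largely preserved.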

Extension B: Diversity Ranking

This extension is designed to ensure that every aspect of the information need that is present in the user’s in-box is covered by the top ranked results. That is, no one aspect should dominate these top results. The techniques we develop should work well even in the case where the in-box contains documents in more than one language. If successful, this extension will give the user a broad view of the aspects present in the in-box, without needing to first read a large number of documents.

Extension C: Cross-Language Retrieval of Social Media Content

The purpose of this extension is to retrieve informal language content (e.g., tweets) in foreign languages using an English query. This is a fundamentally different task from the main CLIR task because of the special properties of social media language. For example, such “documents” are short, informal, unedited, and often lacking in context. They contain misspellings, ad hoc abbreviations, elisions, emojis, and a host of other phenomena not found in newswire and other formal genres. If successful, this extension will provide a capability that has not been well studied heretofore.

Corpora and Metrics

Experimental validation is a key part of a SCALE workshop. This work will be evaluated on a new cross-language information retrieval collection drawn from three to five non-English languages, with 50 event-oriented topics per language. Baseline performance numbers will be generated from an initial SCALE system and from existing systems developed through the IARPA MATERIAL and BETTER programs. The performance metrics will be mean average precision (MAP) and normalized discounted cumulative gain (nDCG).
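For readers less familiar with the two metrics, the following is a minimal sketch of how they are commonly computed. It uses the standard textbook definitions (nDCG with linear gain and a log2 rank discount); the exact variants, cutoffs, and tooling used in the workshop evaluation are not specified here, and the function and variable names are illustrative only.

```python
import math


def average_precision(ranked_ids, relevant_ids):
    """Average precision for one topic, given binary relevance judgments."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant rank
    return precision_sum / len(relevant)


def mean_average_precision(runs, qrels):
    """Mean of per-topic average precision.

    runs: dict topic -> ranked list of doc ids.
    qrels: dict topic -> collection of relevant doc ids.
    """
    return sum(average_precision(runs.get(t, []), qrels[t]) for t in qrels) / len(qrels)


def ndcg(ranked_ids, gains, k=None):
    """nDCG with linear gain; gains maps doc id -> graded relevance.

    k is an optional rank cutoff (None means no cutoff).
    """
    dcg = sum(
        gains.get(doc_id, 0) / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_ids[:k], start=1)
    )
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(rank + 1) for rank, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0


if __name__ == "__main__":
    print(average_precision(["d1", "d2", "d3"], {"d1", "d3"}))
    print(ndcg(["d2", "d1"], {"d1": 1}))
```

Mean average precision rewards placing all relevant documents early in the ranking, while nDCG additionally accounts for graded relevance judgments.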


For Additional Information: 

Contact us at [email protected]

Human Language Technology Center of Excellence