A Dataset to Support Research on Bilingual Lexicons for Machine Translation (MT)

November 25, 2019

Bilingual lexicons (or bilingual dictionaries) are valuable resources for machine translation. For example, when working with technical documents like patents, a bilingual lexicon consisting of technical jargon is important for ensuring that the translation is precise and correct.

At the Conference on Empirical Methods in Natural Language Processing (EMNLP) in November 2019, JHU researchers released a paper and an associated dataset that will facilitate research on bilingual lexicons for machine translation. The dataset consists of manually curated lexicon entries (e.g. technical terms in English and their Chinese equivalents), cross-referenced with standard evaluation data for machine translation. It contains over 30 thousand bilingual lexicon entries across three language pairs.

Although neural machine translation systems have achieved impressive translation accuracy in certain scenarios, it is currently unclear what the most efficient means of incorporating bilingual lexicons is. One could incorporate lexicons as constraints during decoding, or as additional data during training. The goal of the dataset release is to create a standard benchmark for evaluating different techniques for bilingual lexicon incorporation.
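
To make the "additional data during training" option concrete, here is a minimal Python sketch that appends lexicon entries to a parallel corpus as short pseudo-sentence pairs. It is an illustration under stated assumptions, not the method used in the paper: the function name `augment_with_lexicon`, the `copies` parameter, and the toy English-Chinese entries are all hypothetical and are not taken from the released dataset.

```python
from typing import List, Tuple

# A (source, target) text pair; used for both sentences and lexicon entries.
Pair = Tuple[str, str]


def augment_with_lexicon(corpus: List[Pair], lexicon: List[Pair], copies: int = 1) -> List[Pair]:
    """Return the corpus with each usable lexicon entry appended `copies` times.

    Hypothetical helper: entries with an empty side, and entries already
    present in the corpus or seen earlier in the lexicon, are skipped.
    """
    seen = set(corpus)
    extra: List[Pair] = []
    for src, tgt in lexicon:
        pair = (src.strip(), tgt.strip())
        if not pair[0] or not pair[1] or pair in seen:
            continue
        seen.add(pair)
        # Repeating an entry is one simple way to upweight it during training.
        extra.extend([pair] * copies)
    return corpus + extra


# Toy data, invented for illustration only.
corpus = [("the patent describes a semiconductor device", "该专利描述了一种半导体器件")]
lexicon = [
    ("semiconductor device", "半导体器件"),
    ("integrated circuit", "集成电路"),
    ("integrated circuit", "集成电路"),  # duplicate entry, skipped
]
augmented = augment_with_lexicon(corpus, lexicon, copies=2)
```

The alternative mentioned above, constraining the decoder, leaves the training data untouched and instead requires the target side of matching entries to appear in the output, which typically needs support from the translation toolkit itself.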

For more details and instructions on using the dataset, please refer to:

HABLex: Human Annotated Bilingual Lexicons for Experiments in Machine Translation
Brian Thompson, Rebecca Knowles, Xuan Zhang, Huda Khayrallah, Kevin Duh, Philipp Koehn
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)




Human Language Technology Center of Excellence