Symbolic constraints and statistical methods: Use together for best results

Constantine Lignos

February 18, 2013


Language is a product of the human mind, and language data reflect the mind's underlying structures, constraints, and limitations. Many of these restrictions appear to be symbolic and may not have obvious statistical analogues. In this talk, I will present my research into two areas where taking advantage of insights into linguistic and cognitive structure has enabled the development of computationally efficient solutions that combine the strengths of symbolic and statistical methods.

(1) Language and codeswitching identification: Short, often mixed-language messages such as those on Twitter pose obvious challenges for language identification systems designed for single-language documents. I'll present Codeswitchador, a system I developed as part of SCALE 2012, which can accurately perform word-by-word language identification in short messages and identify codeswitching in large-scale data sets. I'll discuss how the system was applied to construct the first large-scale corpus of Spanish/English codeswitched tweets and to evaluate previous linguistic claims about preferred contexts for, and structural constraints on, codeswitching.

(2) Infant word segmentation: During the first year of life, infants begin to segment words from a continuous stream of sounds. While previous computational models have proposed possible statistical solutions to word segmentation, these models make no attempt to be cognitively plausible or to reflect infants' development. I'll review previous adult and infant word segmentation experiments and draw on that work to motivate an efficient, cognitively oriented, online-learning word segmentation model. I'll demonstrate that it performs well and that its changes in performance and error patterns resemble those of children.

This line of research demonstrates that taking advantage of linguistic structure in conjunction with large-scale data can lead to the development of high-performing, computationally efficient solutions for natural language problems.


Constantine Lignos is a PhD student in the Department of Computer and Information Science at the University of Pennsylvania. His research focuses on efficient approaches to unsupervised and resource-constrained language processing applications. His main areas of research include unsupervised learning of words and word structure, modeling cognitive language processes, and natural language understanding for robotics. He has built a number of language learning and understanding systems with cognitive underpinnings: MORSEL, a rule-based unsupervised morphological analyzer; CATS, an efficient and cognitively plausible infant word segmentation model; SLURP, a system for natural language understanding for human-robot interaction; and Codeswitchador, a system for language and codeswitching identification at the word level, developed as part of SCALE 2012. Before starting graduate school, he received a B.A. in Computer Science and Psychology from Yale and worked at Microsoft on speech recognition, text-to-speech, and dialog systems for automotive platforms, contributing to the Ford Sync and Kia UVO products.