Publications

Additional (non-HLTCOE) publications may be found on researchers' personal websites.


2015 (29 total)

Inferring Latent User Properties from Texts Published in Social Media (Demo)
Svitlana Volkova, Yoram Bachrach, Michael Armstrong and Vijay Sharma
Proceedings of the Twenty-Ninth Conference on Artificial Intelligence (AAAI) – 2015

[pdf] | [bib]

@inproceedings{volkova-EtAl:2015:AAAI, author = {Volkova, Svitlana and Yoram Bachrach and Michael Armstrong and Vijay Sharma}, title = {Inferring Latent User Properties from Texts Published in Social Media (Demo)}, booktitle = {Proceedings of the Twenty-Ninth Conference on Artificial Intelligence (AAAI)}, month = {January}, year = {2015}, address = {Austin, TX}, url = {http://www.aclweb.org/anthology/P/P14/P14-1018} }

Online Bayesian Models for Personal Analytics in Social Media
Svitlana Volkova and Benjamin Van Durme
Proceedings of the Twenty-Ninth Conference on Artificial Intelligence (AAAI) – 2015

[pdf] | [bib]

@inproceedings{volkova-vandurme:2015:AAAI, author = {Volkova, Svitlana and Van Durme, Benjamin}, title = {Online Bayesian Models for Personal Analytics in Social Media}, booktitle = {Proceedings of the Twenty-Ninth Conference on Artificial Intelligence (AAAI)}, month = {January}, year = {2015}, address = {Austin, TX} }

Social Media as a Sensor of Air Quality and Public Response in China
Shiliang Wang, Michael Paul and Mark Dredze
AAAI Workshop on the World Wide Web and Public Health Intelligence – 2015

[bib]

@inproceedings{Wang:2015eu, author = {Shiliang Wang and Michael Paul and Dredze, Mark}, title = {Social Media as a Sensor of Air Quality and Public Response in China}, booktitle = {AAAI Workshop on the World Wide Web and Public Health Intelligence}, year = {2015} }

A Chinese Concrete NLP Pipeline
Nanyun Peng, Francis Ferraro, Mo Yu, Nicholas Andrews, Jay DeYoung, Max Thomas, Matt Gormley, Travis Wolfe, Craig Harman, Benjamin Van Durme and Mark Dredze
North American Chapter of the Association for Computational Linguistics (NAACL), Demonstration Session – 2015

[abstract] [bib]

Abstract

Natural language processing research increasingly relies on the output of a variety of syntactic and semantic analytics. Yet integrating output from multiple analytics into a single framework can be time consuming and slow research progress. We present a CONCRETE Chinese NLP Pipeline: an NLP stack built using a series of open source systems integrated based on the CONCRETE data schema. Our pipeline includes data ingest, word segmentation, part of speech tagging, parsing, named entity recognition, relation extraction and cross document coreference resolution. Additionally, we integrate a tool for visualizing these annotations as well as allowing for the manual annotation of new data. We release our pipeline to the research community to facilitate work on Chinese language tasks that require rich linguistic annotations.
@inproceedings{Peng:2015yf, author = {Nanyun Peng and Francis Ferraro and Mo Yu and Andrews, Nicholas and Jay DeYoung and Max Thomas and Gormley, Matt and Wolfe, Travis and Harman, Craig and Van Durme, Benjamin and Dredze, Mark}, title = {A Chinese Concrete NLP Pipeline}, booktitle = {North American Chapter of the Association for Computational Linguistics (NAACL), Demonstration Session}, year = {2015}, url = {http://aclweb.org/anthology/N/N15/N15-3018.pdf}, abstract = {Natural language processing research increasingly relies on the output of a variety of syntactic and semantic analytics. Yet integrating output from multiple analytics into a single framework can be time consuming and slow research progress. We present a CONCRETE Chinese NLP Pipeline: an NLP stack built using a series of open source systems integrated based on the CONCRETE data schema. Our pipeline includes data ingest, word segmentation, part of speech tagging, parsing, named entity recognition, relation extraction and cross document coreference resolution. Additionally, we integrate a tool for visualizing these annotations as well as allowing for the manual annotation of new data. We release our pipeline to the research community to facilitate work on Chinese language tasks that require rich linguistic annotations.} }

An Empirical Study of Chinese Name Matching and Applications
Nanyun Peng, Mo Yu and Mark Dredze
Association for Computational Linguistics (ACL) (short paper) – 2015

[bib]

@inproceedings{Peng:2015db, author = {Nanyun Peng and Mo Yu and Dredze, Mark}, title = {An Empirical Study of Chinese Name Matching and Applications}, booktitle = {Association for Computational Linguistics (ACL) (short paper)}, year = {2015} }

Improved Relation Extraction with Feature-Rich Compositional Embedding Models
Matt Gormley, Mo Yu and Mark Dredze
Empirical Methods in Natural Language Processing (EMNLP) – 2015

[bib]

@inproceedings{Gormley:2015ly, author = {Gormley, Matt and Mo Yu and Dredze, Mark}, title = {Improved Relation Extraction with Feature-Rich Compositional Embedding Models}, booktitle = {Empirical Methods in Natural Language Processing (EMNLP)}, year = {2015} }

Approximation-Aware Dependency Parsing by Belief Propagation
Matt Gormley, Jason Eisner and Mark Dredze
Transactions of the Association for Computational Linguistics (TACL) – 2015

[bib]

@article{Gormley:2015sf, author = {Gormley, Matt and Eisner, Jason and Dredze, Mark}, title = {Approximation-Aware Dependency Parsing by Belief Propagation}, year = {2015} }

CLPsych 2015 Shared Task: Depression and PTSD on Twitter
Glen Coppersmith, Mark Dredze, Craig Harman, Kristy Hollingshead and Margaret Mitchell
NAACL Workshop on Computational Linguistics and Clinical Psychology – 2015

[abstract] [bib]

Abstract

This paper presents a summary of the Computational Linguistics and Clinical Psychology (CLPsych) 2015 shared and unshared tasks. These tasks aimed to provide apples-to-apples comparisons of various approaches to modeling language relevant to mental health from social media. The data used for these tasks is from Twitter users who state a diagnosis of depression or post traumatic stress disorder (PTSD) and demographically-matched community controls. The unshared task was a hackathon held at Johns Hopkins University in November 2014 to explore the data, and the shared task was conducted remotely, with each participating team submitting scores for a held-back test set of users. The shared task consisted of three binary classification experiments: (1) depression versus control, (2) PTSD versus control, and (3) depression versus PTSD. Classifiers were compared primarily via their average precision, though a number of other metrics are used along with this to allow a more nuanced interpretation of the performance measures.
@inproceedings{Coppersmith:2015eu, author = {Coppersmith, Glen and Dredze, Mark and Harman, Craig and Kristy Hollingshead and Mitchell, Margaret}, title = {CLPsych 2015 Shared Task: Depression and PTSD on Twitter}, booktitle = {NAACL Workshop on Computational Linguistics and Clinical Psychology}, year = {2015}, abstract = {This paper presents a summary of the Computational Linguistics and Clinical Psychology (CLPsych) 2015 shared and unshared tasks. These tasks aimed to provide apples-to-apples comparisons of various approaches to modeling language relevant to mental health from social media. The data used for these tasks is from Twitter users who state a diagnosis of depression or post traumatic stress disorder (PTSD) and demographically-matched community controls. The unshared task was a hackathon held at Johns Hopkins University in November 2014 to explore the data, and the shared task was conducted remotely, with each participating team submitted scores for a held-back test set of users. The shared task consisted of three binary classification experiments: (1) depression versus control, (2) PTSD versus control, and (3) depression versus PTSD. Classifiers were compared primarily via their average precision, though a number of other metrics are used along with this to allow a more nuanced interpretation of the performance measures.} }

Combining Word Embeddings and Feature Embeddings for Fine-grained Relation Extraction
Mo Yu, Matt Gormley and Mark Dredze
North American Chapter of the Association for Computational Linguistics (NAACL) – 2015

[abstract] [bib]

Abstract

Compositional embedding models build a representation for a linguistic structure based on its component word embeddings. While recent work has combined these word embeddings with hand crafted features for improved performance, it was restricted to a small number of features due to model complexity, thus limiting its applicability. We propose a new model that conjoins features and word embeddings while maintaining a small number of parameters by learning feature embeddings jointly with the parameters of a compositional model. The result is a method that can scale to more features and more labels, while avoiding overfitting. We demonstrate that our model attains state-of-the-art results on ACE and ERE fine-grained relation extraction.
@inproceedings{Yu:2015rt, author = {Mo Yu and Gormley, Matt and Dredze, Mark}, title = {Combining Word Embeddings and Feature Embeddings for Fine-grained Relation Extraction}, booktitle = {North American Chapter of the Association for Computational Linguistics (NAACL)}, year = {2015}, abstract = {Compositional embedding models build a representation for a linguistic structure based on its component word embeddings. While recent work has combined these word embeddings with hand crafted features for improved performance, it was restricted to a small number of features due to model complexity, thus limiting its applicability. We propose a new model that conjoins features and word embeddings while maintaining a small number of parameters by learning feature embeddings jointly with the parameters of a compositional model. The result is a method that can scale to more features and more labels, while avoiding overfitting. We demonstrate that our model attains state-of-the-art results on ACE and ERE fine-grained relation extraction.} }

Discovering Links between Reports of Celebrity Suicides and Suicidal Ideation from Social Media
Mrinal Kumar, Mark Dredze, Glen Coppersmith and Munmun De Choudhury
NAACL Workshop on Computational Linguistics and Clinical Psychology – 2015

[bib]

@inproceedings{Kumar:2015dq, author = {Mrinal Kumar and Dredze, Mark and Coppersmith, Glen and Munmun Choudhury}, title = {Discovering Links between Reports of Celebrity Suicides and Suicidal Ideation from Social Media}, booktitle = {NAACL Workshop on Computational Linguistics and Clinical Psychology}, year = {2015} }

Entity Linking for Spoken Language
Adrian Benton and Mark Dredze
North American Chapter of the Association for Computational Linguistics (NAACL) – 2015

[abstract] [bib]

Abstract

Research on entity linking has considered a broad range of text, including newswire, blogs and web documents in multiple languages. However, the problem of entity linking for spoken language remains unexplored. Spoken language obtained from automatic speech recognition systems poses different types of challenges for entity linking; transcription errors can distort the context, and named entities tend to have high error rates. We propose features to mitigate these errors and evaluate the impact of ASR errors on entity linking using a new corpus of entity linked broadcast news transcripts.
@inproceedings{Benton:2015qq, author = {Adrian Benton and Dredze, Mark}, title = {Entity Linking for Spoken Language}, booktitle = {North American Chapter of the Association for Computational Linguistics (NAACL)}, year = {2015}, url = {http://aclweb.org/anthology/N/N15/N15-1024.pdf}, abstract = {Research on entity linking has considered a broad range of text, including newswire, blogs and web documents in multiple languages. However, the problem of entity linking for spoken language remains unexplored. Spoken language obtained from automatic speech recognition systems poses different types of challenges for entity linking; transcription errors can distort the context, and named entities tend to have high error rates. We propose features to mitigate these errors and evaluate the impact of ASR errors on entity linking using a new corpus of entity linked broadcast news transcripts.} }

Evaluation of the Great American Smokeout by Digital Surveillance
J. Westmaas, John Ayers, Mark Dredze and Benjamin Althouse
Society of Behavioral Medicine – 2015

[bib]

@inproceedings{Westmaas:2015rw, author = {J. Westmaas and John Ayers and Dredze, Mark and Benjamin Althouse}, title = {Evaluation of the Great American Smokeout by Digital Surveillance}, booktitle = {Society of Behavioral Medicine}, year = {2015} }

FrameNet+: Fast Paraphrastic Tripling of FrameNet
Ellie Pavlick, Travis Wolfe, Pushpendre Rastogi, Chris Callison-Burch, Mark Dredze and Benjamin Van Durme
Association for Computational Linguistics (ACL) (short paper) – 2015

[bib]

@inproceedings{Pavlick:2015bs, author = {Ellie Pavlick and Wolfe, Travis and Pushpendre Rastogi and Callison-Burch, Chris and Dredze, Mark and Van Durme, Benjamin}, title = {FrameNet+: Fast Paraphrastic Tripling of FrameNet}, booktitle = {Association for Computational Linguistics (ACL) (short paper)}, year = {2015} }

From ADHD to SAD: analyzing the language of mental health on Twitter through self-reported diagnoses
Glen Coppersmith, Mark Dredze, Craig Harman and Kristy Hollingshead
NAACL Workshop on Computational Linguistics and Clinical Psychology – 2015

[abstract] [bib]

Abstract

Many significant challenges exist for the mental health field, but one in particular is a lack of data available to guide research. Language provides a natural lens for studying mental health -- much existing work and therapy have strong linguistic components, so the creation of a large, varied, language-centric dataset could provide significant grist for the field of mental health research. We examine a broad range of mental health conditions in Twitter data by identifying self-reported statements of diagnosis. We systematically explore language differences between ten conditions with respect to the general population, and to each other. Our aim is to provide guidance and a roadmap for where deeper exploration is likely to be fruitful.
@inproceedings{coppersmith15a, author = {Coppersmith, Glen and Dredze, Mark and Harman, Craig and Kristy Hollingshead}, title = {From ADHD to SAD: analyzing the language of mental health on Twitter through self-reported diagnoses}, booktitle = {NAACL Workshop on Computational Linguistics and Clinical Psychology}, year = {2015}, abstract = {Many significant challenges exist for the mental health field, but one in particular is a lack of data available to guide research. Language provides a natural lens for studying mental health -- much existing work and therapy have strong linguistic components, so the creation of a large, varied, language-centric dataset could provide significant grist for the field of mental health research. We examine a broad range of mental health conditions in Twitter data by identifying self-reported statements of diagnosis. We systematically explore language differences between ten conditions with respect to the general population, and to each other. Our aim is to provide guidance and a roadmap for where deeper exploration is likely to be fruitful.} }

Interactive Knowledge Base Population
Travis Wolfe, Mark Dredze, James Mayfield, Paul McNamee, Craig Harman, Tim Finin and Benjamin Van Durme
arXiv – 2015

[bib]

@article{Wolfe:2015qr, author = {Wolfe, Travis and Dredze, Mark and Mayfield, James and McNamee, Paul and Harman, Craig and Finin, Tim and Van Durme, Benjamin}, title = {Interactive Knowledge Base Population}, year = {2015}, url = {http://arxiv.org/pdf/1506.00301.pdf} }

Learning Composition Models for Phrase Embeddings
Mo Yu and Mark Dredze
Transactions of the Association for Computational Linguistics – 2015

[abstract] [bib]

Abstract

Lexical embeddings can serve as useful representations for words for a variety of NLP tasks, but learning embeddings for phrases can be challenging. While separate embeddings are learned for each word, this is infeasible for every phrase. We construct phrase embeddings by learning how to compose word embeddings using features that capture phrase structure and context. We propose efficient unsupervised and task-specific learning objectives that scale our model to large datasets. We demonstrate improvements on both language modeling and several phrase semantic similarity tasks with various phrase lengths. We make the implementation of our model and the datasets available for general use.
@article{TACL586, author = {Mo Yu and Dredze, Mark}, title = {Learning Composition Models for Phrase Embeddings}, year = {2015}, pages = {227--242}, url = {https://tacl2013.cs.columbia.edu/ojs/index.php/tacl/article/view/586/125}, abstract = {Lexical embeddings can serve as useful representations for words for a variety of NLP tasks, but learning embeddings for phrases can be challenging. While separate embeddings are learned for each word, this is infeasible for every phrase. We construct phrase embeddings by learning how to compose word embeddings using features that capture phrase structure and context. We propose efficient unsupervised and task-specific learning objectives that scale our model to large datasets. We demonstrate improvements on both language modeling and several phrase semantic similarity tasks with various phrase lengths. We make the implementation of our model and the datasets available for general use.} }

Predicate Argument Alignment using a Global Coherence Model
Travis Wolfe, Mark Dredze and Benjamin Van Durme
North American Chapter of the Association for Computational Linguistics (NAACL) – 2015

[abstract] [bib]

Abstract

We present a joint model for predicate argument alignment. We leverage multiple sources of semantic information, including temporal ordering constraints between events. These are combined in a max-margin framework to find a globally consistent view of entities and events across multiple documents, which leads to improvements over a very strong local baseline.
@inproceedings{Wolfe:2015qf, author = {Wolfe, Travis and Dredze, Mark and Van Durme, Benjamin}, title = {Predicate Argument Alignment using a Global Coherence Model}, booktitle = {North American Chapter of the Association for Computational Linguistics (NAACL)}, year = {2015}, url = {http://aclweb.org/anthology/N/N15/N15-1002.pdf}, abstract = {We present a joint model for predicate argument alignment. We leverage multiple sources of semantic information, including temporal ordering constraints between events. These are combined in a max-margin framework to find a globally consistent view of entities and events across multiple documents, which leads to improvements over a very strong local baseline.} }

Results from the Centers for Disease Control and Prevention's Predict the 2013–2014 Influenza Season Challenge
M. Biggerstaff, D. Alper, Mark Dredze, S. Fox, I. Fung, K. Hickmann, B. Lewis, R. Rosenfeld, J. Shaman, M-H. Tsou, P. Velardi, A. Vespignani and L. Finelli
International Conference of Emerging Infectious Diseases Conference – 2015

[bib]

@inproceedings{Biggerstaff:2015kb, author = {M. Biggerstaff and D. Alper and Dredze, Mark and S. Fox and I. Fung and K. Hickmann and B. Lewis and R. Rosenfeld and J. Shaman and M-H. Tsou and P. Velardi and A. Vespignani and L. Finelli}, title = {Results from the Centers for Disease Control and Prevention's Predict the 2013--2014 Influenza Season Challenge}, booktitle = {International Conference of Emerging Infectious Diseases Conference}, year = {2015} }

Shifts in Suicidal Ideation Manifested in Social Media Following Celebrity Suicides
Mrinal Kumar, Mark Dredze, Glen Coppersmith and Munmun De Choudhury
Conference on Hypertext and Social Media – 2015

[abstract] [bib]

Abstract

The Werther effect describes the increased rate of completed or attempted suicides following the depiction of an individual's suicide in the media, typically a celebrity. We present findings on the prevalence of this effect in an online platform: r/SuicideWatch on Reddit. We examine both the posting activity and post content after the death of ten high-profile suicides. Posting activity increases following reports of celebrity suicides, and post content exhibits considerable changes that indicate increased suicidal ideation. Specifically, we observe that post-celebrity suicide content is more likely to be inward focused, manifest decreased social concerns, and laden with greater anxiety, anger, and negative emotion. Topic model analysis further reveals content in this period to switch to a more derogatory tone that bears evidence of self-harm and suicidal tendencies. We discuss the implications of our findings in enabling better community support to psychologically vulnerable populations, and the potential of building suicide prevention interventions following high-profile suicides.
@inproceedings{Kumar:2015sf, author = {Mrinal Kumar and Dredze, Mark and Coppersmith, Glen and Munmun Choudhury}, title = {Shifts in Suicidal Ideation Manifested in Social Media Following Celebrity Suicides}, booktitle = {Conference on Hypertext and Social Media}, year = {2015}, abstract = {The Werther effect describes the increased rate of completed or attempted suicides following the depiction of an individual's suicide in the media, typically a celebrity. We present findings on the prevalence of this effect in an online platform: r/SuicideWatch on Reddit. We examine both the posting activity and post content after the death of ten high-profile suicides. Posting activity increases following reports of celebrity suicides, and post content exhibits considerable changes that indicate increased suicidal ideation. Specifically, we observe that post-celebrity suicide content is more likely to be inward focused, manifest decreased social concerns, and laden with greater anxiety, anger, and negative emotion. Topic model analysis further reveals content in this period to switch to a more derogatory tone that bears evidence of self-harm and suicidal tendencies. We discuss the implications of our findings in enabling better community support to psychologically vulnerable populations, and the potential of building suicide prevention interventions following high-profile suicides.} }

Social Media as a Sensor of Air Quality and Public Response in China
Shiliang Wang, Michael Paul and Mark Dredze
Journal of Medical Internet Research (JMIR) – 2015

[bib]

@article{Wang:2015e, author = {Shiliang Wang and Michael Paul and Dredze, Mark}, title = {Social Media as a Sensor of Air Quality and Public Response in China}, year = {2015}, url = {http://www.jmir.org/2015/3/e22/} }

SPRITE: Generalizing Topic Models with Structured Priors
Michael Paul and Mark Dredze
Transactions of the Association for Computational Linguistics (TACL) – 2015

[abstract] [bib]

Abstract

We introduce SPRITE, a family of topic models that incorporates structure into model priors as a function of underlying components. The structured priors can be constrained to model topic hierarchies, factorizations, correlations, and supervision, allowing SPRITE to be tailored to particular settings. We demonstrate this flexibility by constructing a SPRITE-based model to jointly infer topic hierarchies and author perspective, which we apply to corpora of political debates and online reviews. We show that the model learns intuitive topics, outperforming several other topic models at predictive tasks.
@article{Paul:2015sf, author = {Michael Paul and Dredze, Mark}, title = {SPRITE: Generalizing Topic Models with Structured Priors}, year = {2015}, url = {https://tacl2013.cs.columbia.edu/ojs/index.php/tacl/article/view/403/106}, abstract = {We introduce SPRITE, a family of topic models that incorporates structure into model priors as a function of underlying components. The structured priors can be constrained to model topic hierarchies, factorizations, correlations, and supervision, allowing SPRITE to be tailored to particular settings. We demonstrate this flexibility by constructing a SPRITE-based model to jointly infer topic hierarchies and author perspective, which we apply to corpora of political debates and online reviews. We show that the model learns intuitive topics, outperforming several other topic models at predictive tasks.} }

The Hurricane Sandy Twitter Corpus
Haoyu Wang, Eduard Hovy and Mark Dredze
AAAI Workshop on the World Wide Web and Public Health Intelligence – 2015

[abstract] [bib]

Abstract

The growing use of social media has made it a critical component of disaster response and recovery efforts. Both in terms of preparedness and response, public health officials and first responders have turned to automated tools to assist with organizing and visualizing large streams of social media. In turn, this has spurred new research into algorithms for information extraction, event detection and organization, and information visualization. One challenge of these efforts has been the lack of a common corpus for disaster response on which researchers can compare and contrast their work. This paper describes the Hurricane Sandy Twitter Corpus: 6.5 million geotagged Twitter posts from the geographic area and time period of the 2012 Hurricane Sandy.
@inproceedings{Wang:2015ve, author = {Haoyu Wang and Eduard Hovy and Dredze, Mark}, title = {The Hurricane Sandy Twitter Corpus}, booktitle = {AAAI Workshop on the World Wide Web and Public Health Intelligence}, year = {2015}, abstract = {The growing use of social media has made it a critical component of disaster response and recovery efforts. Both in terms of preparedness and response, public health officials and first responders have turned to automated tools to assist with organizing and visualizing large streams of social media. In turn, this has spurred new research into algorithms for information extraction, event detection and organization, and information visualization. One challenge of these efforts has been the lack of a common corpus for disaster response on which researchers can compare and contrast their work. This paper describes the Hurricane Sandy Twitter Corpus: 6.5 million geotagged Twitter posts from the geographic area and time period of the 2012 Hurricane Sandy.} }

Tobacco Watcher: Real-time Global Surveillance for Tobacco Control
Joanna Cohen, John Ayers and Mark Dredze
World Conference on Tobacco or Health (WCTOH) – 2015

[bib]

@inproceedings{Cohen:2015zl, author = {Joanna Cohen and John Ayers and Dredze, Mark}, title = {Tobacco Watcher: Real-time Global Surveillance for Tobacco Control}, booktitle = {World Conference on Tobacco or Health (WCTOH)}, year = {2015} }

Tobacco Watcher: Real-Time Global Tobacco Surveillance Using Online News Media
Joanna Cohen, Rebecca Shillenn, Mark Dredze and John Ayers
Annual Meeting of the Society for Research on Nicotine and Tobacco – 2015

[bib]

@inproceedings{Cohen:2015hl, author = {Joanna Cohen and Rebecca Shillenn and Dredze, Mark and John Ayers}, title = {Tobacco Watcher: Real-Time Global Tobacco Surveillance Using Online News Media}, booktitle = {Annual Meeting of the Society for Research on Nicotine and Tobacco}, year = {2015} }

Tracking Public Awareness of Influenza through Twitter
Michael Smith, David Broniatowski, Michael Paul and Mark Dredze
3rd International Conference on Digital Disease Detection (DDD) – 2015

[bib]

@inproceedings{Smith:2015mz, author = {Michael Smith and David Broniatowski and Michael Paul and Dredze, Mark}, title = {Tracking Public Awareness of Influenza through Twitter}, booktitle = {3rd International Conference on Digital Disease Detection (DDD)}, year = {2015} }

Using Social Media to Perform Local Influenza Surveillance in an Inner-City Hospital
David Broniatowski, Mark Dredze, Michael Paul and Andrea Dugas
JMIR Public Health and Surveillance – 2015

[bib]

@article{Broniatowski:2015pi, author = {David Broniatowski and Dredze, Mark and Michael Paul and Andrea Dugas}, title = {Using Social Media to Perform Local Influenza Surveillance in an Inner-City Hospital}, year = {2015} }

Worldwide Influenza Surveillance through Twitter
Michael Paul, Mark Dredze, David Broniatowski and Nicholas Generous
AAAI Workshop on the World Wide Web and Public Health Intelligence – 2015

[bib]

@inproceedings{Paul:2015la, author = {Michael Paul and Dredze, Mark and David Broniatowski and Nicholas Generous}, title = {Worldwide Influenza Surveillance through Twitter}, booktitle = {AAAI Workshop on the World Wide Web and Public Health Intelligence}, year = {2015} }

Youth Violence: What We Know and What We Need to Know
Brad Bushman, Katherine Newman, Sandra Calvert, Geraldine Downey, Mark Dredze, Michael Gottfredson, Nina Jablonski, Ann Masten, Calvin Morrill, Daniel Neill, Daniel Romer and Daniel Webster
American Psychologist – 2015

[bib]

@article{Bushman:2015fj, author = {Brad Bushman and Katherine Newman and Sandra Calvert and Geraldine Downey and Dredze, Mark and Michael Gottfredson and Nina Jablonski and Ann Masten and Calvin Morrill and Daniel Neill and Daniel Romer and Daniel Webster}, title = {Youth Violence: What We Know and What We Need to Know}, year = {2015} }

Named Entity Recognition for Chinese Social Media with Jointly Trained Embeddings
Nanyun Peng and Mark Dredze
Empirical Methods in Natural Language Processing (EMNLP) – 2015

[bib]

@inproceedings{Peng:2015rt, author = {Nanyun Peng and Dredze, Mark}, title = {Named Entity Recognition for Chinese Social Media with Jointly Trained Embeddings}, booktitle = {Empirical Methods in Natural Language Processing (EMNLP)}, year = {2015} }

2014 (67 total)

Entity Type Recognition for Heterogeneous Semantic Graphs
Jennifer Sleeman, Tim Finin and Anupam Joshi
AI Magazine – 2014

[pdf] | [bib]

@article{Entity_Type_Recognition_for_Heterogeneous_Semantic_Graphs, author = {Jennifer Sleeman and Finin, Tim and Anupam Joshi}, title = {Entity Type Recognition for Heterogeneous Semantic Graphs}, month = {September}, year = {2014} }

Meerkat Mafia: Multilingual and Cross-Level Semantic Textual Similarity systems
Abhay Kashyap, Lushan Han, Roberto Yus, Jennifer Sleeman, Taneeya Satyapanich, Sunil Gandhi and Tim Finin
Proceedings of the 8th International Workshop on Semantic Evaluation – 2014

[pdf] | [bib]

@inproceedings{Meerkat_Mafia_Multilingual_and_Cross_Level_Semantic_Textual_Similarity_systems, author = {Abhay Kashyap and Lushan Han and Roberto Yus and Jennifer Sleeman and Taneeya Satyapanich and Sunil Gandhi and Finin, Tim}, title = {Meerkat Mafia: Multilingual and Cross-Level Semantic Textual Similarity systems}, booktitle = {Proceedings of the 8th International Workshop on Semantic Evaluation}, month = {August}, year = {2014}, publisher = {Association for Computational Linguistics}, pages = {416-423} }

Efficient Elicitation of Annotations for Human Evaluation of Machine Translation
Keisuke Sakaguchi, Matt Post and Benjamin Van Durme
Proceedings of the Workshop on Statistical Machine Translation – 2014

[pdf] | [bib]

@inproceedings{sakaguchi2014efficient, author = {Keisuke Sakaguchi and Post, Matt and Van Durme, Benjamin}, title = {Efficient Elicitation of Annotations for Human Evaluation of Machine Translation}, booktitle = {Proceedings of the Workshop on Statistical Machine Translation}, month = {June}, year = {2014}, address = {Baltimore, Maryland}, publisher = {Association for Computational Linguistics} }

Low-Resource Semantic Role Labeling
Matt Gormley, Margaret Mitchell, Benjamin Van Durme and Mark Dredze
Association for Computational Linguistics (ACL) – 2014

[pdf] | [bib]

@inproceedings{gormley-etal:2014:SRL, author = {Gormley, Matt and Mitchell, Margaret and Van Durme, Benjamin and Dredze, Mark}, title = {Low-Resource Semantic Role Labeling}, booktitle = {Association for Computational Linguistics (ACL)}, month = {June}, year = {2014}, url = {http://www.cs.jhu.edu/~mrg/publications/srl-acl-2014.pdf} }

Robust Feature Extraction Using Modulation Filtering of Autoregressive Models
Sriram Ganapathy, Sri Harish and Hynek Hermansky
2014

[abstract] [pdf] | [bib]

Abstract

Speaker and language recognition in noisy and degraded channel conditions continue to be a challenging problem mainly due to the mismatch between clean training and noisy test conditions. In the presence of noise, the most reliable portions of the signal are the high energy regions which can be used for robust feature extraction. In this paper, we propose a front end processing scheme based on autoregressive (AR) models that represent the high energy regions with good accuracy followed by a modulation filtering process. The AR model of the spectrogram is derived using two separable time and frequency AR transforms. The first AR model (temporal AR model) of the sub-band Hilbert envelopes is derived using frequency domain linear prediction (FDLP). This is followed by a spectral AR model applied on the FDLP envelopes. The output 2-D AR model represents a low-pass modulation filtered spectrogram of the speech signal. The band-pass modulation filtered spectrograms can further be derived by dividing two AR models with different model orders (cut-off frequencies). The modulation filtered spectrograms are converted to cepstral coefficients and are used for a speaker recognition task in noisy and reverberant conditions. Various speaker recognition experiments are performed with clean and noisy versions of the NIST-2010 speaker recognition evaluation (SRE) database using the state-of-the-art speaker recognition system. In these experiments, the proposed front-end analysis provides substantial improvements (relative improvements of up to 25%) compared to baseline techniques. Furthermore, we also illustrate the generalizability of the proposed methods using language identification (LID) experiments on highly degraded high-frequency (HF) radio channels and speech recognition experiments on noisy data.
@{, author = {Ganapathy, Sriram and Sri Harish and Hynek Hermansky}, title = {Robust Feature Extraction Using Modulation Filtering of Autoregressive Models}, month = {June}, year = {2014}, publisher = {IEEE}, url = {http://ieeexplore.ieee.org/xpl/articleDetails.jsp?tp=&arnumber=6826560&queryText%3DRobust+Feature+Extraction+Using+Modulation+Filtering+of+Autoregressive+models}, abstract = {Speaker and language recognition in noisy and degraded channel conditions continue to be a challenging problem mainly due to the mismatch between clean training and noisy test conditions. In the presence of noise, the most reliable portions of the signal are the high energy regions which can be used for robust feature extraction. In this paper, we propose a front end processing scheme based on autoregressive (AR) models that represent the high energy regions with good accuracy followed by a modulation filtering process. The AR model of the spectrogram is derived using two separable time and frequency AR transforms. The first AR model (temporal AR model) of the sub-band Hilbert envelopes is derived using frequency domain linear prediction (FDLP). This is followed by a spectral AR model applied on the FDLP envelopes. The output 2-D AR model represents a low-pass modulation filtered spectrogram of the speech signal. The band-pass modulation filtered spectrograms can further be derived by dividing two AR models with different model orders (cut-off frequencies). The modulation filtered spectrograms are converted to cepstral coefficients and are used for a speaker recognition task in noisy and reverberant conditions. Various speaker recognition experiments are performed with clean and noisy versions of the NIST-2010 speaker recognition evaluation (SRE) database using the state-of-the-art speaker recognition system. In these experiments, the proposed front-end analysis provides substantial improvements (relative improvements of up to 25%) compared to baseline techniques. Furthermore, we also illustrate the generalizability of the proposed methods using language identification (LID) experiments on highly degraded high-frequency (HF) radio channels and speech recognition experiments on noisy data.} }

Findings of the 2014 Workshop on Statistical Machine Translation
Ondřej Bojar, Christian Buck, Christian Federmann, Barry Haddow, Philipp Koehn, Johannes Leveling, Christof Monz, Pavel Pecina, Matt Post, Hervé Saint-Amand, Radu Soricut, Lucia Specia and Aleš Tamchyna
Proceedings of the Ninth Workshop on Statistical Machine Translation – 2014

[bib]

@inproceedings{bojar-EtAl:2014:W14-33, author = {Ondrej Bojar and Christian Buck and Christian Federmann and Barry Haddow and Koehn, Philipp and Johannes Leveling and Christof Monz and Pavel Pecina and Post, Matt and Herve Saint-Amand and Radu Soricut and Lucia Specia and Aleš Tamchyna}, title = {Findings of the 2014 Workshop on Statistical Machine Translation}, booktitle = {Proceedings of the Ninth Workshop on Statistical Machine Translation}, month = {June}, year = {2014}, address = {Baltimore, Maryland, USA}, publisher = {Association for Computational Linguistics}, pages = {12--58}, url = {http://aclweb.org/anthology/W/W14/W14-3302.pdf} }

Summary and Initial Results of the 2013-2014 Speaker Recognition i-vector Machine Learning Challenge
Désiré Bansé, George R. Doddington, Daniel Garcia-Romero, John J. Godfrey, Craig S. Greenberg, Alvin F. Martin, Alan McCree, Mark Przybocki and Douglas A. Reynolds
2014

[abstract] [pdf] | [bib]

Abstract

During late-2013 through mid-2014 NIST coordinated a special machine learning challenge based on the i-vector paradigm widely used by state-of-the-art speaker recognition systems. The i-vector challenge was run entirely online and used as source data fixed-length feature vectors projected into a low-dimensional space (i-vectors) rather than audio recordings. These changes made the challenge more readily accessible, enabled system comparison with consistency in the front-end and in the amount and type of training data, and facilitated exploration of many more approaches than would be possible in a single evaluation as traditionally run by NIST. Compared to the 2012 NIST Speaker Recognition Evaluation, the i-vector challenge saw approximately twice as many participants, and a nearly two orders of magnitude increase in the number of systems submitted for evaluation. Initial results indicate that the leading system achieved a relative improvement of approximately 38% over the baseline system.
@{McCree:2014, author = {Dsir Bans and George R. Doddington and Daniel Garcia-Romero and John J. Godfrey and Craig S. Greenberg and Alvin F. Martin and Alan McCree and Mark Przybocki and Douglas A. Reynolds}, title = {Summary and Initial Results of the 2013-2014 Speaker Recognition i-vector Machine Learning Challenge}, month = {June}, year = {2014}, publisher = {Inter}, abstract = {During late-2013 through mid-2014 NIST coordinated a special machine learning challenge based on the i-vector paradigm widely used by state-of-the-art speaker recognition systems. The i-vector challenge was run entirely online and used as source data fixed-length feature vectors projected into a low-dimensional space (i-vectors) rather than audio recordings. These changes made the challenge more readily accessible, enabled system comparison with consistency in the front-end and in the amount and type of training data, and facilitated exploration of many more approaches than would be possible in a single evaluation as traditionally run by NIST. Compared to the 2012 NIST Speaker Recognition Evaluation, the i-vector challenge saw approximately twice as many participants, and a nearly two orders of magnitude increase in the number of systems submitted for evaluation. Initial results indicate that the leading system achieved a relative improvement of approximately 38% over the baseline system.} }


Unsupervised Domain Adaptation for i-vector Speaker Recognition
Daniel Garcia-Romero, Alan McCree, Stephen Shum, Niko Brummer and Carlos Vaquero
2014

[abstract] [bib]

Abstract

In this paper, we present a framework for unsupervised domain adaptation of PLDA based i-vector speaker recognition systems. Given an existing out-of-domain PLDA system, we use it to cluster unlabeled in-domain data, and then use this data to adapt the parameters of the PLDA system. We explore two versions of agglomerative hierarchical clustering that use the PLDA system. We also study two automatic ways to determine the number of clusters in the in-domain dataset. The proposed techniques are experimentally validated in the recently introduced domain adaptation challenge. This challenge provides a very useful setup to explore domain adaptation since it illustrates a significant performance gap between an in-domain and out-of-domain system. Using agglomerative hierarchical clustering with a stopping criterion based on unsupervised calibration we are able to recover 85% of this gap.
@{Romero:McCree:2014, author = {Garcia-Romero, Daniel and McCree, Alan and Stephen Shum and Niko Brummer and Carlos Vaquero}, title = {UNSUPERVISED DOMAIN ADAPTATION FOR I-VECTOR SPEAKER RECOGNITION}, month = {June}, year = {2014}, publisher = {Odyssey}, abstract = {In this paper, we present a framework for unsupervised domain adaptation of PLDA based i-vector speaker recognition systems. Given an existing out-of-domain PLDA system, we use it to cluster unlabeled in-domain data, and then use this data to adapt the parameters of the PLDA system. We explore two versions of agglomerative hierarchical clustering that use the PLDA system. We also study two automatic ways to determine the number of clusters in the in-domain dataset. The proposed techniques are experimentally validated in the recently introduced domain adaptation challenge. This challenge provides a very useful setup to explore domain adaptation since it illustrates a significant performance gap between an in-domain and out-of-domain system. Using agglomerative hierarchical clustering with a stopping criterion based on unsupervised calibration we are able to recover 85% of this gap.} }

Some Insights From Translating Conversational Telephone Speech
Gaurav Kumar, Matt Post, Daniel Povey and Sanjeev Khudanpur
International Conference on Acoustics, Speech, and Signal Processing (ICASSP) – 2014

[bib]

@inproceedings{kumar2014some, author = {Gaurav Kumar and Post, Matt and Povey, Daniel and Khudanpur, Sanjeev}, title = {Some Insights From Translating Conversational Telephone Speech}, booktitle = {International Conference on Acoustics, Speech, and Signal Processing (ICASSP)}, month = {May}, year = {2014}, address = {Florence, Italy}, url = {http://cs.jhu.edu/~post/papers/kumar2013some.pdf} }

Supervised Domain Adaptation for i-vector Based Speaker Recognition
Alan McCree and Daniel Garcia-Romero
2014

[abstract] [pdf] | [bib]

Abstract

In this paper, we present a comprehensive study on supervised domain adaptation of PLDA based i-vector speaker recognition systems. After describing the system parameters subject to adaptation, we study the impact of their adaptation on recognition performance. Using the recently designed domain adaptation challenge, we observe that the adaptation of the PLDA parameters (i.e. across-class and within-class covariances) produces the largest gains. Nonetheless, length-normalization is also important; whereas using an in-domain UBM and T matrix is not crucial. For the PLDA adaptation, we compare four approaches. Three of them are proposed in this work, and a fourth one was previously published. Overall, the four techniques are successful at leveraging varying amounts of labeled in-domain data and their performance is quite similar. However, our approaches are less involved, and two of them are applicable to a larger class of models (low-rank across-class).
@{, author = {McCree, Alan and Garcia-Romero, Daniel}, title = {SUPERVISED DOMAIN ADAPTATION FOR I-VECTOR BASED SPEAKER RECOGNITION}, institution = {Human Language Technology Center of Excellence, Johns Hopkins University}, month = {May}, year = {2014}, publisher = {IEEE}, pages = {4047 - 4051}, abstract = {In this paper, we present a comprehensive study on supervised domain adaptation of PLDA based i-vector speaker recognition systems. After describing the system parameters subject to adaptation, we study the impact of their adaptation on recognition performance. Using the recently designed domain adaptation challenge, we observe that the adaptation of the PLDA parameters (i.e. across-class and within-class co variances) produces the largest gains. Nonetheless, length-normalization is also important; whereas using an indomani UBM and T matrix is not crucial. For the PLDA adaptation, we compare four approaches. Three of them are proposed in this work, and a fourth one was previously published. Overall, the four techniques are successful at leveraging varying amounts of labeled in-domain data and their performance is quite similar. However, our approaches are less involved, and two of them are applicable to a larger class of models (low-rank across-class).} }

Multiclass Discriminative Training of i-vector Language Recognition
Alan McCree
Odyssey – 2014

[abstract] [pdf] | [bib]

Abstract

The current state-of-the-art for acoustic language recognition is an i-vector classifier followed by a discriminatively-trained multiclass back-end. This paper presents a unified approach, where a Gaussian i-vector classifier is trained using Maximum Mutual Information (MMI) to directly optimize the multiclass calibration criterion, so that no separate back-end is needed. The system is extended to the open set task by training an additional Gaussian model. Results on the NIST LRE11 standard evaluation task confirm that high performance is maintained with this new single-stage approach.
@inproceedings{, author = {McCree, Alan}, title = {Multiclass Discriminative Training of i-vector Language Recognition}, booktitle = {Odyssey}, month = {May}, year = {2014}, abstract = {The current state-of-the-art for acoustic language recognition is an i-vector classifier followed by a discriminatively-trained multiclass back-end. This paper presents a unified approach, where a Gaussian i-vector classifier is trained using Maximum Mutual Information (MMI) to directly optimize the multiclass calibration criterion, so that no separate back-end is needed. The system is extended to the open set task by training an additional Gaussian model. Results on the NIST LRE11 standard evaluation task confirm that high performance is maintained with this new single-stage approach.} }

Improving Speaker Recognition Performance in the Domain Adaptation Challenge Using Deep Neural Networks
Alan McCree, Daniel Garcia-Romero, Xiaohui Zhang and Daniel Povey
2014

[abstract] [pdf] | [bib]

Abstract

Traditional i-vector speaker recognition systems use a Gaussian mixture model (GMM) to collect sufficient statistics (SS). Recently, replacing this GMM with a deep neural network (DNN) has shown promising results. In this paper, we explore the use of DNNs to collect SS for the unsupervised domain adaptation task of the Domain Adaptation Challenge (DAC).We show that collecting SS with a DNN trained on out-of-domain data boosts the speaker recognition performance of an out-of-domain system by more than 25%. Moreover, we integrate the DNN in an unsupervised adaptation framework, that uses agglomerative hierarchical clustering with a stopping criterion based on unsupervised calibration, and show that the initial gains of the out-of-domain system carry over to the final adapted system. Despite the fact that the DNN is trained on the out-of-domain data, the final adapted system produces a relative improvement of more than 30% with respect to the best published results on this task.
@{McCree:2014, author = {McCree, Alan and Garcia-Romero, Daniel and Xaiohui Zhang and Povey, Daniel and }, title = {IMPROVING SPEAKER RECOGNITION PERFORMANCE IN THE DOMAIN ADAPTATION CHALLENGE USING DEEP NEURAL NETWORKS}, institution = {Human Language Technology Center of Excellence & Center for Language and Speech Processing The Johns Hopkins University,}, month = {May}, year = {2014}, abstract = {Traditional i-vector speaker recognition systems use a Gaussian mixture model (GMM) to collect sufficient statistics (SS). Recently, replacing this GMM with a deep neural network (DNN) has shown promising results. In this paper, we explore the use of DNNs to collect SS for the unsupervised domain adaptation task of the Domain Adaptation Challenge (DAC).We show that collecting SS with a DNN trained on out-of-domain data boosts the speaker recognition performance of an out-of-domain system by more than 25%. Moreover, we integrate the DNN in an unsupervised adaptation framework, that uses agglomerative hierarchical clustering with a stopping criterion based on unsupervised calibration, and show that the initial gains of the out-of-domain system carry over to the final adapted system. Despite the fact that the DNN is trained on the out-of-domain data, the final adapted system produces a relative improvement of more than 30% with respect to the best published results on this task.} }

Population Health Concerns During the United States' Great Recession
Ben Althouse, Jon-Patrick Allem, Matt Childers, Mark Dredze and John Ayers
American Journal of Preventive Medicine – 2014

[bib]

@article{Althouse:2014lr, author = {Ben Althouse and Jon-Patrick Allem and Matt Childers and Dredze, Mark and John Ayers}, title = {Population Health Concerns During the United States' Great Recession}, month = {February}, year = {2014}, pages = {166-170}, url = {http://www.ajpmonline.org/article/S0749-3797(13)00581-3/abstract} }

The Language Demographics of Amazon Mechanical Turk
Ellie Pavlick, Matt Post, Ann Irvine, Dmitry Kachaev and Chris Callison-Burch
Transactions of the Association for Computational Linguistics – 2014

[bib]

@article{pavlick2014language, author = {Ellie Pavlick and Post, Matt and Irvine, Ann and Dmitry Kachaev and Callison-Burch, Chris}, title = {The Language Demographics of Amazon Mechanical Turk}, month = {February}, year = {2014}, pages = {79--92}, url = {http://www.cis.upenn.edu/~ccb/publications/language-demographics-of-mechanical-turk.pdf} }

Improving Gender Prediction of Social Media Users via Weighted Annotator Rationales
Svitlana Volkova and David Yarowsky
NIPS 2014 Workshop on Personalization: Methods and Applications – 2014

[bib]

@inproceedings{volkova-yarowsky:2014, author = {Volkova, Svitlana and Yarowsky, David}, title = {Improving Gender Prediction of Social Media Users via Weighted Annotator Rationales}, booktitle = {NIPS 2014 Workshop on Personalization: Methods and Applications}, month = {December}, year = {2014}, address = {Montreal, Canada} }

KELVIN: Extracting Knowledge from Large Text Collections
James Mayfield, Paul McNamee, Craig Harman, Tim Finin and Dawn Lawrie
AAAI Fall Symposium on Natural Language Access to Big Data – 2014

[pdf] | [bib]

@inproceedings{KELVIN_Extracting_Knowledge_from_Large_Text_Collections, author = {Mayfield, James and McNamee, Paul and Craig Harmon and Finin, Tim and Lawrie, Dawn}, title = {KELVIN: Extracting Knowledge from Large Text Collections}, booktitle = {AAAI Fall Symposium on Natural Language Access to Big Data}, month = {November}, year = {2014}, publisher = {AAAI Press} }

Infoboxer: Using Statistical and Semantic Knowledge to Help Create Wikipedia Infoboxes
Roberto Yus, Varish Mulwad, Tim Finin and Eduardo Mena
13th International Semantic Web Conference (ISWC 2014), Riva del Garda (Italy) – 2014

[pdf] | [bib]

@inproceedings{Infoboxer_Using_Statistical_and_Semantic_Knowledge_to_Help_Create_Wikipedia_Infoboxes, author = {Roberto Yus and Varish Mulwad and Finin, Tim and Eduardo Mena}, title = {Infoboxer: Using Statistical and Semantic Knowledge to Help Create Wikipedia Infoboxes}, booktitle = {13th International Semantic Web Conference (ISWC 2014), Riva del Garda (Italy)}, month = {October}, year = {2014} }

Reducing Reliance on Relevance Judgments for System Comparison by Using Expectation-Maximization
Ning Gao, William Webber and Douglas W Oard
The 36th European Conference on Information Retrieval – 2014

[bib]

@inproceedings{Gao2014ECIR, author = {Ning Gao and William Webber and Douglas W Oard}, title = {Reducing Reliance on Relevance Judgments for System Comparison by Using Expectation-Maximization}, booktitle = {The 36th European Conference on Information Retrieval}, year = {2014}, publisher = {Springer}, pages = {1--12}, url = {http://terpconnect.umd.edu/~oard/pdf/ecir14.pdf} }

A Wikipedia-based Corpus for Contextualized Machine Translation.
Jennifer Drexler, Pushpendre Rastogi, Jacqueline Aguilar, Benjamin Van Durme and Matt Post
Proceedings of the Eighth international Conference on Language Resources and Evaluation (LREC) – 2014

[pdf] | [bib]

@inproceedings{Drexler2014, author = {Jennifer Drexler and Pushpendre Rastogi and Aguilar, Jacqueline and Van Durme, Benjamin and Post, Matt}, title = {A Wikipedia-based Corpus for Contextualized Machine Translation.}, booktitle = {Proceedings of the Eighth international Conference on Language Resources and Evaluation (LREC)}, year = {2014} }

Learning Polylingual Topic Models from Code-Switched Social Media Documents
Nanyun Peng, Yiming Wang and Mark Dredze
Association for Computational Linguistics (ACL) – 2014

[abstract] [bib]

Abstract

Code-switched documents are common in social media, providing evidence for polylingual topic models to infer aligned topics across languages. We present Code-Switched LDA (csLDA), which infers language specific topic distributions based on code-switched documents to facilitate multi-lingual corpus analysis. We experiment on two code-switching corpora (English-Spanish Twitter data and English-Chinese Weibo data) and show that csLDA improves perplexity over LDA, and learns semantically coherent aligned topics as judged by human annotators.
@inproceedings{Peng:2014fk, author = {Nanyun Peng and Yiming Wang and Dredze, Mark}, title = {Learning Polylingual Topic Models from Code-Switched Social Media Documents}, booktitle = {Association for Computational Linguistics (ACL)}, year = {2014}, url = {http://www.aclweb.org/anthology/P/P14/P14-2110.pdf}, abstract = {Code-switched documents are common in social media, providing evidence for polylingual topic models to infer aligned topics across languages. We present Code-Switched LDA (csLDA), which infers language specific topic distributions based on code-switched documents to facilitate multi-lingual corpus analysis. We experiment on two code-switching corpora (English-Spanish Twitter data and English-Chinese Weibo data) and show that csLDA improves perplexity over LDA, and learns semantically coherent aligned topics as judged by human annotators.} }
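
To make the modeling idea in the abstract above concrete, here is a schematic generative story for a polylingual topic model over code-switched documents. This is a sketch in our own notation, under the assumption of per-document topic and language proportions with language-specific topic-word distributions; it is not necessarily the exact csLDA specification from the paper.

\[
\begin{aligned}
&\text{for each topic } k \text{ and language } \ell: && \phi_k^{(\ell)} \sim \mathrm{Dirichlet}(\beta) \\
&\text{for each document } d: && \theta_d \sim \mathrm{Dirichlet}(\alpha), \quad \psi_d \sim \mathrm{Dirichlet}(\gamma) \\
&\text{for each token } i \text{ of document } d: && z_{d,i} \sim \mathrm{Cat}(\theta_d), \quad \ell_{d,i} \sim \mathrm{Cat}(\psi_d), \quad w_{d,i} \sim \mathrm{Cat}\big(\phi_{z_{d,i}}^{(\ell_{d,i})}\big)
\end{aligned}
\]

Because the topic proportions \(\theta_d\) are shared across languages within a document, topics inferred from code-switched documents come out aligned across the two languages.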

Measuring Post Traumatic Stress Disorder in Twitter
Glen Coppersmith, Craig Harman and Mark Dredze
International Conference on Weblogs and Social Media (ICWSM) – 2014

[abstract] [pdf] | [bib]

Abstract

Traditional mental health studies rely on information primarily collected and analyzed through personal contact with a health care professional. Recent work has shown the utility of social media data for studying depression, but there have been limited evaluations of other mental health conditions. We consider post traumatic stress disorder (PTSD), a serious condition that affects millions worldwide, with especially high rates in military veterans. We show how to obtain a PTSD classifier for social media using simple searches of available Twitter data, a significant reduction in training data cost compared to previous work on mental health. We demonstrate its utility by an examination of language use from PTSD individuals, and by detecting elevated rates of PTSD at and around US military bases using our classifiers.
@inproceedings{Coppersmith:2014lr, author = {Coppersmith, Glen and Harman, Craig and Dredze, Mark}, title = {Measuring Post Traumatic Stress Disorder in Twitter}, booktitle = {International Conference on Weblogs and Social Media (ICWSM)}, year = {2014}, abstract = {Traditional mental health studies rely on information primarily collected and analyzed through personal contact with a health care professional. Recent work has shown the utility of social media data for studying depression, but there have been limited evaluations of other mental health conditions. We consider post traumatic stress disorder (PTSD), a serious condition that affects millions worldwide, with especially high rates in military veterans. We show how to obtain a PTSD classifier for social media using simple searches of available Twitter data, a significant reduction in training data cost compared to previous work on mental health. We demonstrate its utility by an examination of language use from PTSD individuals, and by detecting elevated rates of PTSD at and around US military bases using our classifiers.} }
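
As a rough illustration of the data-collection-plus-classification recipe sketched in this abstract, the snippet below filters a toy stream of messages with simple self-statement keyword searches and then trains a bag-of-words classifier. Everything here is a hypothetical stand-in: the search phrases, example messages, and the scikit-learn pipeline are our own assumptions for illustration, not the authors' actual queries, data, or classifier.

# A minimal sketch: keyword search for self-stated diagnoses, then a simple
# bag-of-words classifier. All phrases and example messages are invented.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

SEARCH_PHRASES = ["i was diagnosed with ptsd", "my ptsd diagnosis"]  # hypothetical queries

def is_self_statement(text: str) -> bool:
    """Cheap keyword filter standing in for a Twitter search query."""
    lowered = text.lower()
    return any(phrase in lowered for phrase in SEARCH_PHRASES)

# Step 1: "search" a toy stream for self-statements to identify positive users.
stream = [
    "I was diagnosed with PTSD last spring and it's been rough",
    "great game last night, what a finish",
]
assert any(is_self_statement(message) for message in stream)

# Step 2: train a classifier on (toy) tweets from positive users vs. a control sample.
positive_tweets = [
    "couldn't sleep again, the nightmares are back",
    "therapy session today, taking it one day at a time",
]
control_tweets = [
    "new coffee place downtown is amazing",
    "traffic on the beltway is terrible this morning",
]
texts = positive_tweets + control_tweets
labels = [1] * len(positive_tweets) + [0] * len(control_tweets)

model = make_pipeline(CountVectorizer(), LogisticRegression())
model.fit(texts, labels)
print(model.predict(["another sleepless night, the flashbacks won't stop"]))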

Robust Entity Clustering via Phylogenetic Inference
Nicholas Andrews, Jason Eisner and Mark Dredze
Association for Computational Linguistics (ACL) – 2014

[abstract] [bib]

Abstract

Entity clustering must determine when two named-entity mentions refer to the same entity. Typical approaches use a pipeline architecture that clusters the mentions using fixed or learned measures of name and context similarity. In this paper, we propose a model for cross-document coreference resolution that achieves robustness by learning similarity from unlabeled data. The generative process assumes that each entity mention arises from copying and optionally mutating an earlier name from a similar context. Clustering the mentions into entities depends on recovering this copying tree jointly with estimating models of the mutation process and parent selection process. We present a block Gibbs sampler for posterior inference and an empirical evaluation on several datasets. On a challenging Twitter corpus, our method outperforms the best baseline by 12.6 points of F1 score.
@inproceedings{Andrews:2014fk, author = {Andrews, Nicholas and Eisner, Jason and Dredze, Mark}, title = {Robust Entity Clustering via Phylogenetic Inference}, booktitle = {Association for Computational Linguistics (ACL)}, year = {2014}, abstract = {Entity clustering must determine when two named-entity mentions refer to the same entity. Typical approaches use a pipeline architecture that clusters the mentions using fixed or learned measures of name and context similarity. In this paper, we propose a model for cross-document coreference resolution that achieves robustness by learning similarity from unlabeled data. The generative process assumes that each entity mention arises from copying and optionally mutating an earlier name from a similar context. Clustering the mentions into entities depends on recovering this copying tree jointly with estimating models of the mutation process and parent selection process. We present a block Gibbs sampler for posterior inference and an empirical evaluation on several datasets. On a challenging Twitter corpus, our method outperforms the best baseline by 12.6 points of F1 score.} }

Quantifying Mental Health Signals in Twitter
Glen Coppersmith, Mark Dredze and Craig Harman
ACL Workshop on Computational Linguistics and Clinical Psychology – 2014

[abstract] [pdf] | [bib]

Abstract

The ubiquity of social media provides a rich opportunity to enhance the data available to mental health clinicians and researchers, enabling a better-informed and better-equipped mental health field. We present analysis of mental health phenomena in publicly available Twitter data, demonstrating how rigorous application of simple natural language processing methods can yield insight into specific disorders as well as mental health writ large, along with evidence that as-of-yet undiscovered linguistic signals relevant to mental health exist in social media. We present a novel method for gathering data for a range of mental illnesses quickly and cheaply, then focus on analysis of four in particular: post-traumatic stress disorder (PTSD), major depressive disorder, bipolar disorder, and seasonal affective disorder. We intend for these proof-of-concept results to inform the necessary ethical discussion regarding the balance between the utility of such data and the privacy of mental health related information.
@inproceedings{Coppersmith:2014fk, author = {Coppersmith, Glen and Dredze, Mark and Harman, Craig}, title = {Quantifying Mental Health Signals in Twitter}, booktitle = {ACL Workshop on Computational Linguistics and Clinical Psychology}, year = {2014}, abstract = {The ubiquity of social media provides a rich opportunity to enhance the data available to mental health clinicians and researchers, enabling a better-informed and better-equipped mental health field. We present analysis of mental health phenomena in publicly available Twitter data, demonstrating how rigorous application of simple natural language processing methods can yield insight into specific disorders as well as mental health writ large, along with evidence that as-of-yet undiscovered linguistic signals relevant to mental health exist in social media. We present a novel method for gathering data for a range of mental illnesses quickly and cheaply, then focus on analysis of four in particular: post-traumatic stress disorder (PTSD), major depressive disorder, bipolar disorder, and seasonal affective disorder. We intend for these proof-of-concept results to inform the necessary ethical discussion regarding the balance between the utility of such data and the privacy of mental health related information.} }

A Comparison of the Events and Relations Across ACE, ERE, TAC-KBP, and FrameNet Annotation Standards.
Jacqueline Aguilar, Charley Beller, Paul McNamee, Benjamin Van Durme, Stephanie Strassel, Zhiyi Song and Joe Ellis
ACL Workshop: EVENTS – 2014

[pdf] | [bib]

@inproceedings{Aguilar2014, author = {Aguilar, Jacqueline and Charley Beller and McNamee, Paul and Van Durme, Benjamin and Stephanie Strassel and Zhiyi Song and Joe Ellis}, title = {A Comparison of the Events and Relations Across ACE, ERE, TAC-KBP, and FrameNet Annotation Standards.}, booktitle = {ACL Workshop: EVENTS}, year = {2014}, url = {https://www.aclweb.org/anthology/W/W14/W14-2907.pdf} }

Facebook, Twitter and Google Plus for Breaking News: Is there a winner?
Miles Osborne and Mark Dredze
International Conference on Weblogs and Social Media (ICWSM) – 2014

[abstract] [bib]

Abstract

Twitter is widely seen as being the go-to place for breaking news. Recently, however, competing social media have begun to carry news. Here we examine how Facebook, Google Plus and Twitter report on breaking news. We consider coverage (whether news events are reported) and latency (the time when they are reported). Using data drawn from three weeks in December 2013, we identify 29 major news events, ranging from celebrity deaths and plague outbreaks to sports events. We find that all media carry the same major events, but Twitter continues to be the preferred medium for breaking news, almost consistently leading Facebook or Google Plus. Facebook and Google Plus largely repost newswire stories and their main research value is that they conveniently package multiple sources of information together.
@inproceedings{Osborne:2014fk, author = {Miles Osborne and Dredze, Mark}, title = {Facebook, Twitter and Google Plus for Breaking News: Is there a winner?}, booktitle = {International Conference on Weblogs and Social Media (ICWSM)}, year = {2014}, url = {http://www.aaai.org/ocs/index.php/ICWSM/ICWSM14/paper/view/8072}, abstract = {Twitter is widely seen as being the go-to place for breaking news. Recently, however, competing social media have begun to carry news. Here we examine how Facebook, Google Plus and Twitter report on breaking news. We consider coverage (whether news events are reported) and latency (the time when they are reported). Using data drawn from three weeks in December 2013, we identify 29 major news events, ranging from celebrity deaths and plague outbreaks to sports events. We find that all media carry the same major events, but Twitter continues to be the preferred medium for breaking news, almost consistently leading Facebook or Google Plus. Facebook and Google Plus largely repost newswire stories and their main research value is that they conveniently package multiple sources of information together.} }

Featherweight Phonetic Keyword Search for Conversational Speech
Keith Kintzley, Aren Jansen and Hynek Hermansky
International Conference on Acoustics, Speech, and Signal Processing (ICASSP) – 2014

[pdf] | [bib]

@inproceedings{kintzleyfeatherweight, author = {Keith Kintzley and Jansen, Aren and Hermansky, Hynek}, title = {Featherweight Phonetic Keyword Search for Conversational Speech}, booktitle = {International Conference on Acoustics, Speech, and Signal Processing (ICASSP)}, year = {2014} }

Unsupervised Idiolect Discovery for Speaker Recognition
Aren Jansen, Daniel Garcia-Romero, Pascal Clark and Jaime Hernandez-Cordero
International Conference on Acoustics, Speech, and Signal Processing (ICASSP) – 2014

[abstract] [pdf] | [bib]

Abstract

Short-time spectral characterizations of the human voice have proven to be the most dependable features available to modern speaker recognition systems. However, it is well-known that high-level linguistic information such as word usage and pronunciation patterns can provide complementary discriminative power. In an automatic setting, the availability of these idiolectal cues is dependent on access to a word or phonetic tokenizer, ideally in the given language and domain. In this paper, we propose a novel approach to speaker recognition that leverages recently developed zero-resource term discovery algorithms to identify speaker-characteristic lexical and phrasal acoustic patterns without the need for any supervised speech recognition tools. We use the enrollment audio itself to score each trial and perform no model training (supervised or unsupervised) at any stage of the processing, allowing immediate application to any language or domain. We evaluate our approach on the extended 8-conversation core condition of the 2010 NIST SRE and demonstrate a 16% relative (0.06 absolute) reduction in minDCF when combined with a state-of-the-art unsupervised i-vector cosine system.
@inproceedings{jansenidiolect, author = {Jansen, Aren and Garcia-Romero, Daniel and Clark, Pascal and Jaime Hernandez-Cordero}, title = {Unsupervised Idiolect Discovery for Speaker Recognition}, booktitle = {International Conference on Acoustics, Speech, and Signal Processing (ICASSP)}, year = {2014}, abstract = {Short-time spectral characterizations of the human voice have proven to be the most dependable features available to modern speaker recognition systems. However, it is well-known that high-level linguistic information such as word usage and pronunciation patterns can provide complementary discriminative power. In an automatic setting, the availability of these idiolectal cues is dependent on access to a word or phonetic tokenizer, ideally in the given language and domain. In this paper, we propose a novel approach to speaker recognition that leverages recently developed zero-resource term discovery algorithms to identify speaker-characteristic lexical and phrasal acoustic patterns without the need for any supervised speech recognition tools. We use the enrollment audio itself to score each trial and perform no model training (supervised or unsupervised) at any stage of the processing, allowing immediate application to any language or domain. We evaluate our approach on the extended 8-conversation core condition of the 2010 NIST SRE and demonstrate a 16% relative (0.06 absolute) reduction in minDCF when combined with a state-of-the-art unsupervised i-vector cosine system.} }

Bridging the Gap between Speech Technology and Natural Language Processing: An Evaluation Toolbox for Term Discovery Systems
Bogdan Ludusan, Maarten Versteegh, Aren Jansen, Guillaume Gravier, Xuan-Nga Cao, Mark Johnson and Emmanuel Dupoux
Proceedings of the Eighth international Conference on Language Resources and Evaluation (LREC) – 2014

[pdf] | [bib]

@inproceedings{jansenlrec, author = {Bogdan Ludusan and Maarten Versteegh and Jansen, Aren and Guillaume Gravier and Xuan-Nga Cao and Mark Johnson and Emmanuel Dupoux}, title = {Bridging the Gap between Speech Technology and Natural Language Processing: An Evaluation Toolbox for Term Discovery Systems}, booktitle = {Proceedings of the Eighth international Conference on Language Resources and Evaluation (LREC)}, year = {2014} }

Could Behavioral Medicine Lead the Web Data Revolution?
John Ayers, Benjamin Althouse and Mark Dredze
Journal of the American Medical Association (JAMA) – 2014

[bib]

@article{Ayers:2014fk, author = {John Ayers and Benjamin Althouse and Dredze, Mark}, title = {Could Behavioral Medicine Lead the Web Data Revolution?}, year = {2014}, url = {http://jama.jamanetwork.com/article.aspx?articleid=1838433} }

Improving Lexical Embeddings with Semantic Knowledge
Mo Yu and Mark Dredze
Association for Computational Linguistics (ACL) – 2014

[abstract] [bib]

Abstract

Word embeddings learned on unlabeled data are a popular tool in semantics, but may not capture the desired semantics. We propose a new learning objective that incorporates both a neural language model objective and prior knowledge from semantic resources to learn improved lexical semantic embeddings. We demonstrate that our embeddings improve over those learned solely on raw text in three settings: language modeling, measuring semantic similarity, and predicting human judgements.
@inproceedings{Yu:2014, author = {Mo Yu and Dredze, Mark}, title = {Improving Lexical Embeddings with Semantic Knowledge}, booktitle = {Association for Computational Linguistics (ACL)}, year = {2014}, url = {http://www.aclweb.org/anthology/P14-2089}, abstract = {Word embeddings learned on unlabeled data are a popular tool in semantics, but may not capture the desired semantics. We propose a new learning objective that incorporates both a neural language model objective and prior knowledge from semantic resources to learn improved lexical semantic embeddings. We demonstrate that our embeddings improve over those learned solely on raw text in three settings: language modeling, measuring semantic similarity, and predicting human judgements.} }
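
One way to picture the joint objective this abstract describes, in our own notation (the paper's exact formulation may differ): embeddings \(\theta\) are trained to maximize a standard corpus-prediction term plus a weighted term that also asks each word to predict its neighbors in a semantic resource such as a synonym lexicon,

\[
J(\theta) \;=\; \sum_{(w, c) \in \mathcal{D}} \log p_\theta(w \mid c) \;+\; C \sum_{(w, w') \in \mathcal{R}} \log p_\theta(w \mid w'),
\]

where \(\mathcal{D}\) is the set of word-context pairs drawn from raw text, \(\mathcal{R}\) is the set of related word pairs from the semantic resource, and \(C\) controls how strongly the prior knowledge constrains the embeddings.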

What's the Healthiest Day? Circaseptan (Weekly) Rhythms in Healthy Considerations
John Ayers, Benjamin Althouse, Morgan Johnson, Mark Dredze and Joanna Cohen
American Journal of Preventive Medicine – 2014

[bib]

@article{Ayers:2014lr, author = {John Ayers and Benjamin Althouse and Morgan Johnson and Dredze, Mark and Joanna Cohen}, title = {What's the Healthiest Day? Circaseptan (Weekly) Rhythms in Healthy Considerations}, year = {2014}, url = {http://www.ajpmonline.org/article/S0749-3797(14)00099-3/abstract} }

Biases in Predicting the Human Language Model
Alex Fine, Austin Frank, T. Jaeger and Benjamin Van Durme
Association for Computational Linguistics (ACL), Short Papers – 2014

[pdf] | [bib]

@inproceedings{FineFrankJaegerVanDurmeACL14, author = {Alex Fine and Austin Frank and T. Jaeger and Van Durme, Benjamin}, title = {Biases in Predicting the Human Language Model}, booktitle = {Association for Computational Linguistics (ACL), Short Papers}, year = {2014} }

I'm a Belieber: Social Roles via Self-identification and Conceptual Attributes
Charley Beller, Rebecca Knowles, Craig Harman, Shane Bergsma, Margaret Mitchell and Benjamin Van Durme
Association for Computational Linguistics (ACL), Short Papers – 2014

[pdf] | [bib]

@inproceedings{BellerKnowlesHarmanBergsmaMitchellVanDurmeACL14, author = {Charley Beller and Rebecca Knowles and Harman, Craig and Bergsma, Shane and Mitchell, Margaret and Van Durme, Benjamin}, title = {I'm a Belieber: Social Roles via Self-identification and Conceptual Attributes}, booktitle = {Association for Computational Linguistics (ACL), Short Papers}, year = {2014}, url = {http://aclweb.org/anthology/P14-2030} }

Freebase QA: Information Extraction or Semantic Parsing?
Xuchen Yao, Jonathan Berant and Benjamin Van Durme
Association for Computational Linguistics (ACL), Workshop on Semantic Parsing – 2014

[pdf] | [bib]

@inproceedings{YaoBerantVanDurmeACL14, author = {Xuchen Yao and Jonathan Berant and Van Durme, Benjamin}, title = {Freebase QA: Information Extraction or Semantic Parsing?}, booktitle = {Association for Computational Linguistics (ACL), Workshop on Semantic Parsing}, year = {2014} }

Is the Stanford Dependency Representation Semantic?
Rachel Rudinger and Benjamin Van Durme
Association for Computational Linguistics (ACL), Workshop on EVENTS – 2014

[pdf] | [bib]

@inproceedings{RudingerVanDurmeACL14, author = {Rachel Rudinger and Van Durme, Benjamin}, title = {Is the Stanford Dependency Representation Semantic?}, booktitle = {Association for Computational Linguistics (ACL), Workshop on EVENTS}, year = {2014} }

Augmenting FrameNet Via PPDB
Pushpendre Rastogi and Benjamin Van Durme
Association for Computational Linguistics (ACL), Workshop on EVENTS – 2014

[pdf] | [bib]

@inproceedings{RastogiVanDurmeACL14, author = {Pushpendre Rastogi and Van Durme, Benjamin}, title = {Augmenting FrameNet Via PPDB}, booktitle = {Association for Computational Linguistics (ACL), Workshop on EVENTS}, year = {2014} }

Predicting Fine-grained Social Roles with Selectional Preferences
Charley Beller, Craig Harman and Benjamin Van Durme
Association for Computational Linguistics (ACL), Workshop on Language Technologies and Computational Social Science (LACSS) – 2014

[pdf] | [bib]

@inproceedings{BellerHarmanVanDurmeACL14, author = {Charley Beller and Harman, Craig and Van Durme, Benjamin}, title = {Predicting Fine-grained Social Roles with Selectional Preferences}, booktitle = {Association for Computational Linguistics (ACL), Workshop on Language Technologies and Computational Social Science (LACSS)}, year = {2014}, url = {https://www.aclweb.org/anthology/W/W14/W14-2515.pdf} }

Information Extraction over Structured Data: Question Answering with Freebase
Xuchen Yao and Benjamin Van Durme
Association for Computational Linguistics (ACL) – 2014

[pdf] | [bib]

@inproceedings{YaoVanDurmeACL14, author = {Xuchen Yao and Van Durme, Benjamin}, title = {Information Extraction over Structured Data: Question Answering with Freebase}, booktitle = {Association for Computational Linguistics (ACL)}, year = {2014} }

Inferring User Political Preferences from Streaming Communications
Svitlana Volkova, Glen Coppersmith and Benjamin Van Durme
Association for Computational Linguistics (ACL) – 2014

[pdf] | [bib]

@inproceedings{VolkovaCoppersmithVanDurmeACL14, author = {Volkova, Svitlana and Coppersmith, Glen and Van Durme, Benjamin}, title = {Inferring User Political Preferences from Streaming Communications}, booktitle = {Association for Computational Linguistics (ACL)}, year = {2014} }

Particle Filter Rejuvenation and Latent Dirichlet Allocation
Chandler May, Alex Clemmer and Benjamin Van Durme
Association for Computational Linguistics (ACL), Short Papers – 2014

[pdf] | [bib]

@inproceedings{MayClemmerVanDurmeACL14, author = {Chandler May and Alex Clemmer and Van Durme, Benjamin}, title = {Particle Filter Rejuvenation and Latent Dirichlet Allocation}, booktitle = {Association for Computational Linguistics (ACL), Short Papers}, year = {2014} }

Exponential Reservoir Sampling for Streaming Language Models
Miles Osborne, Ashwin Lall and Benjamin Van Durme
Association for Computational Linguistics (ACL), Short Papers – 2014

[pdf] | [bib]

@inproceedings{OsborneLallVanDurmeACL14, author = {Miles Osborne and Ashwin Lall and Van Durme, Benjamin}, title = {Exponential Reservoir Sampling for Streaming Language Models}, booktitle = {Association for Computational Linguistics (ACL), Short Papers}, year = {2014} }

A long, deep and wide artificial neural net for robust speech recognition in unknown noise
Feipeng Li, Phani Sankar Nidadavolu and Hynek Hermansky
INTERSPEECH – 2014

[abstract] [bib]

Abstract

A long deep and wide artificial neural net (LDWNN) with multiple ensemble neural nets for individual frequency subbands is proposed for robust speech recognition in unknown noise. It is assumed that the effect of arbitrary additive noise on speech recognition can be approximated by white noise (or speech-shaped noise) of similar level across multiple frequency subbands. The ensemble neural nets are trained in clean and speech-shaped noise at 20, 10, and 5 dB SNR to accommodate noise of different levels, followed by a neural net trained to select the most suitable neural net for optimum information extraction within a frequency subband. The posteriors from multiple frequency subbands are fused by another neural net to give a more reliable estimation. Experimental results show that the subband ensemble net adapts well to unknown noise.
@inproceedings{li2014ldwnn, author = {Feipeng Li and Phani Sankar Nidadavolu and Hermansky, Hynek}, title = {A long, deep and wide artificial neural net for robust speech recognition in unknown noise}, booktitle = {INTERSPEECH}, year = {2014}, url = {http://www.researchgate.net/publication/261707505_A_long_deep_and_wide_artificial_neural_net_for_robust_speech_recognition_in_unknown_noise}, abstract = {A long deep and wide artificial neural net (LDWNN) with multiple ensemble neural nets for individual frequency subbands is proposed for robust speech recognition in unknown noise. It is assumed that the effect of arbitrary additive noise on speech recognition can be approximated by white noise (or speech-shaped noise) of similar level across multiple frequency subbands. The ensemble neural nets are trained in clean and speech-shaped noise at 20, 10, and 5 dB SNR to accommodate noise of different levels, followed by a neural net trained to select the most suitable neural net for optimum information extraction within a frequency subband. The posteriors from multiple frequency subbands are fused by another neural net to give a more reliable estimation. Experimental results show that the subband ensemble net adapts well to unknown noise.} }

The Machine Translation Leaderboard
Matt Post and Adam Lopez
The Prague Bulletin of Mathematical Linguistics – 2014

[bib]

@article{post2014machine, author = {Post, Matt and Lopez, Adam}, title = {The Machine Translation Leaderboard}, year = {2014}, pages = {37--46}, url = {http://cs.jhu.edu/~post/papers/post-lopez-2014-mt-leaderboard.pdf} }

Music Tonality Features for Speech/Music Discrimination
Greg Sell and Pascal Clark
Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP) – 2014

[pdf] | [bib]

@inproceedings{Sell.Clark:2014A, author = {Sell, Greg and Clark, Pascal}, title = {Music Tonality Features for Speech/Music Discrimination}, booktitle = {Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP)}, year = {2014} }

Automatic Carrier Pitch Estimation for Coherent Demodulation
Greg Sell
Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP) – 2014

[pdf] | [bib]

@inproceedings{Sell:2014A, author = {Sell, Greg}, title = {Automatic Carrier Pitch Estimation for Coherent Demodulation}, booktitle = {Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP)}, year = {2014} }

Speaker Diarization with PLDA I-Vector Scoring and Unsupervised Calibration
Greg Sell and Daniel Garcia-Romero
Proceedings of the IEEE Spoken Language Technology Workshop – 2014

[pdf] | [bib]

@inproceedings{Sell.Garcia-Romero:2014A, author = {Sell, Greg and Garcia-Romero, Daniel}, title = {Speaker Diarization with PLDA I-Vector Scoring and Unsupervised Calibration}, booktitle = {Proceedings of the IEEE Spoken Language Technology Workshop}, year = {2014} }

Unsupervised Lexical Clustering of Speech Segments Using Fixed-Dimensional Acoustic Embeddings
Herman Kamper, Aren Jansen, Simon King and Sharon Goldwater
IEEE Workshop on Spoken Language Technology – 2014

[pdf] | [bib]

@inproceedings{kamperunsupervised, author = {Herman Kamper and Jansen, Aren and Simon King and Sharon Goldwater}, title = {Unsupervised Lexical Clustering of Speech Segments Using Fixed-Dimensional Acoustic Embeddings}, booktitle = {IEEE Workshop on Spoken Language Technology}, year = {2014} }

A Keyword Search System Using Open Source Software
Jan Trmal, Guoguo Chen, Daniel Povey, Sanjeev Khudanpur, Pegah Ghahremani, Xiaohui Zhang, Vimal Manohar, Chunxi Liu, Aren Jansen and Dietrich Klakow
IEEE Workshop on Spoken Language Technology – 2014

[pdf] | [bib]

@inproceedings{trmalkeyword, author = {Jan Trmal and Guoguo Chen and Povey, Daniel and Khudanpur, Sanjeev and Pegah Ghahremani and Xiaohui Zhang and Vimal Manohar and Chunxi Liu and Jansen, Aren and Dietrich Klakow}, title = {A Keyword Search System Using Open Source Software}, booktitle = {IEEE Workshop on Spoken Language Technology}, year = {2014} }

Low-Resource Open Vocabulary Keyword Search Using Point Process Models
Chunxi Liu, Aren Jansen, Guoguo Chen, Keith Kintzley, Jan Trmal and Sanjeev Khudanpur
Fifteenth Annual Conference of the International Speech Communication Association – 2014

[pdf] | [bib]

@inproceedings{liu2014low, author = {Chunxi Liu and Jansen, Aren and Guoguo Chen and Keith Kintzley and Jan Trmal and Khudanpur, Sanjeev}, title = {Low-Resource Open Vocabulary Keyword Search Using Point Process Models}, booktitle = {Fifteenth Annual Conference of the International Speech Communication Association}, year = {2014} }

Social Media Analytics for Smart Health
Ahmed Abbasi, Donald Adjeroh, Mark Dredze, Michael Paul, Fatemeh Zahedi, Huimin Zhao, Nitin Walia, Hemant Jain, Patrick Sanvanson, Reza Shaker, Marco Huesch, Richard Beal, Wanhong Zheng, Marie Abate and Arun Ross
IEEE Intelligent Systems – 2014

[bib]

@article{Dredze:2014lq, author = {Ahmed Abbasi and Donald Adjeroh and Dredze, Mark and Michael Paul and Fatemeh Zahedi and Huimin Zhao and Nitin Walia and Hemant Jain and Patrick Sanvanson and Reza Shaker and Marco Huesch and Richard Beal and Wanhong Zheng and Marie Abate and Arun Ross}, title = {Social Media Analytics for Smart Health}, year = {2014}, pages = {60--80}, url = {http://ieeexplore.ieee.org/xpl/articleDetails.jsp?reload=true&arnumber=6832891} }

A Test Collection for Email Entity Linking
Ning Gao, Douglas Oard and Mark Dredze
NIPS Workshop on Automated Knowledge Base Construction – 2014

[pdf] | [bib]

@inproceedings{Gao:2014ty, author = {Ning Gao and Douglas Oard and Dredze, Mark}, title = {A Test Collection for Email Entity Linking}, booktitle = {NIPS Workshop on Automated Knowledge Base Construction}, year = {2014} }

Faster (and Better) Entity Linking with Cascades
Adrian Benton, Jay DeYoung, Adam Teichert, Mark Dredze, Benjamin Van Durme, Stephen Mayhew and Karen Daughton-Thomas
NIPS Workshop on Automated Knowledge Base Construction – 2014

[pdf] | [bib]

@inproceedings{Benton:2014qe, author = {Adrian Benton and Jay DeYoung and Adam Teichert and Dredze, Mark and Van Durme, Benjamin and Stephen Mayhew and Daughton-Thomas, Karen}, title = {Faster (and Better) Entity Linking with Cascades}, booktitle = {NIPS Workshop on Automated Knowledge Base Construction}, year = {2014} }

Factor-based Compositional Embedding Models
Mo Yu, Matt Gormley and Mark Dredze
NIPS Workshop on Learning Semantics – 2014

[pdf] | [bib]

@inproceedings{Mo-Yu:2014qv, author = {Mo Yu and Gormley, Matt and Dredze, Mark}, title = {Factor-based Compositional Embedding Models}, booktitle = {NIPS Workshop on Learning Semantics}, year = {2014} }

High Risk Pregnancy Prediction from Clinical Text
Rebecca Knowles, Mark Dredze, Kathleen Evans, Elyse Lasser, Tom Richards, Jonathan Weiner and Hadi Kharrazi
NIPS Workshop on Machine Learning for Clinical Data Analysis – 2014

[bib]

@inproceedings{Knowles:2014ly, author = {Rebecca Knowles and Dredze, Mark and Kathleen Evans and Elyse Lasser and Tom Richards and Jonathan Weiner and Hadi Kharrazi}, title = {High Risk Pregnancy Prediction from Clinical Text}, booktitle = {NIPS Workshop on Machine Learning for Clinical Data Analysis}, year = {2014} }

Twitter Improves Influenza Forecasting
Michael Paul, Mark Dredze and David Broniatowski
PLOS Currents Outbreaks – 2014

[abstract] [bib]

Abstract

Accurate disease forecasts are imperative when preparing for influenza epidemic outbreaks; nevertheless, these forecasts are often limited by the time required to collect new, accurate data. In this paper, we show that data from the microblogging community Twitter significantly improves influenza forecasting. Most prior influenza forecast models are tested against historical influenza-like illness (ILI) data from the U.S. Centers for Disease Control and Prevention (CDC). These data are released with a one-week lag and are often initially inaccurate until the CDC revises them weeks later. Since previous studies utilize the final, revised data in evaluation, their evaluations do not properly determine the effectiveness of forecasting. Our experiments using ILI data available at the time of the forecast show that models incorporating data derived from Twitter can reduce forecasting error by 17-30% over a baseline that only uses historical data. For a given level of accuracy, using Twitter data produces forecasts that are two to four weeks ahead of baseline models. Additionally, we find that models using Twitter data are, on average, better predictors of influenza prevalence than are models using data from Google Flu Trends, the leading web data source.
@article{Paul_Dredze_Broniatowski:2014, author = {Michael Paul and Dredze, Mark and David Broniatowski}, title = {Twitter Improves Influenza Forecasting}, year = {2014}, url = {http://currents.plos.org/outbreaks/article/twitter-improves-influenza-forecasting/}, abstract = {Accurate disease forecasts are imperative when preparing for influenza epidemic outbreaks; nevertheless, these forecasts are often limited by the time required to collect new, accurate data. In this paper, we show that data from the microblogging community Twitter significantly improves influenza forecasting. Most prior influenza forecast models are tested against historical influenza-like illness (ILI) data from the U.S. Centers for Disease Control and Prevention (CDC). These data are released with a one-week lag and are often initially inaccurate until the CDC revises them weeks later. Since previous studies utilize the final, revised data in evaluation, their evaluations do not properly determine the effectiveness of forecasting. Our experiments using ILI data available at the time of the forecast show that models incorporating data derived from Twitter can reduce forecasting error by 17-30% over a baseline that only uses historical data. For a given level of accuracy, using Twitter data produces forecasts that are two to four weeks ahead of baseline models. Additionally, we find that models using Twitter data are, on average, better predictors of influenza prevalence than are models using data from Google Flu Trends, the leading web data source.} }

What Are Health-related Users Tweeting? A Qualitative Content Analysis of Health-related Users and their Messages on Twitter
Joy Lee, Matthew DeCamp, Mark Dredze, Margaret Chisolm and Zackary Berger
Journal of Medical Internet Research (JMIR) – 2014

[bib]

@article{Lee:2014ve, author = {Joy Lee and Matthew DeCamp and Dredze, Mark and Margaret Chisolm and Zackary Berger}, title = {What Are Health-related Users Tweeting? A Qualitative Content Analysis of Health-related Users and their Messages on Twitter}, year = {2014}, url = {http://www.jmir.org/2014/10/e237} }

Discovering Health Topics in Social Media Using Topic Models
Michael Paul and Mark Dredze
PLoS ONE – 2014

[abstract] [bib]

Abstract

By aggregating self-reported health statuses across millions of users, we seek to characterize the variety of health information discussed in Twitter. We describe a topic modeling framework for discovering health topics in Twitter, a social media website. This is an exploratory approach with the goal of understanding what health topics are commonly discussed in social media. This paper describes in detail a statistical topic model created for this purpose, the Ailment Topic Aspect Model (ATAM), as well as our system for filtering general Twitter data based on health keywords and supervised classification. We show how ATAM and other topic models can automatically infer health topics in 144 million Twitter messages from 2011 to 2013. ATAM discovered 13 coherent clusters of Twitter messages, some of which correlate with seasonal influenza (r = 0.689) and allergies (r = 0.810) temporal surveillance data, as well as exercise (r = .534) and obesity (r = −.631) related geographic survey data in the United States. These results demonstrate that it is possible to automatically discover topics that attain statistically significant correlations with ground truth data, despite using minimal human supervision and no historical data to train the model, in contrast to prior work. Additionally, these results demonstrate that a single general-purpose model can identify many different health topics in social media.
@article{Paul:2014rt, author = {Michael Paul and Dredze, Mark}, title = {Discovering Health Topics in Social Media Using Topic Models}, year = {2014}, url = {http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0103408}, abstract = {By aggregating self-reported health statuses across millions of users, we seek to characterize the variety of health information discussed in Twitter. We describe a topic modeling framework for discovering health topics in Twitter, a social media website. This is an exploratory approach with the goal of understanding what health topics are commonly discussed in social media. This paper describes in detail a statistical topic model created for this purpose, the Ailment Topic Aspect Model (ATAM), as well as our system for filtering general Twitter data based on health keywords and supervised classification. We show how ATAM and other topic models can automatically infer health topics in 144 million Twitter messages from 2011 to 2013. ATAM discovered 13 coherent clusters of Twitter messages, some of which correlate with seasonal influenza (r = 0.689) and allergies (r = 0.810) temporal surveillance data, as well as exercise (r = .534) and obesity (r = −.631) related geographic survey data in the United States. These results demonstrate that it is possible to automatically discover topics that attain statistically significant correlations with ground truth data, despite using minimal human supervision and no historical data to train the model, in contrast to prior work. Additionally, these results demonstrate that a single general-purpose model can identify many different health topics in social media.} }

Twitter: Big Data Opportunities (Letter)
David Broniatowski, Michael Paul and Mark Dredze
Science – 2014

[bib]

@article{Broniatowski:2014nr, author = {David Broniatowski and Michael Paul and Dredze, Mark}, title = {Twitter: Big Data Opportunities (Letter)}, year = {2014}, pages = {148}, url = {http://www.sciencemag.org/content/345/6193/148.1.full} }

A Large-Scale Quantitative Analysis of Latent Factors and Sentiment in Online Doctor Reviews
Byron Wallace, Michael Paul, Urmimala Sarkar, Thomas Trikalinos and Mark Dredze
Journal of the American Medical Informatics Association (JAMIA) – 2014

[bib]

@article{Wallace:2014qd, author = {Byron Wallace and Michael Paul and Urmimala Sarkar and Thomas Trikalinos and Dredze, Mark}, title = {A Large-Scale Quantitative Analysis of Latent Factors and Sentiment in Online Doctor Reviews}, year = {2014}, url = {http://jamia.bmj.com/content/early/2014/06/10/amiajnl-2014-002711.full} }

HealthTweets.org: A Platform for Public Health Surveillance using Twitter
Mark Dredze, Renyuan Cheng, Michael Paul and David Broniatowski
AAAI Workshop on the World Wide Web and Public Health Intelligence – 2014

[abstract] [bib]

Abstract

We present HealthTweets.org, a new platform for sharing the latest research results on Twitter data with researchers and public officials. In this demo paper, we describe data collection, processing, and features of the site. The goal of this service is to transition results from research to practice.
@inproceedings{Dredze:2014fk, author = {Dredze, Mark and Renyuan Cheng and Michael Paul and David Broniatowski}, title = {HealthTweets.org: A Platform for Public Health Surveillance using Twitter}, booktitle = {AAAI Workshop on the World Wide Web and Public Health Intelligence}, year = {2014}, url = {http://www.aaai.org/ocs/index.php/WS/AAAIW14/paper/download/8723/8218}, abstract = {We present HealthTweets.org, a new platform for sharing the latest research results on Twitter data with researchers and public officials. In this demo paper, we describe data collection, processing, and features of the site. The goal of this service is to transition results from research to practice.} }

Challenges in Influenza Forecasting and Opportunities for Social Media
Michael Paul, Mark Dredze and David Broniatowski
AAAI Workshop on the World Wide Web and Public Health Intelligence – 2014

[bib]

@inproceedings{paul_dredze_aaai:14, author = {Michael Paul and Dredze, Mark and David Broniatowski}, title = {Challenges in Influenza Forecasting and Opportunities for Social Media}, booktitle = {AAAI Workshop on the World Wide Web and Public Health Intelligence}, year = {2014} }

Exploring Health Topics in Chinese Social Media: An Analysis of Sina Weibo
Shiliang Wang, Michael Paul and Mark Dredze
AAAI Workshop on the World Wide Web and Public Health Intelligence – 2014

[abstract] [bib]

Abstract

This paper seeks to identify and characterize health-related topics discussed on the Chinese microblogging website, Sina Weibo. We identified nearly 1 million messages containing health-related keywords, filtered from a dataset of 93 million messages spanning five years. We applied probabilistic topic models to this dataset and identified the prominent health topics. We show that a variety of health topics are discussed in Sina Weibo, and that four flu-related topics are correlated with monthly influenza case rates in China.
@inproceedings{Wang:2014fk, author = {Shiliang Wang and Michael Paul and Dredze, Mark}, title = {Exploring Health Topics in Chinese Social Media: An Analysis of Sina Weibo}, booktitle = {AAAI Workshop on the World Wide Web and Public Health Intelligence}, year = {2014}, url = {http://www.aaai.org/ocs/index.php/WS/AAAIW14/paper/download/8721/8222}, abstract = {This paper seeks to identify and characterize health-related topics discussed on the Chinese microblogging website, Sina Weibo. We identified nearly 1 million messages containing health-related keywords, filtered from a dataset of 93 million messages spanning five years. We applied probabilistic topic models to this dataset and identified the prominent health topics. We show that a variety of health topics are discussed in Sina Weibo, and that four flu-related topics are correlated with monthly influenza case rates in China.} }

Concretely Annotated Corpora
Francis Ferraro, Max Thomas, Matt Gormley, Travis Wolfe, Craig Harman and Benjamin Van Durme
4th Workshop on Automated Knowledge Base Construction (AKBC) – 2014

[pdf] | [bib]

@inproceedings{concretely-annotated-2014, author = {Francis Ferraro and Max Thomas and Gormley, Matt and Wolfe, Travis and Harman, Craig and Van Durme, Benjamin}, title = {Concretely Annotated Corpora}, booktitle = {4th Workshop on Automated Knowledge Base Construction (AKBC)}, year = {2014} }

Seeded graph matching for correlated Erdos-Renyi graphs
Vince Lyzinski, D. Fishkind and Carey Priebe
Journal of Machine Learning Research – 2014

[bib]

@article{JMLR:v15:lyzinski14a, author = {Lyzinski, Vince and D. Fishkind and Priebe, Carey}, title = {Seeded graph matching for correlated Erdos-Renyi graphs}, year = {2014}, pages = {3513-3540}, url = {http://jmlr.org/papers/v15/lyzinski14a.html} }

A limit theorem for scaled eigenvectors of random dot product graphs
A. Athreya, Vince Lyzinski, D. J. Marchette, Carey Priebe, D. L. Sussman and M. Tang
Sankhya A – 2014

[pdf] | [bib]

@article{athreya13, author = {A. Athreya and Lyzinski, Vince and D. J. Marchette and Priebe, Carey and D. L. Sussman and M. Tang}, title = {A limit theorem for scaled eigenvectors of random dot product graphs}, year = {2014}, publisher = {Springer} }

Perfect clustering for stochastic blockmodel graphs via adjacency spectral embedding
Vince Lyzinski, D. Sussman, M. Tang, A. Athreya and Carey Priebe
Electronic Journal of Statistics – 2014

[pdf] | [bib]

@article{perfect, author = {Lyzinski, Vince and D. Sussman and M. Tang and A. Athreya and Priebe, Carey}, title = {Perfect clustering for stochastic blockmodel graphs via adjacency spectral embedding}, year = {2014}, publisher = {Institute of Mathematical Statistics} }


2013 (75 total)

Perceptual Properties of Current Speech Recognition Technology
Hynek Hermansky, Jordan R. Cohen and Richard M. Stern
Proceedings of the IEEE – 2013

[abstract] [pdf] | [bib]

Abstract

In recent years, a number of feature extraction procedures for automatic speech recognition (ASR) systems have been based on models of human auditory processing, and one often hears arguments in favor of implementing knowledge of human auditory perception and cognition into machines for ASR. This paper takes a reverse route, and argues that the engineering techniques for automatic recognition of speech that are already in widespread use are often consistent with some well-known properties of the human auditory system.
@article{hermansky2013perceptual, author = {Hermansky, Hynek and Jordan R. Cohen and Richard M. Stern}, title = {Perceptual Properties of Current Speech Recognition Technology}, journal = {Proceedings of the IEEE}, month = {September}, year = {2013}, publisher = {IEEE}, pages = {1968--1985}, url = {http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=6566018}, abstract = {In recent years, a number of feature extraction procedures for automatic speech recognition (ASR) systems have been based on models of human auditory processing, and one often hears arguments in favor of implementing knowledge of human auditory perception and cognition into machines for ASR. This paper takes a reverse route, and argues that the engineering techniques for automatic recognition of speech that are already in widespread use are often consistent with some well-known properties of the human auditory system.} }

