Removing the Training Wheels: A Coreference Dataset that Entertains Humans and Challenges Computers
Anupam Guha, Mohit Iyyer, Danny Bouman, Jordan Boyd-Graber
In Proceedings of NAACL 2015; download PDF here

Coreference is a core NLP problem. However, newswire data, the primary source of existing coreference data, lack the richness necessary to truly solve coreference. We present a new domain with denser references---quiz bowl questions---that is challenging and enjoyable to humans, and we use the quiz bowl community to develop a new coreference dataset, together with an annotation framework that can tag any text data with coreferences and named entities. We also successfully integrate active learning into this annotation pipeline to collect documents maximally useful to coreference models. State-of-the-art coreference systems underperform a simple classifier on our new dataset, motivating non-newswire data for future coreference research.

CODE/DATA: The dataset (400 files with coref labels in CoNLL format, both with gold mentions and CRF mentions) can be downloaded here (~21MB) along with the code.
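For readers who want to inspect the coreference labels programmatically, here is a minimal sketch of reading coreference chains from CoNLL-style lines. It assumes the CoNLL-2011/2012 shared task convention, in which the last whitespace-separated column holds tags such as "(3", "3)", "(3)", or "-" for no mention, joined by "|" when several apply. The exact column layout of the released files is not described on this page, so treat that as an assumption. The function name `read_coref_chains` and the sample lines are hypothetical, made up for illustration.

```python
from collections import defaultdict


def read_coref_chains(lines):
    """Collect mention spans per coreference chain from CoNLL-style lines.

    Assumption: the last column carries CoNLL-2011/2012 style coreference
    tags ("(3", "3)", "(3)", or "-"), joined by "|" when several apply.
    Returns {chain_id: [(start_token, end_token), ...]}, with token indices
    counted from 0 over the non-blank, non-comment lines.
    """
    chains = defaultdict(list)
    open_starts = defaultdict(list)  # chain id -> stack of open mention starts
    token_index = 0
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and "#begin/#end document" markers
        tag = line.split()[-1]
        if tag != "-":
            for part in tag.split("|"):
                chain_id = int(part.strip("()"))
                if part.startswith("("):
                    open_starts[chain_id].append(token_index)
                if part.endswith(")"):
                    start = open_starts[chain_id].pop()
                    chains[chain_id].append((start, token_index))
        token_index += 1
    return dict(chains)


# Hypothetical sample, not taken from the released dataset.
sample = """\
#begin document (example); part 000
doc 0 0 This (0
doc 0 1 author 0)
doc 0 2 wrote -
doc 0 3 Hamlet (1)
doc 0 4 and -
doc 0 5 he (0)
#end document
""".splitlines()

chains = read_coref_chains(sample)
print(chains)
```

Each chain maps to a list of inclusive token spans, so "This author" and "he" end up in the same chain in the sample. A per-chain stack of start positions is used so that nested mentions of the same chain close in the right order.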

Please email any questions about the code or data to

We thank the anonymous reviewers for their insightful comments. We also thank Dr. Hal Daumé III and the members of the "feetthinking" research group for their advice and assistance, and Dr. Yuening Hu and Mossaab Bagdouri for their help in reviewing the draft of this paper. This work was supported by NSF Grant IIS-1320538. Boyd-Graber is also supported by NSF Grants CCF-1018625 and NCSE-1422492. Any opinions, findings, results, or recommendations expressed here are those of the authors and do not necessarily reflect the view of the sponsor.

Our work with quiz bowl coreference data indicates that the performance of coreference resolution models varies when the data they are trained on differ sufficiently in genre from the data in which coreference is to be detected. We therefore try to build models on the various genres present in OntoNotes. We used OntoNotes 5.0 rather than version 4.0, which we had used while writing our paper. Not only is version 5.0 newer, it also includes certain genres of coreference data that were not present in earlier versions.

The report for the results of this experiment can be found here.

	Title = {Removing the Training Wheels: A Coreference Dataset that Entertains Humans and Challenges Computers},
	Author = {Anupam Guha and Mohit Iyyer and Danny Bouman and Jordan Boyd-Graber},
	Booktitle = {North American Association for Computational Linguistics},
	Year = {2015},
	Location = {Denver, Colorado}