PhD Defense: Gathering Language Data Using Experts

Talk
Denis Peskov
Time: 12.16.2021 15:00 to 17:00
Location: IRB 4105

Natural language processing needs substantial data to make robust predictions. Automatic methods, unspecialized crowds, and domain experts can be used to collect conversational and question answering NLP datasets. A hybrid solution that combines domain experts with the crowd generates large-scale, free-form language data.

A low-cost, high-output approach to data creation is automation. We create and analyze a large-scale audio question answering dataset through text-to-speech technology. Additionally, we create synthetic data from templates to identify limitations in machine translation. We conclude that the cost savings and scalability of automation come at the expense of data quality and naturalness.

Human input can provide this degree of naturalness, but is limited in scale. Hence, large-scale data collection is frequently done through crowd-sourcing. A question-rewriting task, in which a long information-gathering conversation is used as source material for many stand-alone questions, shows the limitation of using this methodology for generating data. If left unsupervised, certain users provide low-quality rewrites, such as removing words from the question or copying and pasting the answer into the question. We automatically prevent unsatisfactory submissions with an interface, but the quality control process requires manually reviewing 5,000 questions.

Therefore, we posit that using domain experts for data generation can create novel and reliable NLP datasets. First, we introduce computational adaptation, which adapts, rather than translates, entities across cultures. We work with native speakers in two countries to generate the data, since the gold label for this is subjective and paramount. Furthermore, we hire professional translators to assess our data. Last, in a study on the game of Diplomacy, community members generate a corpus of 17,000 messages that are self-annotated while playing a game about trust and deception. The language varies in length, tone, vocabulary, punctuation, and even the use of emojis. Additionally, we create a real-time self-annotation system that annotates deception in a manner not possible through crowd-sourced or automatic methods. The extra effort in data collection will hopefully ensure the longevity of these datasets and galvanize other novel NLP ideas.

However, experts are expensive and limited in number. Hybrid solutions pair potentially unreliable and unverified users in the crowd with experts. We work with Amazon customer service agents to generate and annotate 81,000 goal-oriented conversations across six domains. Grounding the conversation with a reliable conversationalist (the Amazon agent) creates free-form conversations; using the crowd scales these to the size needed for neural networks.

Examining Committee:

Chair: Dr. Jordan Boyd-Graber
Dean's Representative: Dr. Philip Resnik
Members: Dr. Michelle Mazurek, Dr. Katie Shilton, Dr. John Dickerson