PhD Proposal: Beyond Static Tests of Machine Intelligence: Interruptible and Interactive Evaluation of Knowledge

Talk
Pedro Rodriguez
Time: 08.19.2020 15:00 to 17:00
Location: Remote

As humans, we learn about the world by asking questions and test our knowledge by answering questions. These abilities combine aspects of intelligence unique to humans, such as language, knowledge representation, and reasoning. Building systems capable of human-like question answering (QA) is therefore a grand goal of natural language processing, equivalent in ambition to achieving general artificial intelligence. In pursuit of this goal, progress in QA, as in most of machine learning, is measured by issuing “exams” to computer systems and comparing their performance to that of a typical human. Occasionally, these tests take the form of public exhibition matches, as when IBM Watson defeated the best trivia players in the world and when the system we describe in this proposal likewise defeated decorated trivia players. At the same time, it is clear that modern systems (ours included) are sophisticated pattern matchers. Paradoxically, although our “tests” suggest that machines have surpassed humans, the fact that QA algorithms rely on pattern matching strongly implies that they do not possess human-like QA skills.

One cause of this paradox is that the formats and data used in benchmark evaluations are easily gamed by machines. In this proposal, we show two ways that machines unfairly benefit from these benchmarks: (1) the format of evaluation fails to discriminate knowledge with sufficient granularity, and (2) the evaluation data contain patterns easily exploited by pattern-matching models. For example, in Jeopardy! the knowledge of both players is checked at only one point, the end of the question, so knowing the answer earlier is not rewarded. In the first part of this proposal, we introduce an interruptible trivia game, Quizbowl, that incrementally checks knowledge and thus better determines which player knows more. However, this format alone does not address the fact that simple, brittle pattern-matching models can best highly accomplished Quizbowl players. The next part of the proposal describes an interactively constructed dataset of adversarial questions that are, by construction, difficult to answer by pattern matching alone. The incremental and interruptible format, combined with adversarially written questions, compares machine QA models to humans more equitably.

In the final chapter, we introduce two proposed works that aim to improve evaluations on tasks beyond interruptible trivia games. First, we empirically compute the capacity of QA benchmarks to discriminate between two agents as task performance approaches the annotation-noise upper bound. Second, we build on recent work in interactive information seeking and introduce interruptible evaluations in reading comprehension benchmarks. The shared goal of these works is to improve both QA evaluation formats and the data used in those evaluations.

Examining Committee:

Chair: Dr. Jordan Boyd-Graber
Dept. Representative: Dr. Douglas W. Oard
Members: Dr. Leilani Battle