PhD Proposal: Beyond Static Tests of Machine Intelligence: Interruptible and Interactive Evaluation of Knowledge

Pedro Rodriguez
08.19.2020 15:00 to 17:00


As humans, we learn about the world by asking questions and test our knowledge by answering questions. These abilities combine aspects of intelligence unique to humans, such as language, knowledge representation, and reasoning. Building systems capable of human-like question answering (QA) is therefore a grand goal of natural language processing, equivalent in ambition to achieving general artificial intelligence. In pursuit of this goal, progress in QA, as in most of machine learning, is measured by issuing "exams" to computer systems and comparing their performance to that of a typical human. Occasionally, these tests take the form of public exhibition matches, as when IBM Watson defeated the best trivia players in the world and when the system described in this proposal likewise defeated decorated trivia players. At the same time, it is clear that modern systems, ours included, are sophisticated pattern matchers. Paradoxically, although our "exams" suggest that machines have surpassed humans, the fact that QA algorithms rely on pattern matching strongly implies that they do not possess human-like QA skills.

One cause of this paradox is that the formats and data used in benchmark evaluations are easily gamed by machines. In this proposal, we show two ways that machines unfairly benefit from these benchmarks: (1) the evaluation format fails to discriminate knowledge with sufficient granularity, and (2) the evaluation data contains patterns easily exploited by pattern-matching models. For example, in Jeopardy! the knowledge of both players is checked at only one point, the end of the question, so knowing the answer earlier is not rewarded. In the first part of this proposal, we introduce an interruptible trivia game, Quizbowl, that incrementally checks knowledge and thus better determines which player knows more. However, this does not address the fact that simple and brittle pattern-matching models best highly accomplished Quizbowl players.
The next part of this proposal describes an interactively constructed dataset of adversarial questions that, by construction, are difficult to answer by pattern matching alone. The incremental and interruptible format, combined with adversarially written questions, compares machine QA models to humans more equitably.

In the final chapter, we introduce two proposed works that aim to improve evaluations on tasks beyond interruptible trivia games. First, we empirically compute the capacity of QA benchmarks to discriminate between two agents as task performance approaches the annotation-noise upper bound. Second, we build on recent work in interactive information seeking and introduce interruptible evaluations for reading comprehension benchmarks. The shared goal of these works is to improve both QA evaluation formats and the data used in those evaluations.

Examining Committee:

Chair: Dr. Jordan Boyd-Graber
Dept rep: Dr. Douglas W. Oard
Members: Dr. Leilani Battle