PhD Proposal: Graded Judgments of Plausibility in Commonsense Reasoning

Talk
Shramay Palta
Time: 05.13.2025 10:00 to 12:00
Location: Zoom (https://umd.zoom.us/j/93465250117)
Commonsense reasoning about day-to-day situations often involves making soft judgments about the relative "likelihood" or "plausibility" of various possible outcomes. In sharp contrast to factoid question answering or mathematical reasoning, commonsense reasoning about an everyday situation admits multiple plausible answers. One reason for this is the inherent uncertainty surrounding everyday situations, arising, for instance, from cultural or contextual differences. This graded nature of commonsense reasoning can leave multiple possible answers contending to be the "most plausible" answer to a question.
Researchers have widely adopted the Multiple Choice Question (MCQ) format to evaluate commonsense knowledge acquisition in LLMs, and it has proven effective across a wide range of benchmarks such as CommonsenseQA, Physical Interaction: Question Answering (PIQA), and Social IQa. The benefits of the format are clear: with a single correct answer, model scores on a task are easy to compute and interpret. This makes MCQs an attractive choice for evaluating tasks with an objectively correct answer, such as factoid question answering or mathematical reasoning. However, given the uncertain nature of commonsense reasoning, where multiple answers may seem likely, the assumption of a uniquely correct answer can limit our understanding of the commonsense reasoning capabilities of LLMs.
In this thesis proposal, we explore how the graded nature of commonsense reasoning can impede effective LLM evaluation, cause shifts in human and model beliefs, and lead to downstream impacts such as unintended biases.
First, we introduce a new plausibility-rating framework in which we rate the plausibility of individual answer choices for questions drawn from two popular commonsense reasoning benchmarks. Through this rating procedure, we observe that in over 20% of cases the dataset's gold label and the most plausible answer choice do not align. A manual analysis of this subset also reveals issues such as ambiguity and answer choices that do not fit the question. We further show that MCQs with a small difference in mean plausibility scores between the most plausible and second-most plausible answer choices are more likely to exhibit low agreement when human annotators select the best answer choice.
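As a toy illustration of this gap computation (the rating scale, data format, and values below are hypothetical, not the actual annotation data):

```python
# Toy sketch of the plausibility-gap analysis: given per-annotator plausibility
# ratings for each answer choice of one MCQ, compute per-choice means and the
# gap between the most and second-most plausible choices. Values are illustrative.
import numpy as np

ratings = {              # choice label -> annotator ratings on a hypothetical 1-5 scale
    "A": [4, 5, 4, 5],
    "B": [4, 4, 5, 4],   # nearly as plausible as A
    "C": [1, 2, 1, 2],
}

means = {choice: float(np.mean(r)) for choice, r in ratings.items()}
top, second = sorted(means.values(), reverse=True)[:2]
# A small top-2 gap flags questions where annotators are likely to disagree
# about the single "best" answer.
print(f"mean plausibility: {means}, top-2 gap = {top - second:.2f}")
```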
Second, we study the impact of LLM-generated rationales on human notions of plausibility in day-to-day commonsense reasoning situations. By presenting plausibility arguments both for and against an answer choice from a commonsense reasoning MCQ, we show that both human and LLM plausibility judgments are significantly affected by the inclusion of these rationales, relative to their absence. Our findings highlight the potentially persuasive nature of LLM-generated rationales and the extent of their creativity in justifying how a situation might be plausible.
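A minimal sketch of how such a shift could be quantified, assuming paired ratings of the same answer choices collected with and without a rationale (the values and the choice of a Wilcoxon signed-rank test are illustrative, not the study's actual protocol):

```python
# Hypothetical sketch: quantify how much plausibility ratings move when an
# LLM-generated rationale accompanies the answer choice, using paired per-item means.
import numpy as np
from scipy.stats import wilcoxon

ratings_without = np.array([3.2, 4.1, 2.8, 3.9, 4.4])  # toy per-item mean ratings
ratings_with    = np.array([3.8, 4.5, 3.5, 4.0, 4.6])  # same items, rationale shown

shift = ratings_with - ratings_without
stat, p = wilcoxon(ratings_with, ratings_without)       # paired, non-parametric test
print(f"mean shift = {shift.mean():.2f}, Wilcoxon p = {p:.3f}")
```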
Finally, to understand the downstream implications of these uncertainties, we introduce FORK, a new dataset of questions about culinary cultures and customs, and use it to show that language models exhibit systematic cultural biases favoring US over non-US cultures. FORK is curated in the MCQ format, and a systematic evaluation of several encoder-based models on it highlights how these models favor US cultures over non-US ones.
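One common way to score MCQ answer choices with an encoder-based model is pseudo-log-likelihood, masking one token at a time; the sketch below is illustrative (the model choice, prompt format, and example question are assumptions, not the exact FORK evaluation setup):

```python
# Hypothetical sketch: score each (question, choice) pair with a masked LM via
# pseudo-log-likelihood and pick the highest-scoring choice.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForMaskedLM.from_pretrained("roberta-base")
model.eval()

def pseudo_log_likelihood(text: str) -> float:
    enc = tokenizer(text, return_tensors="pt")
    input_ids = enc["input_ids"][0]
    total = 0.0
    # Mask each non-special token in turn and sum its log-probability.
    for i in range(1, input_ids.size(0) - 1):
        masked = input_ids.clone().unsqueeze(0)
        masked[0, i] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(masked, attention_mask=enc["attention_mask"]).logits
        log_probs = torch.log_softmax(logits[0, i], dim=-1)
        total += log_probs[input_ids[i]].item()
    return total

question = "In the US, which utensil is most commonly used to eat rice?"  # illustrative
choices = ["a fork", "chopsticks", "bare hands"]
scores = {c: pseudo_log_likelihood(f"{question} {c}") for c in choices}
print(max(scores, key=scores.get), scores)
```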
Having demonstrated how uncertainty affects the commonsense reasoning evaluation of LLMs and can cause unintended effects, our proposed work aims to further investigate systematic variations in LLM plausibility judgments across different rating techniques. We also seek to understand whether these inconsistencies align with known human cognitive biases, such as framing effects and anchoring. Finally, we propose to train a preference model with Direct Preference Optimization (DPO) to better align with human plausibility judgments, leveraging both LLM and human-validated comparisons.
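As a rough sketch of the optimization target we have in mind, the snippet below gives a minimal per-pair DPO loss in PyTorch (argument names and the beta value are illustrative, not a finalized training recipe):

```python
# Minimal sketch of the Direct Preference Optimization (DPO) loss on a batch of
# preference pairs. Each argument is the summed log-probability of a response
# under the trainable policy or the frozen reference model.
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_chosen: torch.Tensor,
             policy_logp_rejected: torch.Tensor,
             ref_logp_chosen: torch.Tensor,
             ref_logp_rejected: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    # Implicit reward margins: how far the policy has moved from the reference
    # on the chosen vs. rejected response.
    chosen_margin = policy_logp_chosen - ref_logp_chosen
    rejected_margin = policy_logp_rejected - ref_logp_rejected
    # Standard DPO objective: -log sigmoid(beta * (chosen margin - rejected margin)).
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()
```

Here the "chosen" and "rejected" responses would come from the LLM and human-validated plausibility comparisons described above.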