PhD Defense: From Measurement to Discovery: Comparing Text in Computational Social Science

Talk
Paiheng Xu
Time: 02.04.2026, 10:30 to 12:00

Computational social science increasingly uses text as evidence to understand social phenomena, relying on natural language processing (NLP) methods to support both measurement and discovery. A recurring form of question is comparative: how does language vary across groups of texts, where groups may be defined by populations, time periods, or concept-driven partitions? Answering such questions requires methodological decisions that connect technical choices to social constructs and theory. In measurement, the goal is to convert text into quantitative representations of well-specified constructs to support hypothesis testing. In discovery, the goal is to surface candidate patterns from large corpora and decide which differences are meaningful.
My first completed study exemplifies the measurement mode, investigating how well current NLP methods can measure high-inference instructional quality. Using well-established education rubrics, we evaluate pretrained language models (PLMs) on math instruction across two settings: K–12 classroom transcripts and simulated teaching tasks for pre-service teachers. The results reveal a construct-linked pattern: models perform best on variables requiring less pedagogical inference and struggle with variables demanding deeper interpretation—mirroring where human raters also disagree more. This work demonstrates how measurement choices must be informed by both theoretical constructs and practical constraints in data collection.
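To make the measurement setup concrete, here is a minimal sketch, in Python, of how predicted rubric scores can be compared against human ratings; the scores, the 1-3 scale, and the choice of Spearman correlation and quadratic weighted kappa are illustrative assumptions, not the dissertation's actual evaluation protocol.

    # Minimal sketch (illustrative data, not the dissertation's code): agreement
    # between PLM-predicted rubric scores and human ratings for one
    # instructional-quality variable.
    from scipy.stats import spearmanr
    from sklearn.metrics import cohen_kappa_score

    human_scores = [1, 2, 3, 2, 1, 3, 2, 2]   # hypothetical human rubric ratings (1-3 scale)
    model_scores = [1, 2, 2, 2, 1, 3, 3, 2]   # hypothetical PLM predictions for the same segments

    # Rank correlation: does the model order segments the way human raters do?
    rho, p = spearmanr(human_scores, model_scores)

    # Quadratic weighted kappa penalizes larger ordinal disagreements more heavily,
    # a common agreement metric for rubric-style scores.
    qwk = cohen_kappa_score(human_scores, model_scores, weights="quadratic")

    print(f"Spearman rho = {rho:.2f} (p = {p:.3f}), QWK = {qwk:.2f}")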
My second completed study centers on discovery, analyzing how geographic co-location shapes public health conversations during COVID-19. We operationalize co-location between public health experts (PHEs) and participants, model engagement patterns, and characterize the linguistic differences associated with higher engagement. The findings show that co-located conversations generate higher engagement, are especially associated with sharing personal experiences, and become more positive and personal when PHEs share personal experiences or feelings. However, this study also reveals a fundamental limitation: traditional text analysis tools require researchers to specify features up front, constraining the hypothesis space to what those tools can represent.
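The following is a minimal sketch of the kind of engagement model this analysis implies, assuming hypothetical variables (reply counts, a co-location indicator, and a follower-count control); it illustrates the general approach, not the study's actual specification.

    # Minimal sketch (hypothetical variable names and values, not the study's
    # actual model): relating reply counts to expert-participant co-location
    # while adjusting for a simple control.
    import pandas as pd
    import statsmodels.api as sm
    import statsmodels.formula.api as smf

    df = pd.DataFrame({
        "replies":       [3, 0, 5, 1, 8, 2, 6, 0, 4, 7],      # engagement per conversation
        "co_located":    [1, 0, 1, 0, 1, 0, 1, 0, 1, 1],      # PHE and participant share a location
        "log_followers": [3.2, 4.1, 2.8, 5.0, 3.5, 4.4, 3.0, 4.8, 3.3, 2.9],  # audience-size control
    })

    # Count outcome, so a Poisson GLM; the co_located coefficient estimates the
    # association between co-location and engagement after the control.
    fit = smf.glm("replies ~ co_located + log_followers",
                  data=df, family=sm.families.Poisson()).fit()
    print(fit.summary())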
My proposed work addresses this constraint through a framework for conditional hypothesis generation using large language models (LLMs). While LLMs enable discovery by generating natural language hypotheses about how corpora differ, methods that prioritize separability risk surfacing patterns driven by correlated factors rather than the comparison of interest. The proposed framework steers the direction of discovery by incorporating researcher-specified covariates, generating conditional differences that remain informative after accounting for relevant contextual factors. This work will provide both controlled evaluations using synthetic datasets and real-world case studies demonstrating how conditioning shapes the patterns discovered and supports more principled dataset explanation.
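As a rough illustration of the conditioning idea, the sketch below checks whether a candidate difference surfaced by an LLM still distinguishes two corpora once a researcher-specified covariate is included; all column names and values are hypothetical, and the proposed framework is not committed to this particular regression-based check.

    # Minimal sketch of the conditioning idea (hypothetical columns, not the
    # proposed framework's implementation): test whether an LLM-generated
    # hypothesis still separates the two corpora after adjusting for a
    # researcher-specified covariate.
    import pandas as pd
    import statsmodels.formula.api as smf

    df = pd.DataFrame({
        "group":     [1, 1, 1, 1, 1, 0, 0, 0, 0, 0],                          # which corpus each text came from
        "hyp_score": [0.9, 0.7, 0.8, 0.4, 0.6, 0.3, 0.2, 0.5, 0.1, 0.4],      # LLM judge: does the text match the hypothesis?
        "topic":     [1, 1, 0, 1, 0, 1, 0, 1, 0, 1],                          # covariate the researcher wants to hold fixed
    })

    unconditional = smf.logit("group ~ hyp_score", data=df).fit(disp=0)
    conditional = smf.logit("group ~ hyp_score + topic", data=df).fit(disp=0)

    # If the hyp_score coefficient shrinks toward zero once the covariate enters,
    # the candidate hypothesis mostly reflects the covariate rather than the
    # comparison of interest.
    print(unconditional.params["hyp_score"], conditional.params["hyp_score"])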