Interpretability as the Inverse Machine Learning Pipeline

Talk
Sarah Wiegreffe
Time: 11.14.2025 11:00 to 12:00

Language models (LMs) power a rapidly growing and increasingly impactful suite of AI technologies. However, due to their scale and complexity, we lack a fundamental scientific understanding of much of LMs’ behavior, even when they are open source. In this talk, I will describe some of our recent work on interpreting LMs through the lens of the classical machine learning pipeline. This includes 1) working backwards from behavioral analysis and explanation generation as a form of model evaluation, 2) interpreting model internals post-training, 3) understanding model training dynamics, and ultimately 4) attributing model behavior back to the training data, with the goal of building better training corpora for future LMs.