PhD Proposal: Disentangling Internal Factors and Mechanisms for Interpreting and Steering Language Model Behavior
Large language models (LLMs) exhibit increasingly capable behavior across a wide range of tasks, yet our understanding of why these models behave as they do, where they might fail, and, when they do fail, what the underlying causes are, remains limited. This gap between performance and understanding raises concerns, not only for diagnosing model failures but also for reliably intervening on model behavior. Without understanding how internal representations mediate behavior, modifying models through fine-tuning, prompt engineering, or post-training can be imprecise: improvements along one axis may inadvertently degrade performance or robustness along others.
This thesis investigates the internal factors and mechanisms in LLMs that mediate behavior and how these factors can be interpreted, evaluated, and manipulated in a precise manner. A central challenge addressed in this work is that LLM representations often encode multiple overlapping and correlated signals, making it difficult to attribute behavior to specific factors or to intervene on them without inducing unintended side effects. Crucially, current interpretability and intervention methods rely on narrow evaluation settings that do not test whether the identified representations are specific to the target property and stable across contexts. As a result, it is often unclear whether an identified feature or direction reflects a robust internal mechanism or a heuristic specific to a particular dataset and evaluation setting.
This thesis places the disentanglement of internal factors and mechanisms at the center of the design and evaluation of interpretability and intervention methods. We make four core contributions. First, we develop methods for disentangling what is encoded in LLM representations, enabling more targeted interpretability and model steering. Second, we introduce evaluation frameworks that probe the specificity and robustness of interventions beyond in-distribution settings. Third, we study the interactions between tightly coupled internal mechanisms, such as those underlying alignment and jailbreaking, investigating how alignment suppresses or redirects unsafe behaviors and how jailbreaking circumvents these safeguards. Finally, we establish conceptual and empirical foundations for understanding why certain representational structures emerge in LLMs, examining how correlations in training data and training regimes shape both what information is encoded internally and how it is used to generate behavior.
Collectively, these contributions aim to advance a causal understanding of LLM representations: what they capture, why they emerge, how they interact, and how they can be reliably leveraged for precise, robust, and aligned model behavior.