PhD Proposal: Disentangling Internal Factors and Mechanisms for Interpreting and Steering Language Model Behavior
Large language models (LLMs) exhibit increasingly capable behavior across a wide range of tasks, yet our understanding of why these models behave as they do, where they might fail, and, when they do fail, what the underlying causes are, remains limited. This gap between performance and understanding raises concerns, not only for diagnosing model failures but also for reliably intervening on model behavior. Without understanding how internal representations mediate behavior, modifying models through fine-tuning, prompt engineering, or post-training can be imprecise: improvements along one axis may inadvertently degrade performance or robustness along others.
This thesis investigates the internal factors and mechanisms in LLMs that mediate behavior and how these factors can be interpreted, evaluated, and manipulated in a precise manner. A central challenge addressed in this work is that LLM representations often encode multiple overlapping and correlated signals, making it difficult to attribute behavior to specific factors or to intervene on them without inducing unintended side effects. Crucially, current interpretability and intervention methods rely on narrow evaluation settings that do not test whether the identified representations are specific to the target property and stable across contexts. As a result, it is often unclear whether an identified feature or direction reflects a robust internal mechanism or a heuristic specific to a particular dataset and evaluation setting.
This thesis places the disentanglement of internal factors and mechanisms at the center of the design and evaluation of interpretability and intervention methods. We make four core contributions. First, we develop methods for disentangling what is encoded in LLM representations, enabling more targeted interpretability and model steering. Second, we introduce evaluation frameworks that probe the specificity and robustness of interventions beyond in-distribution settings. Third, we study the interactions between tightly coupled internal mechanisms, such as those underlying alignment and jailbreaking, investigating how alignment suppresses or redirects unsafe behaviors and how jailbreaking circumvents these safeguards. Finally, we establish conceptual and empirical foundations for understanding why certain representational structures emerge in LLMs, examining how correlations in training data and training regimes shape both what information is encoded internally and how it is used to generate behavior.
Collectively, these contributions aim to advance a causal understanding of LLM representations: what they capture, why they emerge, how they interact, and how they can be reliably leveraged for precise, robust, and aligned model behavior.