PhD Defense: Principled Frameworks for AI Alignment: From Post-Training to Inference

Talk
Souradip Chakraborty
Time: 
04.15.2026 09:00 to 11:00
Location: 

Artificial intelligence (AI) is increasingly deployed in high-stakes settings such as healthcare, robotics, defense, and law. As these systems grow more capable and autonomous, it becomes essential to ensure that their behavior remains aligned with human preferences. This challenge has made AI alignment a central problem in modern AI.
AI alignment can be broadly pursued through two fundamentally different paradigms: (i) post-training alignment, where model parameters are updated after pretraining to better reflect desired behaviors, and (ii) inference-time alignment, where model behavior is steered at test time without modifying model parameters. While both paradigms aim to improve alignment, they present distinct challenges and opportunities. This thesis advances AI alignment along both directions and is organized into two main parts:
Part I: Post-training AI Alignment: This part focuses on improving alignment through parameter updates. In particular, it addresses fundamental challenges in online alignment with human feedback, as well as the limitations of existing formulations in capturing diverse and conflicting human preferences. (i) Distributional mismatch in online alignment: We first identify a key limitation of RLHF, namely its inability to capture the entanglement between reward learning and policy optimization, which leads to distribution shift and suboptimal alignment. We propose a novel bilevel alignment framework that explicitly models this interdependence, enabling more stable and theoretically grounded learning. (ii) Pluralistic alignment with diverse preferences: We then study pluralistic alignment, showing that single-utility RLHF is fundamentally insufficient to represent diverse and conflicting preferences. To address this, we introduce MaxMin RLHF, inspired by principles from social choice theory, which ensures more equitable alignment across users. Together, these contributions provide a principled foundation for robust and inclusive post-training alignment.
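As a purely illustrative sketch (the notation below is assumed for exposition and is not quoted from the thesis), the MaxMin idea can be read as maximizing the worst-case expected reward across user groups, with the usual KL regularization toward a reference policy:

\[
\max_{\pi_\theta} \; \min_{h \in \mathcal{H}} \;
\mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}\bigl[ r_h(x, y) \bigr]
\;-\; \beta\, \mathrm{KL}\bigl( \pi_\theta \,\|\, \pi_{\mathrm{ref}} \bigr),
\]

where \( \mathcal{H} \) indexes user groups, \( r_h \) is the reward model learned from group \( h \)'s preferences, and \( \pi_{\mathrm{ref}} \) is the reference (pretrained or supervised fine-tuned) policy.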
Part II: Inference-time AI Alignment: In contrast to post-training methods, inference-time alignment enables flexible and efficient adaptation by directly steering the generation process at test time, without updating model parameters, allowing real-time personalization at low cost. This part develops a unified framework for controlling model behavior during decoding to achieve both efficiency and robustness. We first introduce Transfer Q*, a principled controlled-decoding algorithm that leverages aligned base models to estimate optimal value functions for new tasks, enabling provably efficient and high-quality alignment. Building on this, we propose IMMUNE, which incorporates safety constraints directly into the decoding process to defend against jailbreak and adversarial prompts while preserving user intent. We further extend this paradigm to a multi-agent setting, where a mixture of specialized agents is coordinated via an implicit Q-function to enable adaptive policy switching and improved performance across diverse tasks. Finally, we move beyond standard instruction-tuned models to large reasoning models (LRMs) and investigate efficient test-time scaling strategies for improving their performance.
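As a rough, illustrative sketch (the notation here is assumed for exposition, not quoted from the thesis), controlled decoding of this kind can be viewed as re-weighting the reference model's next-token distribution by an estimated optimal value function:

\[
\pi^*(z \mid s_t) \;\propto\; \pi_{\mathrm{ref}}(z \mid s_t)\,
\exp\!\Bigl( \tfrac{1}{\beta}\, Q^*(s_t, z) \Bigr),
\]

where \( s_t \) is the partially generated response, \( z \) a candidate next token, and \( Q^* \) the optimal action-value under the target reward; the contributions above can be seen as differing in how \( Q^* \) is estimated (for example, transferred from an already aligned base model) and in which reward or safety constraint it encodes.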
Together, these two parts provide a unified view of AI alignment: one improves alignment by modifying model parameters, while the other does so by controlling how models are used at inference time. Across both settings, this thesis develops principled algorithms, theoretical insights, and practical methods for building AI systems that are more robust, adaptive, safe, and aligned with human goals.