PhD Proposal: On the Robustness of Large Lange Model pipelines
Large Language Models (LLMs) and retrieval-augmented models form the core of many modern AI systems. As these systems are increasingly integrated into downstream applications, ensuring the robustness of the entire pipeline from training to deployment has become critical. Real-world deployment exposes these models to a range of vulnerabilities, including data poisoning, model degradation, and performance slowdowns. Addressing these challenges is essential to maintain the reliability, security, and efficiency of LLM-powered systems in production environments. This proposal focuses on addressing certain aspects of the vulnerabilities in these systems.
The first part of the proposal addresses training-time vulnerabilities in the LLM alignment stage, particularly those related to backdoor attacks. This is explored through two key contributions. The first analyzes the vulnerabilities of reinforcement learning with human feedback (RLHF), with a specific focus on Direct Preference Optimization (DPO), during the fine-tuning stage of LLMs on both white-box and black-box attack settings. The second contribution introduces a novel attack method, AdvBDGen, which eliminates the need for manual backdoor design.
The second part of the proposal focuses on the robustness of reward models used in the alignment phase. Reward models trained using human preference pairs often experience significant performance degradation on out-of-distribution (OOD) samples, a phenomenon commonly referred to as reward hacking. Prior works have attempted to mitigate this issue through heuristic-based approaches. In contrast, this proposal introduces REFORM, an automated framework that leverages controlled decoding in conjunction with an imperfect reward model to generate class-appropriate OOD samples that expose reward model instabilities.
Finally, the proposal presents a defense mechanism against inference-time threats faced by retrieval systems in LLM pipelines, where we introduce a partition-and-aggregation style defense mechanism to mitigate attacks during the retrieval stage without incurring the computational overhead of retrieval augmented generation. Taken together, this proposal establishes a foundation for future works aimed at reducing the computational cost of retrievers, improving their performance, and enhancing the overall robustness of LLM pipelines against both model degradation and inference-time slowdowns.