PhD Proposal: Obtaining Fine-Grained Control over Large Language Models

Sicheng Zhu
05.15.2024 13:00 to 14:30

IRB-5105

Aligned large language models (LLMs) can largely follow people's verbal prompts to generate desired content. However, control objectives cannot always be described verbally (for example, certain reward functions), and LLMs' ability to follow instructions is imperfect, as demonstrated by hallucinations and jailbreak vulnerabilities. In such cases, fine-grained control of LLMs can achieve non-verbal objectives, enabling more tasks and strengthening the models' abilities to follow instructions and reason, ultimately maximizing their utility to humans.

In this proposal, we introduce a framework for fine-grained control and instantiate two simple implementations to demonstrate its potential. First, we provide a more unified understanding of controllable generation for LLMs from a sampling perspective, from which we derive a general framework for fine-grained controllable generation. We then instantiate the framework on two specific objectives: generating coherent jailbreak prompts and generating harmless prompts that trigger LLMs' false refusals. We also demonstrate, on a classification task, that allowing models to refine their outputs has the potential to enhance reasoning capabilities. Finally, we outline the steps required to fully implement this framework, aiming to ultimately achieve fine-grained control over LLMs.