AI Alignment with Changing and Influenceable Reward Functions
Tuesday, April 30, 2024, 6:30-7:30 pm
Registration requested: The organizer of this talk requests that you register if you are planning to attend.

Abstract

Current AI alignment techniques treat human preferences as static and model them via a single reward function. However, our preferences change, making the goal of alignment ambiguous: should AI systems act in the interest of our current, past, or future selves? The behavior of AI systems may also influence our preferences, meaning that notions of alignment must also specify which kinds of influence are, and are not, acceptable. The answers to these questions are left undetermined by the current AI alignment paradigm, making it ill-posed. To ground formal discussions of these issues, we introduce Dynamic Reward MDPs (DR-MDPs), which extend MDPs to allow the reward function to change and be influenced by the agent. Using the lens of DR-MDPs, we demonstrate that agents resulting from current alignment techniques will have incentives for influence: that is, they will systematically attempt to shift our future preferences to make them easier to satisfy. We also investigate how one may avoid undesirable influence by adjusting the optimization horizon or by using different DR-MDP optimization objectives that correspond to alternative notions of alignment. Broadly, our work highlights the unintended consequences of applying current alignment techniques to settings with changing and influenceable preferences, and describes the challenges that must be overcome to develop a more general AI alignment paradigm that can accommodate such settings.
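
To make the formalism concrete, here is one minimal way such an object might be written down (an illustrative sketch in our own notation, not necessarily the definition used in the talk or paper): a DR-MDP extends the standard MDP tuple with a set of reward-function parameters that evolve alongside the state,

\[
\mathcal{M} \;=\; \langle S,\, A,\, \Theta,\, T,\, \{ r_\theta \}_{\theta \in \Theta},\, \gamma \rangle,
\qquad
(s_{t+1}, \theta_{t+1}) \;\sim\; T(\,\cdot \mid s_t, \theta_t, a_t\,),
\]

where $S$ and $A$ are the state and action spaces, $\Theta$ indexes the possible reward functions, $r_\theta : S \times A \to \mathbb{R}$ is the reward under parameter $\theta$, and $\gamma$ is a discount factor. Because the joint transition $T$ depends on the action $a_t$, the agent's behavior can influence which reward function $\theta_{t+1}$ it will be evaluated against next; a standard MDP is recovered when $\Theta$ is a singleton, so the reward can neither change nor be influenced.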

Bio

Micah Carroll is an AI PhD student at UC Berkeley, advised by Professors Anca Dragan and Stuart Russell. Originally from Italy, Micah graduated with a Bachelor’s in Statistics from Berkeley in 2019. He has worked at Microsoft Research and at the Center for Human-Compatible AI (CHAI). His research interests lie in human-AI systems, in particular measuring the effects of social media on users and improving techniques for human modeling and human-AI collaboration. You can find him on his website or on Twitter.

Note: Please register using the Google Form on our website (https://go.umd.edu/marl) for access to the Google Meet link, the Open-source Multi-Agent AI Research Community, and talk resources.

This talk is organized by Saptarashmi Bandyopadhyay.