PhD Defense: Towards Private and Trustworthy Generative Models
IRB-5137 https://umd.zoom.us/j/4139538360?pwd=gb0KNkfRLeWiiDGqobcr4aPEeX0cuY.1&om... (Passcode: umd)
The rapid advancement of Generative AI has delivered transformative capabilities, yet it also introduces critical challenges to security and trustworthiness. As these models grow more powerful, ensuring their responsible deployment demands multifaceted solutions. In this proposal, we address three key dimensions of this challenge: 1) attributing generated content through robust, imperceptible watermarks, 2) revealing adversarial vulnerabilities via automated red-teaming, and 3) detecting and mitigating unintended memorization.
First, we introduce Tree-Ring Watermarks, a method for embedding invisible yet robust fingerprints into diffusion-generated images, enabling reliable provenance tracking without compromising visual quality. This approach produces semantic watermarks that remain detectable even after common image perturbations, offering a practical tool for accountability in open-generation settings.
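At a high level, the watermark lives in the Fourier spectrum of the diffusion model's initial noise: a ring-shaped key is written into the low-frequency region before sampling, and detection inverts the image under test back to its initial noise (e.g., via DDIM inversion) and checks for the key. Below is a minimal PyTorch sketch of this idea; the latent size, ring radii, key, and detection threshold are illustrative placeholders rather than the exact values used in the method.

```python
# Minimal sketch of a Tree-Ring-style watermark on a diffusion latent.
import torch

def ring_mask(size=64, r_inner=4, r_outer=14):
    # Boolean mask selecting a ring of low frequencies around the FFT center.
    y, x = torch.meshgrid(torch.arange(size), torch.arange(size), indexing="ij")
    dist = torch.sqrt(((x - size // 2) ** 2 + (y - size // 2) ** 2).float())
    return (dist >= r_inner) & (dist <= r_outer)

def embed_watermark(init_noise, key, mask):
    # Write the key into the masked Fourier coefficients of the initial noise,
    # then return the (real) spatial-domain noise used to start sampling.
    freq = torch.fft.fftshift(torch.fft.fft2(init_noise), dim=(-2, -1))
    freq[..., mask] = key  # key: complex tensor broadcastable to freq[..., mask]
    return torch.fft.ifft2(torch.fft.ifftshift(freq, dim=(-2, -1))).real

def detect_watermark(recovered_noise, key, mask, threshold=50.0):
    # `recovered_noise` comes from inverting the image under test back to its
    # initial noise (e.g., DDIM inversion); compare the masked spectrum to the key.
    freq = torch.fft.fftshift(torch.fft.fft2(recovered_noise), dim=(-2, -1))
    distance = torch.abs(freq[..., mask] - key).mean()
    return distance.item() < threshold
```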
Next, we present Hard Prompts Made Easy (PEZ), a gradient-based discrete optimization method that automates the discovery of adversarial prompts, revealing vulnerabilities in safety-aligned models. This work provides a systematic framework for auditing content filters and alignment mechanisms.
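The core of the optimization is a projected-gradient loop over prompt embeddings: a continuous "soft" prompt is updated with gradients, but each forward pass uses its nearest-neighbor projection onto the model's vocabulary, so the final result is a genuine discrete prompt. The sketch below illustrates this loop; `embed_table` (the model's token-embedding matrix) and `loss_fn` (e.g., a CLIP similarity or another attack objective) are placeholders, and the hyperparameters are illustrative.

```python
# Minimal sketch of PEZ-style hard prompt optimization.
import torch

@torch.no_grad()
def nearest_tokens(soft_embeds, embed_table):
    # Project each continuous embedding onto its nearest vocabulary embedding.
    dists = torch.cdist(soft_embeds, embed_table)  # (num_tokens, vocab_size)
    ids = dists.argmin(dim=-1)
    return ids, embed_table[ids]

def optimize_hard_prompt(embed_table, loss_fn, num_tokens=8, steps=500, lr=0.1):
    # Initialize from random vocabulary embeddings; keep a continuous copy to optimize.
    ids = torch.randint(0, embed_table.shape[0], (num_tokens,))
    soft = embed_table[ids].detach().clone().requires_grad_(True)
    opt = torch.optim.Adam([soft], lr=lr)
    for _ in range(steps):
        _, hard = nearest_tokens(soft, embed_table)
        # Forward pass uses the projected (hard) prompt; the gradient flows back
        # to the continuous `soft` copy via a straight-through estimator.
        prompt = hard + (soft - soft.detach())
        loss = loss_fn(prompt)
        opt.zero_grad()
        loss.backward()
        opt.step()
    ids, _ = nearest_tokens(soft, embed_table)
    return ids  # discrete token ids of the recovered hard prompt
```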
We then propose RL-Hammer, a simple reinforcement learning recipe for training an attacker model to craft powerful prompt-injection attacks. We introduce a set of training techniques that handle the extremely sparse reward signals that arise when optimizing the attacker against prompt-injection defenses. Our results demonstrate that even state-of-the-art LLM agents remain vulnerable to such attacks.
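To make the sparsity concrete, the reward the attacker sees is essentially binary: it is nonzero only when the defended target agent actually carries out the injected action, which happens rarely early in training. The sketch below is an illustrative reward function under that assumption, not the paper's exact recipe; `run_target_agent` and `injected_action` are hypothetical placeholders.

```python
# Illustrative sparse, binary attack-success reward for an RL-trained attacker.
def attack_success_reward(injection: str, run_target_agent, injected_action: str) -> float:
    # Roll out the defended target agent with the attacker's injection placed in its
    # environment (e.g., inside a tool output or retrieved document) and collect the
    # actions/tool calls the agent executed.
    executed_actions = run_target_agent(injection)
    # Extremely sparse signal: 1.0 only if the injected action was actually executed.
    return 1.0 if injected_action in executed_actions else 0.0
```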
Finally, we tackle memorization in diffusion models with Detecting, Explaining, and Mitigating Memorization, a framework that localizes and mitigates data replication without requiring access to the training data. Our methods detect training-data regurgitation in generated content and reduce the associated privacy risks while preserving model utility.
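One signal the detection builds on is the strength of the text-conditional guidance: for memorized prompts, the gap between the conditional and unconditional noise predictions is abnormally large and consistent across seeds and timesteps. The sketch below computes such a score for a Stable-Diffusion-style pipeline, assuming `unet` is a diffusers UNet2DConditionModel and `text_emb` / `uncond_emb` are the prompt and empty-prompt encodings; the single-timestep call is illustrative (in practice the signal is aggregated over timesteps and random seeds).

```python
# Minimal sketch of a memorization score based on the conditional/unconditional gap.
import torch

@torch.no_grad()
def memorization_score(unet, latents, t, text_emb, uncond_emb):
    # Gap between the text-conditional and unconditional noise predictions;
    # memorized prompts tend to produce an abnormally large, consistent gap.
    eps_cond = unet(latents, t, encoder_hidden_states=text_emb).sample
    eps_uncond = unet(latents, t, encoder_hidden_states=uncond_emb).sample
    gap = (eps_cond - eps_uncond).flatten(start_dim=1)
    return gap.norm(dim=1).mean().item()
```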
Together, these contributions advance the trustworthiness of generative AI through novel techniques for attribution, evaluation, and risk mitigation, marking critical steps toward responsible deployment in real-world applications.