PhD Proposal: Measuring What Matters in Trustworthy AI: From Certified Robustness to Agentic Safety
As AI systems move into security-critical roles, trustworthiness must be established under the conditions that matter in deployment -- adaptive adversaries and low tolerance for rare but high-impact failures. My work advances a measurement-first agenda for trustworthy AI that pairs two complementary approaches: certified guarantees for well-specified threat models, and realistic adversarial evaluation for complex generative and agentic systems where formal guarantees remain incomplete.

On the certified side, DRSM develops a de-randomized smoothing methodology for malware detection that provides formal robustness certificates, enabling security assurances beyond empirical accuracy for safety-critical classification. On the empirical side, adversarial methods and benchmarks stress-test modern GenAI along the failure modes that dominate real-world risk: efficient red-teaming attacks against aligned language models, detection of hallucinations and unreliable outputs, and robust evaluation of AI-authorship signals under AI polishing and adversarial paraphrasing -- particularly in low false-positive operating regimes.

Building on these foundations, my forward direction targets agentic safety for tool-using AI agents that act over multiple steps, where harm and reliability are defined by downstream outcomes rather than single-turn text. The goal is to develop evaluation principles that remain effective through planning, tool interaction, memory, and environment feedback -- linking provable robustness and deployment-aligned testing into a unified framework for secure, reliable AI.
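To make the certified-robustness idea concrete, the sketch below illustrates the general recipe behind chunk-based de-randomized smoothing with majority voting, which DRSM builds on for malware detection: classify fixed-size chunks of the input independently, take a majority vote, and read a robustness certificate off the vote margin. The function names, chunk size, and stub classifier are assumptions for this illustration, not the exact DRSM construction.

```python
from collections import Counter

def derandomized_smoothing_predict(byte_seq, base_classifier, chunk_size=512):
    """Classify each fixed-size chunk of byte_seq independently, aggregate by
    majority vote, and return the voted label together with a simple certificate:
    the number of adversarially modified chunks the vote provably tolerates."""
    chunks = [byte_seq[i:i + chunk_size] for i in range(0, len(byte_seq), chunk_size)]
    votes = Counter(base_classifier(c) for c in chunks)
    (top_label, top_count), *rest = votes.most_common()
    runner_up = rest[0][1] if rest else 0
    # Each corrupted chunk can at most remove one vote from the top class and
    # add one to the runner-up, so the prediction is stable as long as the
    # adversary controls fewer than half of the vote gap (ties broken
    # conservatively against the top class).
    certified_chunks = max(0, (top_count - runner_up - 1) // 2)
    return top_label, certified_chunks

# Toy demonstration with a stub chunk classifier (a real system would use a
# trained byte-level model); labels: 0 = benign, 1 = malicious.
stub = lambda chunk: int(sum(chunk) % 2 == 0)
label, radius = derandomized_smoothing_predict(bytes(range(256)) * 8, stub)
print(label, radius)
```

The certificate is purely combinatorial: no matter how the adversary rewrites the certified number of chunks, the majority vote, and hence the smoothed prediction, cannot change.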
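As an illustration of the low false-positive operating regime emphasized above, the following minimal sketch reports a detector's true-positive rate at a fixed, small false-positive rate rather than an aggregate score such as accuracy or AUC. The function name, score arrays, and thresholding convention (higher score means "more likely AI-generated") are assumptions for this example, not part of the proposed benchmarks.

```python
import numpy as np

def tpr_at_fpr(scores_negative, scores_positive, target_fpr=0.01):
    """Choose the decision threshold so that at most target_fpr of negative
    (e.g., human-written) samples are flagged, then report the true-positive
    rate on positive (e.g., AI-generated) samples at that threshold."""
    neg = np.sort(np.asarray(scores_negative, dtype=float))
    threshold = np.quantile(neg, 1.0 - target_fpr)
    realized_fpr = float(np.mean(neg > threshold))
    tpr = float(np.mean(np.asarray(scores_positive, dtype=float) > threshold))
    return threshold, realized_fpr, tpr

# Synthetic demonstration: higher detector scores for AI-generated text.
rng = np.random.default_rng(0)
human = rng.normal(0.0, 1.0, 10_000)    # scores on human-written text
machine = rng.normal(2.0, 1.0, 10_000)  # scores on AI-generated text
print(tpr_at_fpr(human, machine, target_fpr=0.01))
```

Reporting performance at a fixed low false-positive rate reflects deployment cost: falsely accusing a human author (or flagging a reliable output) is far more damaging than missing some positives.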