PhD Proposal: Safety, Robustness and Reliability of AI

Talk
Gaurang Sriramanan
Time: 05.20.2025 10:00 to 12:00
Location: 

Over the past few years, rapid advancements in Artificial Intelligence (AI) have achieved quantum leaps in performance across a wide range of domains, including computer vision and natural language understanding. Given the widespread deployment of AI systems in safety-critical domains such as autonomous navigation, medical diagnosis and surveillance, it is imperative to explore and characterize their vulnerabilities and failure modes, and subsequently to develop robust risk mitigation strategies. In this proposal, we investigate three key dimensions of this problem: 1) mitigating the over-sensitivity of deep neural networks to imperceptible perturbations known as adversarial attacks, 2) characterizing the under-sensitivity of computer vision models to large, conspicuous changes in their input, thereby identifying “blind spots” of such models, and 3) analyzing the phenomenon of hallucinations in Large Language Models (LLMs) and detecting them using internal model components without incurring significant computational overheads.
First, we analyze the over-sensitivity of deep networks to adversarial perturbations constrained within the union of Lp norm balls, and identify critical shortcomings in existing robust training techniques. We present an efficient single-step adversarial training procedure, Nuclear Curriculum Adversarial Training (NCAT), to train networks that are simultaneously robust against a union of threat models, namely the L1, L2 and L-infinity constraint sets.

Second, we analyze model under-sensitivity: we present a novel Level Set Traversal (LST) algorithm that iteratively uses orthogonal components of the local gradient to identify the “blind spots” of common vision models. We study the geometry of level sets and show that there exist linearly connected paths in input space between images that a human oracle would deem extremely disparate, even though vision models retain a near-uniform level of confidence along the same path.

Third, we investigate the detection of hallucinations in Large Language Models, that is, outputs that are fallacious or fabricated yet often appear plausible at first glance, using LLM-Check, an effective suite of techniques that relies only upon the internal hidden representations, attention similarity maps and logit outputs of an LLM. We demonstrate its efficacy across broad-ranging settings and diverse datasets: from zero-resource detection to cases where multiple model generations or external databases are available at inference time, and with varying levels of access to the original source LLM.
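
For concreteness, the following is a minimal PyTorch-style sketch of single-step (FGSM-style) adversarial training under a single L-infinity constraint, the basic setting that efficient single-step defenses build upon. It is not the NCAT procedure itself: the nuclear-norm regularization and the handling of the union of L1/L2/L-infinity threat models are omitted, and the function and parameter names are illustrative.

```python
# Minimal sketch of single-step (FGSM-style) adversarial training under an
# L-infinity constraint. This is NOT the NCAT procedure from the proposal:
# the nuclear-norm term and the union of L1/L2/L-infinity threat models
# are omitted for brevity.
import torch
import torch.nn.functional as F

def fgsm_perturb(model, x, y, eps):
    """Generate a single-step L-infinity adversarial example."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)
    grad = torch.autograd.grad(loss, x_adv)[0]
    # Step in the direction of the gradient sign, then clamp to valid pixel range.
    x_adv = x_adv.detach() + eps * grad.sign()
    return torch.clamp(x_adv, 0.0, 1.0)

def train_epoch(model, loader, optimizer, eps=8 / 255, device="cpu"):
    """One epoch of single-step adversarial training on perturbed inputs."""
    model.train()
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        x_adv = fgsm_perturb(model, x, y, eps)
        optimizer.zero_grad()
        loss = F.cross_entropy(model(x_adv), y)
        loss.backward()
        optimizer.step()
```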
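
The level-set idea can be illustrated with a simplified first-order sketch: starting from a source image, each step moves toward a target image along the component orthogonal to the local gradient of the source-class confidence, so the model's prediction stays approximately unchanged. The step size, iteration count and helper names below are assumptions, and the confidence-correction steps of the full LST algorithm are left out.

```python
# Simplified, first-order sketch of traversing a model's level set from a
# source image toward a target image: each step follows the component of
# (x_tgt - x) orthogonal to the local gradient of the source-class score,
# so confidence stays approximately constant. Not the exact LST algorithm.
import torch
import torch.nn.functional as F

def level_set_traverse(model, x_src, x_tgt, src_class, steps=200, eta=0.01):
    # x_src, x_tgt: image tensors with a batch dimension of 1.
    x = x_src.clone().detach()
    for _ in range(steps):
        x.requires_grad_(True)
        conf = F.softmax(model(x), dim=1)[0, src_class]
        grad = torch.autograd.grad(conf, x)[0]
        x = x.detach()
        # Direction toward the target image.
        d = x_tgt - x
        # Remove the component along the gradient so the step (to first
        # order) does not change the model's confidence.
        g = grad / (grad.norm() + 1e-12)
        d_orth = d - (d * g).sum() * g
        x = torch.clamp(x + eta * d_orth, 0.0, 1.0)
    return x
```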
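
As a rough illustration of the kind of logit-based signal such detectors can draw on, the sketch below scores a generated response by the average log-probability the source model assigns to its tokens, with low scores flagging potential hallucinations. This is only one illustrative uncertainty proxy, not the LLM-Check suite itself, which also leverages hidden-representation and attention-map scores; the function name is a placeholder and standard Hugging Face transformers classes are assumed.

```python
# One illustrative logit-based hallucination signal: the mean token-level
# log-probability of a generated response under the source model. This is
# not the full LLM-Check suite, which also uses hidden-state and
# attention-map based scores.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer  # assumed available

def response_logprob(model, tokenizer, prompt, response, device="cpu"):
    """Average log-probability the model assigns to `response` given `prompt`."""
    # Assumes the tokenization of `prompt` is a prefix of that of `prompt + response`.
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)
    full_ids = tokenizer(prompt + response, return_tensors="pt").input_ids.to(device)
    with torch.no_grad():
        logits = model(full_ids).logits  # shape: (1, seq_len, vocab_size)
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    targets = full_ids[:, 1:]
    token_lp = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    # Average only over the response tokens; lower values indicate lower model
    # confidence and can flag potential hallucinations.
    n_prompt = prompt_ids.shape[1]
    return token_lp[:, n_prompt - 1:].mean().item()
```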