PhD Proposal: Towards Trustworthy and Capable Language Models

Talk
Abhimanyu Hans
Time: 08.07.2025, 11:30 to 13:00

The rapid adoption of large language models (LLMs) across various domains has resulted in a significant proportion of digital text being algorithmically generated. This phenomenon raises critical concerns, including social media manipulation, propaganda amplification, and a growing volume of low-quality digital clutter, creating an urgent need for robust third-party detectors of LLM-generated text. Concurrently, as LLM training corpora expand, models inadvertently memorize training data, posing significant privacy risks, such as leaking personally identifiable information, as well as the risk of copyright infringement.
Addressing these challenges, this work aims to enhance the trustworthiness of language models through two novel contributions. First, we introduce Binoculars, a post-hoc, black-box detector designed to identify LLM-generated text. Unlike traditional methods that rely solely on perplexity -- which cannot distinguish genuinely surprising text from perplexity induced by complex, unseen prompts -- Binoculars uses a novel "cross-perplexity" metric that contrasts the predictive distributions of two distinct LLMs. This calibrates detection against prompt-induced complexity and achieves state-of-the-art performance, correctly identifying up to 98% of LLM-generated samples at a false positive rate of 0.01%.
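The scoring idea above can be sketched numerically. The following is a minimal, illustrative sketch (not the authors' implementation): it assumes two models' next-token log-probability matrices are already available as NumPy arrays, and all function names here are hypothetical. The score divides the observer model's log-perplexity on the text by the cross-perplexity between the two models, so text that is unsurprising relative to the models' mutual disagreement scores low:

```python
import numpy as np

def log_softmax(logits):
    # Numerically stable log-softmax over the vocabulary axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def log_perplexity(observer_logprobs, token_ids):
    # Mean negative log-likelihood of the observed tokens under the observer.
    nll = -observer_logprobs[np.arange(len(token_ids)), token_ids]
    return nll.mean()

def cross_perplexity(observer_logprobs, performer_logprobs):
    # Mean cross-entropy between the performer's next-token distribution
    # and the observer's, averaged over positions.
    performer_probs = np.exp(performer_logprobs)
    return -(performer_probs * observer_logprobs).sum(axis=-1).mean()

def binoculars_score(observer_logprobs, performer_logprobs, token_ids):
    # Low scores suggest machine-generated text: the passage is cheap for
    # the observer to predict relative to how much the two models disagree.
    return log_perplexity(observer_logprobs, token_ids) / cross_perplexity(
        observer_logprobs, performer_logprobs)
```

In practice the two matrices would come from two related LLMs scoring the same token sequence; a threshold on the score then separates human from machine text.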
Second, we propose Goldfish Loss, a training recipe derived from first principles to explicitly mitigate verbatim memorization. Goldfish Loss pseudorandomly omits a subset of tokens from the next-token prediction loss during training. Because the model never trains on these omitted positions, at inference time it diverges from the original verbatim sequences when it reaches them, substantially reducing memorization risks.
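The masking step can be illustrated with a short sketch. This is a simplified stand-in, not the proposed recipe: the actual method derives the drop decision from a hash of the local context so that duplicate passages always mask the same tokens, whereas here hashing only the preceding token ID serves as a toy version of that idea, and the function names are hypothetical:

```python
import numpy as np

def goldfish_mask(token_ids, k=4):
    # True = position contributes to the loss; roughly 1/k positions are
    # dropped. Hashing the previous token makes the mask deterministic for
    # a given sequence (a simplified proxy for context-hash masking).
    mask = np.ones(len(token_ids), dtype=bool)
    for i, prev in enumerate(token_ids[:-1], start=1):
        if hash((int(prev),)) % k == 0:
            mask[i] = False
    return mask

def goldfish_loss(logprobs, token_ids, k=4):
    # Standard next-token negative log-likelihood, averaged only over the
    # positions that survive the goldfish mask.
    nll = -logprobs[np.arange(len(token_ids)), token_ids]
    mask = goldfish_mask(token_ids, k)
    return nll[mask].mean()
```

Since the masked positions never receive a gradient, the model cannot reproduce the training sequence verbatim: at each dropped position it must fall back on its general language distribution rather than a memorized token.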
Overall, our work identifies critical shortcomings in the current LLM ecosystem and provides effective methods to increase public trust in both the use and the development of these powerful models.