PhD Defense: Uncovering, Understanding, and Mitigating Social Biases in Language Models
IRB-4109
This dissertation investigates how language models, including contemporary LLMs, can perpetuate social biases related to gender, race, and ethnicity as inferred from first names. Guided by the principle of counterfactual fairness, we use name substitution to uncover, understand, and mitigate these biases across three domains: stereotypes about personal attributes, occupational bias, and overgeneralized assumptions about romantic relationships.
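To illustrate the name-substitution methodology, below is a minimal probing sketch. The template, first names, and fill-mask model are hypothetical stand-ins chosen for illustration, not the dissertation's actual evaluation setup.

```python
from transformers import pipeline

# Hypothetical probe: any fill-mask model works; bert-base-uncased is an
# illustrative choice, not the model evaluated in the dissertation.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

TEMPLATE = "{name} is known for being [MASK]."   # illustrative template
NAMES = ["Emily", "DeShawn", "Priya", "Carlos"]  # illustrative first names

for name in NAMES:
    predictions = fill_mask(TEMPLATE.format(name=name), top_k=3)
    # Under counterfactual fairness, the predicted attributes should not
    # shift merely because the name changes; divergence signals bias.
    print(name, [p["token_str"].strip() for p in predictions])
```

Comparing the predicted attribute lists across names gives a simple counterfactual check: identical contexts that differ only in the name should yield comparable outputs.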
By analyzing model behavior across diverse names, this dissertation reveals patterns of unfair treatment, such as personality judgments in social commonsense reasoning that track demographic associations; hiring discrimination based on gender, race, and ethnicity; and heteronormative bias in relationship predictions. To address these issues, we propose open-ended diagnostic frameworks, interpretability analyses based on contextualized embeddings, and a novel consistency-guided finetuning method.
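As a rough illustration of consistency-guided finetuning, the sketch below shows a generic consistency objective over name-swapped inputs; the function name, next-token framing, and symmetric-KL loss are assumptions made for illustration and are not the dissertation's actual formulation.

```python
import torch.nn.functional as F

def name_consistency_loss(model, tokenizer, template, name_a, name_b):
    """Symmetric KL divergence between a causal LM's next-token
    distributions for two name-substituted inputs (generic sketch)."""
    def next_token_log_probs(name):
        inputs = tokenizer(template.format(name=name), return_tensors="pt")
        logits = model(**inputs).logits[:, -1, :]
        return F.log_softmax(logits, dim=-1)

    p = next_token_log_probs(name_a)
    q = next_token_log_probs(name_b)
    # Penalize divergence so counterfactual names are treated equivalently.
    return 0.5 * (F.kl_div(p, q, log_target=True, reduction="batchmean")
                  + F.kl_div(q, p, log_target=True, reduction="batchmean"))
```

In this generic form, the loss can be added to a standard finetuning objective so that predictions remain stable under counterfactual name swaps.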
Together, these contributions aim to build fairer, more interpretable, and more inclusive language technologies.