PhD Defense: Identifying and Mitigating Bias in Machine Learning for Healthcare

Talk
Daniel Smolyak
Time: 07.16.2025, 10:00 to 12:00

The use of machine learning in healthcare settings has become increasingly common, from predicting individual patient outcomes to supporting policy decisions by public health officials. However, these machine learning models often replicate or exacerbate human biases and discrimination. In this dissertation, we seek to address this problem both by identifying bias in existing healthcare modeling settings and by developing approaches to mitigate it, focusing on several complementary problems.
We audit predictive models of county-level COVID-19 cases, examining whether models perform equally well across counties with different demographic compositions a) when human mobility data is included as a model feature and b) when various approaches are used to correct for case underreporting. We also investigate approaches to improve model performance, particularly for small subgroups. We develop a regression model for joint estimation across multiple groups that uses sample weighting and separate sparsity penalties to boost performance for smaller groups. We then outline an easy-to-implement LLM-based synthetic data generation method for augmenting smaller, underrepresented groups in health datasets, conducting a comprehensive evaluation of two prompt templates and three LLMs across two health datasets. Lastly, we present a novel use of causal machine learning methods to identify sociodemographic subgroups with heterogeneous racial health disparities.
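To make the joint-estimation idea concrete, here is a minimal sketch of one plausible formulation, not necessarily the dissertation's exact model: a shared coefficient vector plus sparse per-group offsets, with per-group sample weights and separate L1 penalties, fit by proximal gradient descent. All names here (fit_joint_sparse, soft_threshold, the sample_weights and lams arguments) are illustrative assumptions.

```python
import numpy as np

def soft_threshold(z, t):
    """Proximal operator of the L1 norm (elementwise soft-thresholding)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def fit_joint_sparse(Xs, ys, sample_weights, lams, lr=1e-2, n_iter=2000):
    """Jointly fit a shared coefficient vector plus per-group sparse offsets.

    Xs, ys: lists of per-group design matrices / target vectors.
    sample_weights: one scalar weight per group (e.g. upweighting small groups).
    lams: one L1 penalty per group, applied to that group's offset.
    Minimizes
        sum_g w_g / n_g * ||y_g - X_g (beta0 + delta_g)||^2
            + sum_g lam_g * ||delta_g||_1
    by proximal gradient descent.
    """
    p = Xs[0].shape[1]
    G = len(Xs)
    beta0 = np.zeros(p)                   # shared coefficients
    deltas = [np.zeros(p) for _ in range(G)]  # group-specific offsets
    for _ in range(n_iter):
        grad0 = np.zeros(p)
        grads = []
        for g in range(G):
            r = ys[g] - Xs[g] @ (beta0 + deltas[g])
            gg = -2.0 * sample_weights[g] / len(ys[g]) * (Xs[g].T @ r)
            grad0 += gg
            grads.append(gg)
        beta0 -= lr * grad0               # shared part: plain gradient step
        for g in range(G):                # offsets: gradient step, then prox
            deltas[g] = soft_threshold(deltas[g] - lr * grads[g], lr * lams[g])
    return beta0, deltas

# Tiny usage example with synthetic data: two groups of very different sizes.
rng = np.random.default_rng(0)
X_big, X_small = rng.normal(size=(500, 5)), rng.normal(size=(30, 5))
beta_true = np.array([1.0, -2.0, 0.0, 0.0, 0.5])
y_big = X_big @ beta_true + rng.normal(scale=0.1, size=500)
y_small = X_small @ (beta_true + np.array([0.0, 0.0, 1.5, 0.0, 0.0])) \
    + rng.normal(scale=0.1, size=30)
beta0, deltas = fit_joint_sparse(
    [X_big, X_small], [y_big, y_small],
    sample_weights=[1.0, 500 / 30],       # upweight the small group
    lams=[0.5, 0.5],
)
```

Under this formulation, the shared vector captures structure common to all groups while each group's sparse offset corrects only where it genuinely differs, letting small groups borrow strength from large ones.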
Given structural inequities in the allocation of health resources to marginalized communities and persistent disparities across a wide range of health outcomes, it is important both to prevent machine learning systems from causing further harm by perpetuating allocation inequities and to leverage machine learning approaches to actively correct these harms.