PhD Proposal: Mechanistic and Black-Box Interpretability for Attribution in Neural Networks
Location: IRB 5105 · Zoom: https://umd.zoom.us/j/9500203024
With the introduction of larger and more powerful neural networks each year, the inner workings of these systems become increasingly opaque. The attribution problem for neural network models is concerned with identifying the specific parts of the input or model that are responsible for a specified model output or behavior. Many problems in interpretability, such as constructing saliency maps, discovering semantically important directions in representation space, and finding sub-networks responsible for a given model behavior, can all be understood as specific instances of the attribution problem in its broadest sense. My research attacks the problem from both white-box (mechanistic) and black-box perspectives. From a mechanistic perspective, I have proposed a new masking method for CNNs that enhances the fidelity of input attributions in downstream applications, as well as decomposition-based techniques for both model and input attributions in ViTs and ViT-CNN hybrids. From a black-box perspective, using gradient-based methods, I have shown the existence of large connected regions of input space that span distinct image classes yet leave the model output unchanged. I have also investigated the usefulness of chain-of-thought reasoning for input attribution in multimodal LLMs, showing significant reliability gaps even relative to text-only LLMs. I aim to continue my research on interpretability methods, with greater focus on novel paradigms such as agentic systems and reasoning models.
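To make the attribution problem concrete, here is a minimal sketch of one of its simplest instances: a vanilla input-gradient saliency map, where each pixel's attribution is the magnitude of the gradient of the predicted-class logit with respect to that pixel. The toy CNN, input size, and class count below are illustrative placeholders, not any of the methods proposed in this work.

    # Minimal sketch: gradient-based input attribution (vanilla saliency map).
    # The model and input are placeholders; any differentiable classifier works.
    import torch
    import torch.nn as nn

    model = nn.Sequential(            # toy CNN standing in for an arbitrary image classifier
        nn.Conv2d(3, 8, kernel_size=3, padding=1),
        nn.ReLU(),
        nn.AdaptiveAvgPool2d(1),
        nn.Flatten(),
        nn.Linear(8, 10),
    )
    model.eval()

    x = torch.rand(1, 3, 32, 32, requires_grad=True)  # placeholder input image
    logits = model(x)
    target = logits.argmax(dim=1).item()              # attribute the predicted class

    # Attribution: gradient of the target-class logit w.r.t. the input pixels.
    logits[0, target].backward()
    saliency = x.grad.abs().max(dim=1).values         # per-pixel importance, shape (1, 32, 32)
    print(saliency.shape)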