Robust AI for Security

Yizheng Chen
Talk Series: 
12.08.2023 11:00 to 12:00

Artificial Intelligence is becoming more powerful than ever, e.g., GitHub Copilot suggests code to developers, and Large Language Model (LLM) Plugins will soon assist many tasks in our daily lives. We can utilize the power of AI to solve security problems, which needs to be robust against new attacks and new vulnerabilities.In this talk, I will first discuss how to develop robust AI techniques for malware detection. Our research finds that, after training an Android malware classifier on one year’s worth of data, the F1 score quickly dropped from 0.99 to 0.76 after 6 months of deployment on new test samples. I will present new methods to make machine learning for Android malware detection more effective against data distribution shift. My vision is, continuous learning with a human-in-the-loop setup can achieve robust malware detection. Our results show that to maintain a steady F1 score over time, we can achieve 8X reduction in labels indeed from security analysts.Next, I will discuss the potential of using large language models to solve security problems, using vulnerable source code detection as a case study. We propose and release a new vulnerable source code dataset, DiverseVul. Using the new dataset, we study 11 model architectures belonging to 4 families for vulnerability detection. Our results indicate that developing code-specific pre-training tasks is a promising research direction of using LLMs for security. We demonstrate an important generalization challenge for the deployment of deep learning-based models.In closing, I will discuss security issues of LLMs and future research directions.