Testing AI Before It Touches the Network
As artificial intelligence systems take on a growing role in automating technical tasks, researchers are examining how those tools can be safely applied to critical infrastructure. In areas such as network and system operations, even a small error by automated software can disrupt services used by thousands of people. That risk has made many operators cautious about relying on AI tools, despite their potential to reduce the time required to diagnose and fix system problems.
At the University of Maryland’s Department of Computer Science, Ph.D. student Yajie Zhou is studying how AI agents can be designed and evaluated for use in these environments. Her research focuses on building trustworthy AI systems capable of assisting with tasks such as diagnosing failures and managing large-scale computing infrastructure while maintaining safeguards that prevent harmful outcomes.
Zhou’s work reflects a growing interest in using AI agents to automate operational tasks in complex computing systems. Modern enterprises often rely on large microservice architectures, where hundreds or thousands of interconnected services must be configured and monitored. When problems occur, engineers typically search through system logs and configuration files to identify the source of the issue.
“AI agents could examine policy files or logs and diagnose where the problem is,” Zhou said. “If the agent identifies the issue correctly, it can also suggest configuration changes or commands that operators can use to fix the system.”
However, the consequences of mistakes make the adoption of AI in this domain more complicated than in other applications.
“In system and network operations, the stakes are very high,” Zhou said. “One simple mistake that an AI agent makes could create a large outage. Because of that risk, many companies are hesitant to rely on AI tools in these environments.”
Part of the challenge is that researchers lack reliable ways to evaluate how AI agents behave in these settings.
“In many other domains, such as mathematics or coding, benchmarks use a fixed dataset,” Zhou said. “But in network and system operations we do not have many benchmarks, and static datasets can create problems such as data contamination.”
Data contamination occurs when benchmark datasets become incorporated into the training data used to build new models. When that happens, models may appear to perform well on evaluation tests because they have already encountered similar examples during training.
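The contamination problem described above can be made concrete with a toy check: what fraction of a benchmark's items already appear, after normalization, in a model's training corpus? This is an illustrative sketch, not NetArena's method; real contamination audits typically use fuzzier matching such as n-gram overlap.

```python
import hashlib

def contamination_rate(benchmark_items: list[str],
                       training_corpus: set[str]) -> float:
    """Toy contamination check: fraction of benchmark items whose
    whitespace-normalized, lowercased text already appears verbatim
    in a training corpus."""
    def norm(text: str) -> str:
        # Collapse whitespace and case, then hash for cheap set lookup.
        return hashlib.sha256(
            " ".join(text.split()).lower().encode()
        ).hexdigest()

    corpus_hashes = {norm(t) for t in training_corpus}
    hits = sum(norm(item) in corpus_hashes for item in benchmark_items)
    return hits / len(benchmark_items) if benchmark_items else 0.0
```

A model that has "seen" half the benchmark this way would score well for the wrong reason, which is exactly the failure mode a static dataset invites.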
To address these issues, Zhou developed NetArena, a benchmarking framework for evaluating AI agents in network and system environments.
NetArena was introduced in a paper accepted to the International Conference on Learning Representations (ICLR) 2026. The system gives researchers a platform for observing how AI agents handle operational tasks in emulated computing environments.
Instead of relying on a fixed dataset, NetArena dynamically generates test scenarios. The benchmark can produce thousands of new tasks, allowing researchers to evaluate AI agents across a wider range of situations and reduce statistical bias in results.
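Dynamic generation of this kind can be sketched as seeding a random fault into a small emulated topology, so that every seed yields a fresh task with a checkable answer. The names below are illustrative assumptions, not NetArena's actual API.

```python
import random

def generate_task(seed: int) -> dict:
    """Generate one hypothetical diagnosis task by injecting a random
    fault into a randomly sized microservice topology."""
    rng = random.Random(seed)  # seeded, so each task is reproducible
    services = [f"svc-{i}" for i in range(rng.randint(5, 20))]
    faulty = rng.choice(services)
    fault = rng.choice(["misconfigured_port", "missing_route", "bad_acl"])
    return {
        "services": services,
        "injected_fault": {"service": faulty, "type": fault},
        # The agent would see only logs and configs; the ground truth
        # stays hidden so the generated task can be scored automatically.
        "ground_truth": faulty,
    }

# Different seeds yield different tasks, so no fixed answer key exists
# to leak into a model's training data.
tasks = [generate_task(s) for s in range(1000)]
```

Because tasks are parameterized rather than memorized, an agent that merely recalls benchmark answers gains nothing, which is the point of moving away from a static dataset.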
Another feature of NetArena is its ability to evaluate both correctness and safety.
“The feedback from the emulator allows us to evaluate not only whether the agent produced the correct answer, but also whether the action is safe for the system,” Zhou said. “Safety is the biggest concern for operators before they consider using AI.”
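The two-axis evaluation Zhou describes might be scored roughly as follows: correctness compares the agent's diagnosis against the injected fault, while safety checks whether the emulator flagged any harmful side effects of the agent's actions. Field names here are assumptions for illustration, not NetArena's real schema.

```python
def score_episode(agent_answer: str, ground_truth: str,
                  emulator_alerts: list[str]) -> dict:
    """Score one episode on both axes. An episode passes only if the
    diagnosis is right AND the emulator raised no safety alerts."""
    correct = agent_answer == ground_truth
    # Any alert (e.g., dropped traffic, broken reachability) marks the
    # episode unsafe, even when the final answer was correct.
    safe = len(emulator_alerts) == 0
    return {"correct": correct, "safe": safe, "passed": correct and safe}
```

Treating safety as a hard gate rather than a weighted term mirrors the operator's perspective quoted above: a correct answer reached through a disruptive action is still a failure.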
Zhou is advised by Assistant Professor Alan Liu in the Department of Computer Science. Her work on NetArena also placed first in the coding agent category at the Berkeley AgentX AgentBeats competition, which evaluates the performance of AI agents across several technical tasks.
Zhou was also recently awarded an Ann G. Wylie Dissertation Fellowship from UMD, which supports doctoral candidates completing dissertations that contribute to their fields.
For Zhou, the acceptance of her NetArena paper at ICLR 2026 marked an important step in connecting research communities working in both networking systems and artificial intelligence.
“I mostly work in networking and systems research,” Zhou said. “This was my first time publishing in a major AI conference. Having recognition from both communities means a lot because the work sits between these two domains.”
Looking ahead, Zhou hopes her research will influence how developers and operators evaluate AI systems intended for critical infrastructure.
“Accuracy alone is not enough,” Zhou said. “In real systems, safety must come first.”
—Story by Samuel Malede Zewdu, CS Communications
The Department welcomes comments, suggestions and corrections. Send email to editor [-at-] cs [dot] umd [dot] edu.
