In the fast-evolving world of cybersecurity, defenders face a daunting challenge: designing effective automated systems that can adapt to increasingly sophisticated attacks. Traditional approaches rely heavily on subject matter experts to craft reward structures for reinforcement learning agents, a process that becomes overwhelming as network complexity grows. Researchers at Pacific Northwest National Laboratory have developed a novel solution that leverages large language models (LLMs) to generate these rewards, potentially transforming how autonomous cyber defense systems are built and deployed. This approach addresses a critical bottleneck in AI-driven security, where human expertise alone struggles to keep pace with dynamic threats.
The key finding from this research is that LLM-guided reward designs can produce effective defense strategies against diverse adversarial behaviors. The study demonstrates that an LLM, specifically Claude Sonnet 4, can translate qualitative behavioral traits of cyber attackers and defenders into quantitative reward structures. For example, when tasked with making a defender more proactive, the LLM suggested higher incentives for decoy placements and penalties for inaction, as shown in Tables 3 and 4 of the paper. Against stealthy and aggressive attackers, these LLM-informed rewards led to defense policies that delayed the attacker's first impact by up to 18 time steps in the worst-case scenario, outperforming baseline human-designed rewards in several scenarios.
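To make that mapping concrete, here is a minimal sketch of how persona traits might translate into per-action reward terms. The key names and numeric values are illustrative assumptions, not the actual structures from the paper's Tables 3 and 4:

```python
# Illustrative sketch only: hypothetical reward terms for a baseline
# defender versus an LLM-tuned "proactive" persona. The paper's actual
# reward structures appear in its Tables 3 and 4.

BASELINE_REWARDS = {
    "deploy_decoy": 1.0,       # modest incentive to place decoys
    "do_nothing": 0.0,         # inaction is neutral in the baseline
    "attacker_impact": -10.0,  # large penalty when the attacker succeeds
}

PROACTIVE_REWARDS = {
    "deploy_decoy": 3.0,       # higher incentive for decoy placement
    "do_nothing": -0.5,        # explicit penalty for inaction
    "attacker_impact": -10.0,
}

def step_reward(action: str, attacker_impacted: bool, rewards: dict) -> float:
    """Defender's reward for a single time step under a given persona."""
    r = rewards.get(action, 0.0)
    if attacker_impacted:
        r += rewards["attacker_impact"]
    return r
```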
The methodology centers on an LLM-assisted framework within a deep reinforcement learning (DRL) environment called Cyberwheel. Researchers first provided the LLM with contextual information about the simulation environment, including network topology and baseline reward structures from subject matter experts. They then crafted prompts to generate rewards for different agent personas, such as aggressive or stealthy attackers and proactive defenders. These rewards were integrated into a DRL training process using proximal policy optimization (PPO), with hyperparameters detailed in Table 1. The LLM's role was to map behavioral constraints into numerical rewards, which were then used to train defense agents in a 15-host network simulation, as illustrated in Figure 1.
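A minimal sketch of that prompt-to-reward pipeline, under the assumption that the LLM returns a JSON reward mapping, might look like the following. The `query_llm` callable, the prompt wording, and the key layout are hypothetical stand-ins rather than the paper's actual implementation:

```python
import json

def build_prompt(env_context: str, baseline_rewards: dict, persona: str) -> str:
    """Combine simulation context, the SME baseline rewards, and the
    desired persona into a single reward-design prompt."""
    return (
        f"Simulation environment:\n{env_context}\n\n"
        f"Baseline reward structure:\n{json.dumps(baseline_rewards, indent=2)}\n\n"
        f"Return a JSON reward structure with the same keys that makes "
        f"the defender behave in a {persona} manner."
    )

def generate_persona_rewards(env_context: str, baseline_rewards: dict,
                             persona: str, query_llm) -> dict:
    """Ask the LLM (e.g., Claude Sonnet 4) for persona-specific rewards
    and parse its JSON reply into the mapping consumed by training."""
    prompt = build_prompt(env_context, baseline_rewards, persona)
    return json.loads(query_llm(prompt))
```

The returned mapping would then replace the baseline reward table when training the blue agent with PPO.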
Results from the experiments reveal nuanced insights into how reward structures influence defense behavior. Figure 7 shows action distributions for blue agents against different red personas, indicating that proactive defenders deployed more decoys against aggressive attackers. For instance, the proactive-v1 agent placed decoys 39.5% of the time against aggressive attackers, compared to 35.1% against stealthy ones. Figure 8 presents complementary empirical cumulative distributions of time to first impact, with the 95th percentile values serving as a robust metric. Against stealthy attackers, the proactive-v2 blue agent achieved a 95th percentile delay of 18 time steps, outperforming other configurations. These data points suggest that LLM-designed rewards can tailor defense strategies to specific threat types, enhancing overall resilience.
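As a worked example of the metric, the 95th percentile of time to first impact can be computed directly from per-episode evaluation data. The episode times below are made up purely for illustration:

```python
import numpy as np

def p95_time_to_first_impact(times):
    """95th percentile of time-to-first-impact across evaluation episodes:
    the time step by which the attacker had achieved its first impact in
    95% of runs. Larger values mean the defense delayed the attacker longer."""
    return float(np.percentile(times, 95))

# Illustrative only: per-episode first-impact times for one blue/red pairing.
episode_times = [4, 6, 7, 9, 11, 12, 14, 15, 17, 18]
print(p95_time_to_first_impact(episode_times))  # 17.55
```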
The implications of this research extend beyond academic interest, offering practical benefits for real-world cybersecurity operations. By automating reward design, organizations can reduce the cognitive burden on experts and accelerate the development of adaptive defense systems. The study's conclusion highlights that a mixed strategy, where defense agents switch personas based on the attacker type, may be most effective. This aligns with the need for flexible, explainable AI tools in critical infrastructure protection, as emphasized in recent policy frameworks like America's AI Action Plan. For everyday readers, this means more robust digital defenses that can autonomously respond to evolving threats without constant human oversight.
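A mixed strategy of that kind could be as simple as routing observations to whichever trained policy matches the inferred attacker type. The sketch below assumes hypothetical `load_policy` and `classify_attacker` helpers and invented checkpoint names; the paper proposes the idea but does not prescribe an implementation:

```python
# Hypothetical sketch of a persona-switching defense. load_policy and
# classify_attacker are illustrative stand-ins, not the paper's code.

def make_mixed_defender(load_policy, classify_attacker):
    policies = {
        "aggressive": load_policy("blue_proactive_v1.ckpt"),  # trained vs aggressive red
        "stealthy": load_policy("blue_proactive_v2.ckpt"),    # trained vs stealthy red
    }

    def act(observation):
        attacker_type = classify_attacker(observation)  # e.g., from alert rates
        return policies[attacker_type].act(observation)

    return act
```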
Despite these promising results, the research has notable limitations. The study focused on a heuristic-based attacker rather than a learning agent, which may not fully capture the adaptive nature of real-world adversaries. Additionally, the experiments were conducted in a simulated environment with a 15-host network, and scaling to larger, more complex state spaces remains an open question. Future work, as mentioned in the paper, could involve enabling both attackers and defenders to be learning agents, incorporating multiple agents, and refining rewards through an LLM-in-the-loop approach. These steps are necessary to validate the approach's robustness in dynamic, real-world scenarios where cyber threats continuously evolve.
Original Source
Read the complete research paper
About the Author
Guilherme A.
Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.
Connect on LinkedIn