AI agents can now execute complex cyber attacks — and they're getting better fast

The gap between what AI agents can do to your network and what your security team expects them to do is growing dangerously wide. New research from the UK’s AI Security Institute (AISI) shows frontier models completing sophisticated, multi-step cyber attack sequences that would take human experts hours, and the rate of improvement is accelerating.

Why this research changes the threat calculation

Most organisations still model AI-enabled cyber threats as enhanced phishing or automated scanning. The AISI evaluation paints a sharply different picture: frontier AI agents are now executing extended attack chains involving credential theft, web exploitation, and binary reverse engineering — the kind of compound operations that previously required skilled human operators.

Strategic Reality: This is not about AI writing better phishing emails. It is about AI agents autonomously navigating corporate networks through dozens of sequential attack steps, making real-time decisions at each stage.

The numbers tell a stark story. On a 32-step corporate network attack scenario called “The Last Ones,” model performance jumped from an average of 1.7 completed steps (GPT-4o, August 2024) to 9.8 steps (Opus 4.6, February 2026) — a roughly sixfold improvement in eighteen months.
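To make the "roughly sixfold" figure concrete, the arithmetic can be sketched as below. The overall factor comes straight from the two averages quoted above. The monthly rate and doubling time are an assumption layered on top: they treat the change between the two evaluated models as smooth compound growth, which two data points cannot confirm. The function name is illustrative.

```python
import math


def improvement_summary(start_steps, end_steps, months):
    """Summarise the change between two average step counts.

    Returns (overall factor, implied monthly growth rate, doubling time in months).
    The last two values ASSUME smooth compound growth between the two points.
    """
    factor = end_steps / start_steps
    monthly_growth = factor ** (1 / months) - 1
    doubling_months = months * math.log(2) / math.log(factor)
    return factor, monthly_growth, doubling_months


# Averages quoted in the article: 1.7 steps (August 2024), 9.8 steps (February 2026).
factor, monthly, doubling = improvement_summary(1.7, 9.8, 18)
print(f"{factor:.1f}x overall, {monthly:.1%} per month, doubling every {doubling:.1f} months")
```

Under that compound-growth assumption, the quoted averages work out to about 5.8x overall, about 10% a month, and a doubling time of about seven months. Treat the last two as a planning heuristic, not a forecast.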

The critical numbers

Metric	Value	Context
Steps completed (best model, average)	9.8 of 32	Opus 4.6 at 10M token budget
Steps completed (best single run)	22 of 32	Roughly 6 of the 14 hours an expert would need
Improvement from scaling compute	Up to 59%	Moving from 10M to 100M token budget
Improvement timeline	~6x gain	18 months between evaluated models

What the evaluation actually tested

AISI built two purpose-designed cyber ranges — controlled environments that simulate real attack targets without risking live systems.

“The Last Ones” is a 32-step corporate network scenario. It requires the AI agent to chain together credential theft, web application exploitation, lateral movement, and binary reverse engineering. Each step depends on successfully completing the previous one, meaning a single failure breaks the entire chain.
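A minimal sketch of why sequential dependency is so punishing follows. It assumes a deliberately simple model that is not taken from the AISI paper: every step succeeds independently with the same probability, and there are no retries. Real agents can retry and steps vary in difficulty, so read this as intuition only. Both function names are hypothetical.

```python
def expected_steps(p_step, total_steps=32):
    """Expected number of steps completed before the first failure.

    ASSUMES each step succeeds independently with probability p_step
    and that a single failure ends the run (no retries).
    """
    if not 0.0 <= p_step <= 1.0:
        raise ValueError("p_step must be between 0 and 1")
    # The run reaches step k only if all of steps 1..k succeed: p_step ** k.
    return sum(p_step ** k for k in range(1, total_steps + 1))


def full_chain_probability(p_step, total_steps=32):
    """Probability of completing every step under the same assumptions."""
    return p_step ** total_steps


for p in (0.5, 0.8, 0.9, 0.95):
    print(f"p={p}: expected steps {expected_steps(p):.1f}, "
          f"full chain {full_chain_probability(p):.1%}")
```

In this toy model, small gains in per-step reliability produce large gains in how far a run gets. That is one way to read the jump in average steps completed.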

“Cooling Tower” is a 7-step industrial control system (ICS) attack requiring the agent to reverse-engineer proprietary protocols. This scenario tests whether AI agents can operate in unfamiliar technical environments without standard documentation or tooling.

Critical Context: Neither scenario included active defenders. These results measure raw offensive capability in an uncontested environment — real-world success rates would be lower, but the capability baseline is what matters for threat modelling.

The best-performing model completed 22 of 32 steps in its strongest run, equivalent to roughly six of the fourteen hours a human expert would need for the full chain. That is not parity, but eighteen months ago the evaluated model averaged fewer than two completed steps.

The scaling problem security teams need to understand

Perhaps the most concerning finding is what happens when you simply give these models more computational resources. Increasing the token budget from 10 million to 100 million tokens improved performance by up to 59 per cent. No new training. No clever prompt engineering. Just a larger inference budget.

Strategic Insight: Compute scaling means cyber capability improvements arrive without any visible changes to the underlying models. Your threat intelligence might show the same model version, but an attacker with a larger compute budget gets materially better results.

This has direct implications for how organisations think about AI-enabled threats:

Cost of attack drops faster than cost of defence. Scaling compute is straightforward for well-resourced attackers; scaling human security teams is not.
Capability improvements are invisible. Unlike new malware signatures or exploit techniques, compute scaling leaves no fingerprint for threat intelligence teams to detect.
Evaluation half-life is shrinking. Any assessment of “what AI can do” has a shorter shelf life than most security planning cycles assume.

Who needs to act on this — and how

The AISI findings affect different stakeholders in different ways. The table below maps impact by organisational role.

Stakeholder	Primary impact	Immediate action
CISO / Security leadership	Threat model assumptions are outdated	Commission AI-specific red team assessment
Board / Risk committee	Cyber risk appetite may be miscalibrated	Request briefing on AI-augmented threat landscape
SOC / Incident response	Detection playbooks assume human attacker speed	Review detection logic for automated attack patterns
IT operations	Attack surface management gaps	Audit lateral movement controls between network segments
Procurement / Vendor management	Supply chain attack vectors expanding	Assess critical vendor AI security posture

Implementation Note: Start with your threat model. If it does not explicitly account for AI agents executing multi-step attack sequences, it is already out of date. The AISI research gives you concrete capability benchmarks to build from.

What to do about it — a practical framework

For organisations already investing in AI security

Update red team methodology. Include AI agent attack simulations alongside traditional penetration testing. The AISI cyber range approach — sequential, multi-step scenarios — provides a useful template.
Reassess network segmentation. The 32-step attack chain in “The Last Ones” depends on lateral movement between network segments. Strong segmentation breaks the chain early.
Review detection time assumptions. If your mean time to detect (MTTD) benchmarks assume human attacker speed, they need recalibrating for automated execution.
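The detection-time point can be made concrete with a back-of-the-envelope sketch. It asks how many steps of a 32-step chain an attacker completes before your mean time to detect elapses. The human pace uses the article's figure of about fourteen expert hours for the full chain. The four-hour MTTD and the two-minutes-per-step agent pace are hypothetical placeholders, not AISI measurements, so substitute your own numbers.

```python
def steps_before_detection(mttd_hours, hours_per_step, total_steps=32):
    """Steps an attacker completes before detection, capped at the chain length.

    ASSUMES a constant pace per step and detection exactly at the MTTD.
    """
    if hours_per_step <= 0:
        raise ValueError("hours_per_step must be positive")
    return min(total_steps, int(mttd_hours // hours_per_step))


MTTD_HOURS = 4.0           # hypothetical detection time
HUMAN_PACE = 14 / 32       # about 14 expert hours spread over 32 steps
AGENT_PACE = 2 / 60        # hypothetical: two minutes per step

print("human expert:", steps_before_detection(MTTD_HOURS, HUMAN_PACE), "steps")
print("automated agent:", steps_before_detection(MTTD_HOURS, AGENT_PACE), "steps")
```

With these placeholder values the human expert is about nine steps in when detection fires, while the faster attacker has already finished the chain. The useful output is the gap between the two results, not either number on its own.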

For organisations starting their AI security journey

Get a baseline assessment. You cannot manage risk you have not measured. Commission an AI-aware security audit focused on automated attack vectors.
Prioritise the fundamentals. Multi-step attacks fail when basic controls work. Patch management, credential hygiene, and network segmentation stop AI agents just as effectively as they stop human attackers — when they are actually implemented.
Build internal awareness. Share the AISI findings with your leadership team. The research is publicly available and provides concrete, credible evidence for security investment cases.

SME Advantage: Smaller organisations often have simpler network architectures, which means fewer lateral movement paths for AI agents to exploit. Use that structural simplicity as a defensive advantage — but only if your basic controls are solid.
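One rough way to check how many lateral movement paths you actually expose is to list which network segments may talk to which, then compute everything reachable from a likely entry point. The sketch below does that with a breadth-first search. The segment names and allowed flows are invented for illustration; in practice they would come from your firewall or routing policy.

```python
from collections import deque


def reachable_segments(allowed_flows, entry):
    """Return every segment reachable from `entry` by following allowed flows."""
    seen = {entry}
    queue = deque([entry])
    while queue:
        current = queue.popleft()
        for nxt in allowed_flows.get(current, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen - {entry}


# Hypothetical policies: a loosely segmented network and a tightly segmented one.
flat = {
    "office": ["servers", "finance"],
    "servers": ["finance", "backups"],
    "finance": ["backups"],
}
segmented = {"office": ["servers"], "servers": []}

print(sorted(reachable_segments(flat, "office")))
print(sorted(reachable_segments(segmented, "office")))
```

The smaller the reachable set from each plausible entry point, the fewer places a multi-step attack chain can go next.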

Four challenges hiding beneath the headlines

1. The evaluation gap

AISI tested without active defenders, which means real-world success rates are likely lower. But it also means most organisations have no idea how their specific defences perform against AI-driven attack sequences. Without AI-specific red teaming, you are guessing.

Hidden Cost: Building cyber ranges like AISI’s “The Last Ones” requires significant investment. Most organisations cannot replicate this testing independently, creating a knowledge gap between what governments know and what businesses can act on.

2. The compliance lag

UK regulatory frameworks — including the forthcoming AI Safety legislation — do not yet require organisations to assess their vulnerability to AI-enabled attacks specifically. By the time compliance catches up, the capability frontier will have moved again.

3. The skills shortage compounds

AI-augmented attacks do not eliminate the need for human security analysts; they increase it. Organisations already struggling to recruit skilled security professionals face a compounding challenge: the threats grow more sophisticated while the talent pool does not.

4. The attribution problem deepens

When an AI agent executes an attack sequence, the forensic trail looks different from the one a human operator leaves. Incident response teams trained on human attack patterns may miss or misinterpret the evidence, making attribution and containment harder.

Warning: The performance improvement of up to 59% from simply adding more compute means that today’s “AI cannot do this” assumptions may be invalid within months, not years. Plan accordingly.

The bottom line for UK businesses

The AISI evaluation is not a prediction — it is a measurement of current capability. AI agents are already executing multi-step cyber attacks with meaningful proficiency, and that proficiency is improving at a rate that should concern anyone responsible for organisational security.

Three things that matter most:

The trajectory matters more than the snapshot. A sixfold improvement in eighteen months means standing still is falling behind.
Compute scaling is the wildcard. Capability improvements that arrive without model changes are invisible to traditional threat intelligence.
Fundamentals still work. Network segmentation, credential management, and patch discipline break attack chains regardless of whether the attacker is human or artificial.

Your next steps

Review your current threat model for AI agent attack scenarios
Brief your board or leadership team on the AISI findings
Commission an AI-aware penetration test or red team exercise
Audit lateral movement controls across your network
Establish a quarterly review cycle for AI capability benchmarks

Take Action: Download the full AISI research paper and share it with your security leadership. It provides the evidence base for informed decision-making.

Source: How do frontier AI agents perform in multi-step cyber-attack scenarios?, UK AI Security Institute, March 2026.

Resultsense provides AI risk management guidance and AI strategy development for UK organisations navigating the evolving AI landscape.
