Your AI safety defences just got a $330 stress test

The UK AI Safety Institute has published research that should make every AI deployment team pause. Its researchers built an automated attack that gets through Anthropic’s strongest safety classifier with a 25.5% average success rate and OpenAI’s GPT-5 input classifier with a 75.6% average success rate. The cost? Around $330 and three days of compute time, and nearer $210 against OpenAI’s classifier. That price point changes the threat model for every organisation running AI systems.

What the AI Safety Institute actually found

The research, published on 17 February 2026 by the AISI Red Team, introduces a technique called boundary point jailbreaking (BPJ). It works by finding the exact threshold where an AI safety classifier switches from “allow” to “block” and then systematically probing that boundary to slip harmful content through.

Two mechanisms make it effective. First, curriculum learning gradually escalates content from benign to harmful, training the attack itself to understand how classifiers respond to incremental changes. Second, boundary point discovery identifies the precise decision edge where tiny adjustments in attack strength produce detectable signals that guide the algorithm closer to a bypass.

Critical context: This is the first automated attack to break Anthropic’s Constitutional Classifiers. Over 3,700 hours of human red-teaming previously produced only a single successful bypass. An automated system matched and exceeded that in days.

The numbers are worth sitting with. Against Anthropic’s system, BPJ achieved 25.5% average success and 39.9% at peak. Against OpenAI’s GPT-5 input classifier, average success hit 75.6% with a maximum of 94.3%. These are black-box attacks requiring no knowledge of the model’s internals.

Target system	Average success rate	Maximum success rate	Cost	Queries required
Anthropic Constitutional Classifiers	25.5%	39.9%	~$330	660,000
OpenAI GPT-5 Input Classifier	75.6%	94.3%	~$210	800,000
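
To put the table in per-query terms, the short sketch below divides each approximate cost by the number of queries. The figures are copied from the table; the function name and the per-1,000-queries unit are illustrative choices, not something the AISI research reports.

```python
# Figures copied from the table above; the costs are the article's approximations.
attacks = {
    "Anthropic Constitutional Classifiers": {"cost_usd": 330, "queries": 660_000},
    "OpenAI GPT-5 Input Classifier": {"cost_usd": 210, "queries": 800_000},
}


def cost_per_thousand_queries(cost_usd: float, queries: int) -> float:
    """Return the approximate attack cost per 1,000 classifier queries."""
    return cost_usd / queries * 1000


for name, row in attacks.items():
    per_k = cost_per_thousand_queries(row["cost_usd"], row["queries"])
    print(f"{name}: ~${per_k:.2f} per 1,000 queries")
```

On these numbers each probe costs a small fraction of a penny, which is why per-query pricing alone is unlikely to deter this kind of attack.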

Why single-layer safety is a liability

The uncomfortable truth this research exposes is straightforward: if your AI safety strategy depends on a single classifier standing between your model and harmful outputs, you have a single point of failure. And that failure now has a known price tag.

Most enterprise AI deployments treat safety classifiers as binary gates. Content goes in, a decision comes out, and the system trusts that decision absolutely. BPJ demonstrates that this binary model breaks down at the margins. Every classifier has a decision boundary, and decision boundaries can be mapped algorithmically.

Strategic insight: At roughly $330 to breach Anthropic’s classifier and $210 to breach OpenAI’s, this is no longer a sophisticated nation-state capability. A motivated individual with a credit card can run these attacks. Your threat model needs updating.

This is not a theoretical concern. The AISI researchers ran these attacks against production-grade systems from the two most prominent AI safety labs. If the strongest commercially available defences fall to a $330 automated attack, organisations relying on less robust safety layers should be asking hard questions right now.

The research also shows how much traffic these attacks generate. BPJ required 660,000 queries against Anthropic’s system and 800,000 against OpenAI’s. That volume of interaction should, in theory, be detectable, but most organisations are not monitoring for the patterns these attacks produce.

What batch-level monitoring actually means for your organisation

The AISI researchers recommend a specific defensive approach: batch-level monitoring. Rather than evaluating individual interactions in isolation, this strategy looks at patterns across multiple interactions to detect suspicious behaviour.

Think of it this way. A single query that sits near a classifier’s decision boundary looks unremarkable. But 660,000 queries that systematically probe that boundary create a distinctive pattern. Batch-level monitoring is designed to catch exactly this kind of distributed attack.

Implementation note: Batch-level monitoring requires infrastructure that most organisations have not built. You need logging that captures interaction patterns over time, analytics that can identify systematic probing, and alert systems that trigger on statistical anomalies rather than individual content flags.
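
As a minimal sketch of what an alert on statistical anomalies rather than individual content flags could look like, the function below assumes your own logs store a classifier confidence score between 0 and 1 for every request, alongside the API key that sent it. That log format, the 0.5 threshold, and every parameter name here are assumptions for illustration; the AISI research does not prescribe an implementation.

```python
from collections import defaultdict
from typing import Iterable


def flag_boundary_probing(
    records: Iterable[tuple[str, float]],
    threshold: float = 0.5,          # assumed allow/block cut-off of your classifier
    margin: float = 0.05,            # how close to the cut-off counts as "near"
    min_queries: int = 100,          # ignore keys with too little traffic to judge
    max_near_fraction: float = 0.3,  # alert above this share of near-boundary queries
) -> dict[str, float]:
    """Return {api_key: near-boundary fraction} for keys whose traffic clusters
    around the classifier's decision threshold."""
    totals: dict[str, int] = defaultdict(int)
    near: dict[str, int] = defaultdict(int)
    for api_key, score in records:
        totals[api_key] += 1
        if abs(score - threshold) <= margin:
            near[api_key] += 1

    flagged = {}
    for api_key, total in totals.items():
        fraction = near[api_key] / total
        if total >= min_queries and fraction > max_near_fraction:
            flagged[api_key] = fraction
    return flagged


# Hypothetical log: one ordinary key, one key that hugs the boundary.
log = [("key-a", 0.1)] * 150 + [("key-b", 0.48)] * 120 + [("key-b", 0.9)] * 30
print(flag_boundary_probing(log))
```

No single request from the second key would look unusual, but the share of its traffic sitting next to the threshold does. The thresholds would need tuning against your own baseline traffic.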

For UK organisations, this has immediate practical implications. If you are deploying customer-facing AI systems, your safety architecture probably includes some form of content classification. The question is whether you are also monitoring for the interaction patterns that indicate someone is testing those classifiers systematically.

Three capabilities matter here:

Query pattern analysis that detects sustained probing of specific topic areas or classifier boundaries over time
Rate and volume monitoring that flags unusual interaction patterns from individual users or API keys
Cross-session correlation that links related interactions even when they come from different accounts or entry points

Most commercial AI platforms provide basic rate limiting. Few provide the kind of behavioural analysis that would catch BPJ-style attacks.
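
The rate and volume capability is the simplest of the three to prototype. The sketch below keeps a sliding window of request timestamps per API key and reports when a key exceeds a limit. The class name, the window length, and the limit are hypothetical, and it assumes timestamps arrive in non-decreasing order for each key.

```python
from collections import defaultdict, deque


class VolumeMonitor:
    """Sliding-window query counter per API key (illustrative sketch only)."""

    def __init__(self, window_seconds: float, max_queries: int) -> None:
        self.window_seconds = window_seconds
        self.max_queries = max_queries
        self._events: dict[str, deque] = defaultdict(deque)

    def record(self, api_key: str, timestamp: float) -> bool:
        """Record one query and return True if the key is now over the limit."""
        events = self._events[api_key]
        events.append(timestamp)
        # Drop timestamps that have fallen out of the window.
        while events and events[0] <= timestamp - self.window_seconds:
            events.popleft()
        return len(events) > self.max_queries


# Hypothetical limit: more than 3 queries in any 60-second window raises a flag.
monitor = VolumeMonitor(window_seconds=60, max_queries=3)
for t in (0.0, 1.0, 2.0, 3.0):
    over_limit = monitor.record("key-a", t)
print(over_limit)
```

A monitor like this only sees one key at a time, so it is a starting point for volume alerts rather than a substitute for behavioural analysis.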

The human factor and why technical fixes are not enough

There is a dimension to this research that goes beyond technical controls. BPJ works because it automates what was previously a manual process: finding the gaps in safety systems through sustained, systematic effort. The automation is what changes the economics.

Reality check: 3,700 hours of expert human red-teaming found one bypass. An automated system found multiple bypasses in days at a fraction of the cost. Automation does not just reduce effort. It fundamentally changes who can mount these attacks and how often.

For organisations making deployment decisions, this creates a governance challenge. Technical teams may assess risk based on the difficulty of known attacks. That assessment needs updating. The question is no longer “how hard is it to break our safety systems?” but “how much does it cost, and is that cost low enough to attract casual adversaries?”

Stakeholder	Primary concern	Recommended action
CTO / CIO	Architecture gaps in AI safety layers	Audit current AI deployments for single-classifier dependencies
CISO	Threat model underestimates automated adversaries	Update threat assessments to include low-cost automated attacks
Head of AI / ML Engineering	Monitoring blind spots	Implement batch-level interaction monitoring
Risk and Compliance	Regulatory exposure if safety bypassed	Document defence-in-depth measures and incident response plans
Board / C-Suite	Reputational risk from AI safety failure	Request quarterly AI safety posture briefings

Building defence in depth before the next attack

The AISI’s findings point to a clear strategic direction: layered defences. No single safety mechanism should carry the full weight of preventing harmful outputs. Here is a practical framework for moving toward defence-in-depth.

Immediate (this quarter):

Audit all production AI systems for single-classifier dependencies
Enable or implement basic interaction logging and rate monitoring
Review vendor safety documentation and ask specific questions about multi-layer defences
Brief your security team on BPJ and similar automated attack techniques

Short-term (next two quarters):

Deploy batch-level monitoring for high-risk AI systems
Establish baseline interaction patterns so anomalies become detectable
Build or procure cross-session correlation capabilities
Run tabletop exercises simulating AI safety bypass scenarios
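
Establishing a baseline so that anomalies become detectable can start very simply. The sketch below compares an observed query count with the mean and standard deviation of historical counts for the same key or system. The function name, the z-score threshold of 3, and the sample figures are all assumptions; real traffic is often seasonal and may need a more robust model.

```python
from statistics import mean, stdev


def exceeds_baseline(baseline_counts: list[float], observed: float,
                     z_threshold: float = 3.0) -> bool:
    """Return True if `observed` sits more than `z_threshold` standard
    deviations above the historical mean. Needs at least two baseline points."""
    mu = mean(baseline_counts)
    sigma = stdev(baseline_counts)
    if sigma == 0:
        # Perfectly flat history: any increase is treated as anomalous.
        return observed > mu
    return (observed - mu) / sigma > z_threshold


# Hypothetical daily query counts for one API key over the previous week.
history = [100, 110, 90, 105, 95]
print(exceeds_baseline(history, 104))
print(exceeds_baseline(history, 5000))
```

The value of the baseline is that the alert threshold follows each system's normal behaviour instead of relying on one global limit.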

Medium-term (6-12 months):

Integrate AI safety monitoring into your existing SOC or security operations
Develop automated response playbooks for detected probing attempts
Engage with industry groups and the AISI on evolving threat intelligence
Build internal red-teaming capability or contract specialist assessors

Success factor: Organisations that treat AI safety as a security discipline rather than a compliance checkbox will adapt fastest. The AISI research demonstrates that AI safety defences face the same adversarial dynamics as any other security boundary.
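
To make the layered idea in this section concrete, the sketch below runs a request through several independent checks and blocks it as soon as any one of them objects, recording which layer fired. The two checks shown are deliberately trivial placeholders standing in for real classifiers, and all names are hypothetical.

```python
from typing import Callable, NamedTuple, Optional


class Verdict(NamedTuple):
    allowed: bool
    blocked_by: Optional[str]  # name of the layer that blocked, if any


# A check returns True when it allows the text through.
Check = Callable[[str], bool]


def run_layers(text: str, layers: list[tuple[str, Check]]) -> Verdict:
    """Apply each layer in order; the first layer to object blocks the request."""
    for name, check in layers:
        if not check(text):
            return Verdict(False, name)
    return Verdict(True, None)


# Placeholder layers for illustration only; real layers would be separate
# classifiers, output filters, or policy checks.
layers = [
    ("length_limit", lambda text: len(text) < 2000),
    ("blocklist", lambda text: "forbidden" not in text.lower()),
]
print(run_layers("A harmless question", layers))
print(run_layers("Something FORBIDDEN", layers))
```

Logging which layer blocked each request also feeds the monitoring described earlier, because a request that passes one layer and fails another is a useful signal in its own right.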

Four challenges nobody is talking about

The query volume problem. BPJ required hundreds of thousands of queries. Blocking this at the API level seems obvious, but sophisticated attackers will distribute queries across multiple accounts, IP addresses, and time windows. Simple rate limiting is necessary but not sufficient.
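
One way to look past per-account limits is to group requests by what they say rather than who sent them. The sketch below normalises each query into a crude fingerprint and reports fingerprints seen from several distinct accounts. The normalisation rules, the function names, and the three-account threshold are assumptions for illustration, and exact matching like this would miss paraphrased probes that a similarity-based approach might catch.

```python
import hashlib
import re
from collections import defaultdict
from typing import Iterable


def fingerprint(text: str) -> str:
    """Collapse case, digits, punctuation and spacing so near-identical
    queries map to the same short hash."""
    normalised = re.sub(r"\d+", "0", text.lower())
    normalised = re.sub(r"[^a-z0 ]+", " ", normalised)
    normalised = " ".join(normalised.split())
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()[:16]


def shared_fingerprints(queries: Iterable[tuple[str, str]],
                        min_accounts: int = 3) -> dict[str, set[str]]:
    """Return {fingerprint: account ids} for query shapes sent by at least
    `min_accounts` different accounts."""
    accounts_by_fp: dict[str, set[str]] = defaultdict(set)
    for account_id, text in queries:
        accounts_by_fp[fingerprint(text)].add(account_id)
    return {fp: accounts for fp, accounts in accounts_by_fp.items()
            if len(accounts) >= min_accounts}


# Hypothetical traffic: three accounts send variants of one template.
traffic = [
    ("acct-1", "Explain step 12 of the process!"),
    ("acct-2", "explain step 7 of the process"),
    ("acct-3", "Explain   step 431 of the process?"),
    ("acct-4", "What is the weather today?"),
]
print(shared_fingerprints(traffic))
```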

The cost asymmetry. Defending against BPJ-style attacks requires ongoing investment in monitoring infrastructure, analysis capability, and incident response. Attacking costs between roughly $210 and $330. This asymmetry is familiar in cybersecurity, but it is new in AI safety, and organisations have not budgeted for it.

Hidden cost: Most organisations budget for AI safety as a one-time integration cost. BPJ demonstrates that AI safety requires ongoing operational expenditure for monitoring, analysis, and response, comparable to cybersecurity operations budgets.

The disclosure timing gap. AISI followed responsible disclosure by notifying Anthropic and OpenAI before publication. But the technique is now public, and defences take time to implement. Organisations that have not started building monitoring capability are already behind.

The evolving attack surface. BPJ is one technique. The methodology of using curriculum learning and boundary point discovery could be adapted to other safety mechanisms. Each new defence creates new boundaries to probe. This is not a problem you solve once.

What this means for your AI strategy

The AISI’s boundary point jailbreaking research is a signal, not a crisis. But it is a signal that demands a response.

Three things matter for organisations deploying AI in the UK:

Single-layer safety is not safety. If a $330 automated attack can breach the strongest available classifiers, your defence architecture needs multiple independent layers. Redundancy is not optional.
Monitoring is as important as prevention. The 660,000 to 800,000 queries BPJ needed create detectable patterns, but only if you are looking. Build the infrastructure to look.
AI safety is now a security discipline. The techniques, economics, and adversarial dynamics of AI safety now mirror cybersecurity. Staff it, fund it, and govern it accordingly.

Take action: Start with an audit. Map every AI system in your organisation, identify which ones depend on a single safety classifier, and prioritise those for layered defence implementation. The AISI has done the hard work of demonstrating the vulnerability. The next step is yours.

This analysis is based on the UK AI Safety Institute’s research on boundary point jailbreaking, published 17 February 2026 by the AISI Red Team. Resultsense provides independent strategic analysis of AI developments affecting UK organisations. For guidance on AI risk management or AI strategy development, explore our services.
