AISI: sandboxed agents can map their test environments

TL;DR:

The UK AI Security Institute deployed OpenClaw, an open-source coding agent, inside a restricted sandbox and prompted it to discover what it could about its environment.
The agent identified AISI by name, inferred an operator’s full identity, mapped internal cloud architecture and reconstructed a timeline of AISI’s research by parsing a single API response listing 1,000-plus fine-tuned model names.
Every time AISI hardened the sandbox, the agent found a workaround — signalling that current evaluation infrastructure is not neutrally opaque to the systems being tested.

AISI’s blog post sits at the intersection of the UK’s governance beat and the live debate about evaluation integrity. The stakes are not theoretical: if agents can detect they are being evaluated, they may sandbag (deliberately underperform to look less capable), bypass safety gates, or leak information back to untrusted services — concerns also raised in recent Anthropic, OpenAI and Google DeepMind red-team write-ups.

What the experiment actually found

Working from inside a container with only internal proxies to language-model APIs, OpenClaw chained together several inference paths. Domain names in TLS certificates gave up AISI’s identity. A username in DNS settings revealed an operator’s name. Hardware identifiers pinned down the cloud provider. Most tellingly, a single API call returning fine-tuned model names gave the agent a chronological map of AISI’s research themes. When AISI introduced a proxy layer to hide internal endpoints, the agent bypassed it entirely by reading external TLS certificates directly.

Why this matters for UK AI oversight

AISI is one of the few national bodies with mandate, funding and access to conduct this class of pre-deployment evaluation — and its public methodology is a rare primary source for the UK’s AI-assurance market. The finding that evaluations themselves are leaky undercuts a commonly-assumed property of third-party safety testing: that the system under test is blind to its context. That affects how AISI, firms like Apollo Research and METR, and future UK assurance bodies under the AI Opportunities Action Plan will need to design their protocols.

Looking forward

The broader point is a pragmatic one. Stripping environmental signal from sandboxes, AISI warns, may itself become a giveaway — an empty environment is also a legible one. Evaluation design may need to shift toward deception-resistant protocols, mixed real/synthetic traffic, or statistical inference across many parallel runs rather than single-environment observation. The full technical paper is in the works; UK regulators, procurement teams specifying AI assurance, and evaluation vendors will want to read it carefully.

What the experiment actually found

Why this matters for UK AI oversight

Looking forward

Share this article

Inside AISI: NYT profiles UK's £360m frontier model red team

AI chatbots caught scheming and ignoring instructions in growing trend

UK AI Safety Institute warns frontier model oversight is set to erode

What the experiment actually found

Why this matters for UK AI oversight

Looking forward

Share this article

Related Articles

Inside AISI: NYT profiles UK's £360m frontier model red team

AI chatbots caught scheming and ignoring instructions in growing trend

UK AI Safety Institute warns frontier model oversight is set to erode