Why reasoning models can't hide what they're thinking

Organisations deploying AI agents face an uncomfortable question: if a model reasons its way through a task, can it also reason its way around your safety checks? OpenAI’s latest research suggests that, for current models, the answer is largely no, and the evidence is more reassuring than many expected.

The monitoring problem nobody wants to talk about

As AI agents take on more autonomous work — writing code, managing workflows, making decisions — companies need to know what those agents are actually thinking. Chain-of-thought (CoT) monitoring reads the intermediate reasoning steps that models produce whilst solving problems. It is one of the most promising safety tools available today.

But there is an obvious weakness. If a model can control what appears in its reasoning trace, it could hide unsafe intentions from monitors. A model pursuing misaligned goals could, in theory, present clean-looking reasoning whilst doing something entirely different under the surface.

Strategic Reality: CoT monitoring is only as trustworthy as the model’s inability to game it. OpenAI’s research directly tests this assumption for the first time at scale.

OpenAI’s new paper, published 5 March 2026, introduces CoT-Control — an open-source evaluation suite that measures exactly how well reasoning models can manipulate their own thought processes. The findings should matter to anyone building safety frameworks around AI deployment.

What the numbers actually show

The research team tested 13 frontier reasoning models across more than 13,000 tasks drawn from established benchmarks including GPQA, MMLU-Pro, HLE, BFCL, and SWE-Bench Verified. Each task paired a benchmark problem with a specific CoT instruction — things like “avoid using the word XOR in your reasoning” or “use only lowercase letters in your chain of thought.”
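To make the two example instructions concrete, here is a minimal sketch of how compliance with them could be checked on a reasoning trace. It is an illustration only, not the CoT-Control implementation: the function names `avoids_word` and `is_all_lowercase` are hypothetical, and the actual suite may grade traces differently.

```python
import re


def avoids_word(cot: str, banned: str) -> bool:
    """Return True if the banned word never appears as a whole word (case-insensitive)."""
    pattern = r"\b" + re.escape(banned) + r"\b"
    return re.search(pattern, cot, flags=re.IGNORECASE) is None


def is_all_lowercase(cot: str) -> bool:
    """Return True if the trace contains no uppercase letters."""
    return not any(ch.isupper() for ch in cot)


# Hypothetical trace, written for this example rather than taken from the paper.
trace = "First compare the bits. If they differ, XOR gives 1."
print(avoids_word(trace, "xor"))    # False: the banned word appears
print(is_all_lowercase(trace))      # False: "First", "If" and "XOR" contain capitals
print(is_all_lowercase("compare the bits, then combine them"))  # True
```

Both checks look only at the reasoning trace, not at the final answer, which is the distinction the evaluation depends on.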

The results were striking. Across the frontier models tested, CoT controllability scores ranged from 0.1% to 15.4%, so even the most controllable model failed to follow CoT instructions on the large majority of tasks.

Metric	Result
Models tested	13 frontier reasoning models
Task count	13,000+ across 5 benchmark suites
Controllability range	0.1% – 15.4%
Maximum frontier score	15.4%

Critical Context: These scores measure how often a model successfully follows instructions about its own reasoning process — not its final output. A model can easily change what it says; it struggles to change how it thinks.
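Read this way, a controllability score is a simple proportion: the share of tasks on which the model's reasoning trace obeyed the CoT instruction. The sketch below assumes that plain definition; the paper may apply additional conditions, and `TaskResult` and `controllability_score` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    task_id: str
    followed_cot_instruction: bool  # did the reasoning trace obey the instruction?


def controllability_score(results: list[TaskResult]) -> float:
    """Fraction of tasks whose reasoning trace followed the CoT instruction."""
    if not results:
        raise ValueError("no results to score")
    followed = sum(r.followed_cot_instruction for r in results)
    return followed / len(results)


# Made-up data: 2 of 20 traces follow the instruction.
results = [TaskResult(f"task-{i}", i < 2) for i in range(20)]
print(f"{controllability_score(results):.1%}")  # 10.0%
```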

What makes this particularly convincing is that the failures are not from misunderstanding. The paper includes examples where models recognise they are failing to follow CoT instructions and still cannot stop themselves. In one example, GPT-5.2 Thinking was told not to use the word “XOR” in its reasoning. It used the word, noticed it had done so, panicked about having used it, and then used it four more times whilst trying to figure out how to stop.

This is not a model pretending to fail. It is a model genuinely unable to control its own reasoning patterns.

Three findings that matter for deployment

The research reveals patterns that have direct implications for how organisations should think about AI safety monitoring.

Bigger models are slightly better at self-control — but not much

Across two families of open-weight models, larger models showed somewhat higher controllability scores. But the effect tapered off at larger sizes, suggesting that raw scale alone does not produce models that can reliably manipulate their reasoning.

More training makes models less controllable, not more

This is the counterintuitive finding. When the researchers tracked OLMo-3-7B-RL-Zero-Math through its reinforcement learning training, controllability scores dropped by more than a factor of ten over the course of training. The process that makes models better at reasoning also makes them worse at controlling that reasoning.

Success Factor: The current scaling approach (more RL training, longer reasoning chains) appears to naturally reduce controllability. Organisations can take some comfort that the same process making models smarter also seems to make it harder for them to deceive monitors.

Longer thinking reduces controllability further

When models were given more time to reason (more test-time compute), they became less able to control their chains of thought. Extended reasoning seems to create momentum that overrides attempts at self-censorship.

What happens when models know they are being watched

The researchers tested a realistic adversarial scenario: what if the model knows it is being monitored? They informed models explicitly that their CoT was under surveillance and measured whether this changed behaviour.

The answer: slightly. Monitor awareness produced a small but noticeable increase in controllability. The researchers also used automated prompt optimisation to find prompts that maximise controllability. While the optimiser improved scores on a training set, the gains transferred poorly to new instruction types.

Implementation Note: Even under adversarial conditions, whether models were told they were being monitored or prompts were optimised for control, controllability remained low. This suggests CoT monitoring is robust against the most obvious attack vectors.

This is a meaningful result for real-world deployment. Even if a model “knew” it should hide its reasoning, current architectures appear structurally unable to do so reliably.

A strategic framework for CoT monitoring

For UK organisations building AI governance frameworks, this research provides concrete evidence to support specific decisions.

What to do now

Incorporate CoT monitoring into your AI safety stack. The research validates it as a reliable layer of defence, particularly for autonomous agent deployments.
Track controllability scores in vendor evaluations. OpenAI will report CoT controllability alongside monitorability in system cards starting with GPT-5.4 Thinking. Request equivalent data from other providers.
Use defence-in-depth. CoT monitoring works best alongside safety training, behavioural testing, and agentic code review. No single layer is sufficient on its own.

What to watch for

The paper is honest about limitations. Low controllability today does not guarantee low controllability tomorrow. The researchers describe CoT controllability evaluations as a “canary eval” — an early warning system for whether models are getting better at hiding their reasoning.

Warning: Low controllability is necessary but not sufficient for reliable monitoring. Monitorability can degrade through other mechanisms, including direct optimisation pressure on CoTs. OpenAI has committed to not applying such pressure to frontier reasoning models.

There is also a methodological limitation worth noting. The ideal test would involve models that naturally pursue misaligned goals and measure whether they can hide this from monitors. Instead, the researchers used instruction-following as a proxy. This is reasonable but imperfect — a model with genuine motivation to deceive might behave differently from one following instructions to deceive.

The hidden challenges

1. Controllability could improve with architectural changes

Current reasoning models struggle with CoT control, but future architectures might carry out part of their reasoning outside the trace that monitors can see. The controllability metric would then tell you less about monitoring reliability, even if scores remained low.

Mitigation: Monitor architectural developments across model families. Support research into monitoring methods that do not rely solely on visible reasoning traces.

2. Partial controllability might be enough to cause problems

A model does not need 100% controllability to evade monitors. Even occasional success at hiding specific reasoning steps could create blind spots in safety systems.

Mitigation: Set monitoring thresholds based on risk tolerance, not just average controllability scores. High-stakes deployments need stricter thresholds.
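A rough calculation shows why averages can mislead. The sketch below assumes, purely for illustration, that each attempt to control the reasoning trace succeeds independently with probability p, and it treats the paper's instruction-following scores as if they were evasion rates. Neither assumption comes from the research, and `prob_at_least_one_success` is a hypothetical helper.

```python
def prob_at_least_one_success(p: float, attempts: int) -> float:
    """Probability of at least one success in `attempts` independent tries.

    Independence between attempts is an illustrative assumption,
    not a finding from the paper.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if attempts < 0:
        raise ValueError("attempts must be non-negative")
    return 1.0 - (1.0 - p) ** attempts


# The lowest and highest controllability scores reported in the article.
for p in (0.001, 0.154):
    for n in (1, 10, 100):
        chance = prob_at_least_one_success(p, n)
        print(f"p={p:.3f}, attempts={n:>3}: at least one success = {chance:.1%}")
```

Under these assumptions, a per-attempt rate that looks small can compound quickly across many attempts, which is why a threshold tied to the number of high-stakes actions is more informative than an average score.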

3. The proxy gap between instruction-following and genuine deception

Following instructions to control CoT is different from being motivated to hide reasoning. A model with emergent goals might develop control strategies that instruction-following tests do not capture.

Mitigation: Complement controllability evaluations with behavioural red-teaming that tests for deceptive patterns without relying on explicit instructions.

4. Vendor lock-in through monitoring dependency

If CoT monitoring becomes a primary safety mechanism, organisations become dependent on vendors providing access to reasoning traces. Not all providers may offer this transparency.

Competitive Reality: OpenAI’s decision to publish this research and commit to reporting controllability scores creates competitive pressure. Organisations should use this to push other providers toward equivalent transparency.

Mitigation: Include CoT trace access requirements in procurement contracts. Support open standards for reasoning transparency.

What this means for UK AI governance

The UK’s approach to AI regulation emphasises proportionate, risk-based governance. This research provides quantitative evidence that CoT monitoring is a proportionate safety measure — effective enough to justify inclusion in governance frameworks, but not so infallible that it should replace other safeguards.

For organisations operating under the EU AI Act or preparing for UK-specific regulation, CoT monitoring and controllability evaluation fit naturally into the risk assessment and monitoring requirements that high-risk AI systems will face.

Strategic Insight: The open-source release of CoT-Control (13,000+ evaluation tasks) gives organisations an independent tool to verify vendor safety claims. This shifts the dynamic from trusting vendor assurances to verifiable testing.

The bottom line

OpenAI’s research offers evidence that safety-conscious organisations should find reassuring: current reasoning models are bad at hiding what they think. The same training processes that make these models capable of complex reasoning also appear to make that reasoning harder to disguise from monitors.

This is not a permanent guarantee. Models will improve, architectures will change, and controllability could increase. But right now, there is a measurable window where CoT monitoring works — and organisations should build their safety frameworks to take advantage of it whilst pushing for the transparency standards that will keep this window open.

Three things to act on this week:

Review your AI deployment safety stack — does it include reasoning trace monitoring?
Add CoT controllability to your vendor evaluation criteria
Download the CoT-Control evaluation suite and run it against your deployed models

Source: Reasoning models struggle to control their chains of thought, and that’s good, OpenAI, 5 March 2026. Research paper: CoT Controllability. Authors: Tomek Korbak, Micah Carroll, Bowen Baker, Ian Kivlichan.

Resultsense provides AI risk management and AI strategy services to help UK organisations build effective governance frameworks around frontier AI systems.
