Microsoft Research: frontier AI loses 25% of document content in long tasks

TL;DR:

  • A Microsoft Research preprint titled “LLMs Corrupt Your Documents When You Delegate” finds that frontier models — Gemini 3.1 Pro, Claude 4.6 Opus and GPT 5.4 — lose on average 25 per cent of document content over 20 delegated interactions, with all-model average degradation closer to 50 per cent.
  • Of 52 professional domains tested under the new DELEGATE-52 benchmark, only one (Python programming) cleared the researchers’ 98-per-cent “ready” bar; in 80 per cent of model–domain combinations, scores fell below 80 per cent — what the paper calls catastrophic corruption.
  • Resultsense view: this is the first Microsoft-published benchmark that directly contradicts the marketing pitch for agentic AI, and it lands as Deloitte data shows organisations now spend 36 per cent of their digital budgets on AI automation — most of which depends on the model behaviour these results undermine.

Researchers Philippe Laban, Tobias Schnabel and Jennifer Neville set out to study how large language models handle multistep knowledge work. DELEGATE-52 simulates long workflows across 52 domains from coding to crystallography to music notation, with the model splitting and remerging documents over 20 sequential interactions.
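To make the setup concrete, here is a minimal sketch of a long-horizon round-trip check. It is not the DELEGATE-52 harness: the line-survival metric, the function names and the toy `lossy_round_trip` stand-in for a model are all assumptions made for illustration.

```python
from typing import Callable, List


def retention(original: str, current: str) -> float:
    """Fraction of the original document's lines still present (an assumed, simplistic metric)."""
    original_lines = original.splitlines()
    if not original_lines:
        return 1.0
    surviving = set(current.splitlines())
    kept = sum(1 for line in original_lines if line in surviving)
    return kept / len(original_lines)


def run_round_trips(document: str, round_trip: Callable[[str], str], rounds: int = 20) -> List[float]:
    """Apply `round_trip` repeatedly and record retention after each interaction."""
    scores = []
    current = document
    for _ in range(rounds):
        current = round_trip(current)
        scores.append(retention(document, current))
    return scores


def lossy_round_trip(text: str) -> str:
    """Hypothetical stand-in for a model call: silently drops the last line of the document."""
    lines = text.splitlines()
    return "\n".join(lines[:-1]) if len(lines) > 1 else text


if __name__ == "__main__":
    doc = "\n".join(f"line {i}" for i in range(40))
    scores = run_round_trips(doc, lossy_round_trip, rounds=20)
    print(f"after 2 interactions: {scores[1]:.2f}")
    print(f"after 20 interactions: {scores[-1]:.2f}")
```

In a real evaluation, `round_trip` would wrap a model call that performs a task and its inverse (for example, split then remerge), and the metric would need to be domain-aware rather than a simple line count.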

What the results actually say

Stronger models are no better at avoiding small errors; they delay critical failures, which then arrive in fewer, larger drops. The Microsoft team observed losses of 10 to 30 points in a single round-trip rather than steady accumulation. Best-in-class Gemini 3.1 Pro cleared the readiness bar for just 11 of 52 domains; the rest of the frontier lineup fared worse.
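The difference between steady accumulation and sudden failure can be illustrated with two invented score trajectories that finish at the same point. The numbers below are illustrative assumptions, not figures from the paper; only the idea of a large single-round drop comes from the findings described above.

```python
from typing import List


def steady_decay(rounds: int = 20, total_loss: float = 25.0) -> List[float]:
    """Score after each interaction when the loss is spread evenly across all rounds."""
    per_round = total_loss / rounds
    return [100.0 - per_round * (i + 1) for i in range(rounds)]


def sudden_drop(rounds: int = 20, drop_at: int = 12, drop: float = 25.0) -> List[float]:
    """Score after each interaction when the whole loss lands in one round (hypothetical round 12)."""
    return [100.0 if (i + 1) < drop_at else 100.0 - drop for i in range(rounds)]


def largest_single_loss(scores: List[float]) -> float:
    """Biggest one-round fall, measured from a starting score of 100."""
    previous = [100.0] + scores[:-1]
    return max(p - s for p, s in zip(previous, scores))


if __name__ == "__main__":
    for name, scores in (("steady", steady_decay()), ("sudden", sudden_drop())):
        print(name, "round 2:", scores[1], "round 20:", scores[-1],
              "largest single loss:", largest_single_loss(scores))
```

Both trajectories end at 75, but the sudden-drop one still reads 100 at round 2, which is why a short evaluation cannot distinguish a reliable workflow from one that is about to fail.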

Agentic harnesses did not help. Giving the same models file read/write and code-execution tools made performance worse by an average of 6 percentage points across GPT 5.4, 5.2, 5.1 and 4.1, the opposite of the implicit pitch in commercial agent products.

How it pairs with the UK regulatory mood

The findings land in the same week that the Bank of England’s PRA warned of “quite significant disruption” to UK financial services from new frontier models, and that the NCSC published a ten-question checklist to work through before using AI to discover vulnerabilities. The recurring theme is operational caution rather than capability scepticism: the models are powerful enough to matter, but not yet reliable enough to delegate to without oversight.

Looking forward

Expect agentic-AI vendors to respond with longer-horizon benchmarks of their own, and expect enterprise procurement to start asking for them by name. UK enterprises evaluating AI agents for back-office work should treat two-interaction demos as effectively meaningless: Microsoft’s data shows performance at 20 interactions bears no useful relationship to performance at two. The honest benchmark for “ready to delegate” is now a 20-interaction floor, and a clear plan for the inevitable failures when that floor is breached.