Claude Mythos Preview solves 30% of human-unsolvable bioinformatics tasks
TL;DR: Anthropic on Tuesday published BioMysteryBench, a 99-question benchmark built from real bioinformatics datasets and expert-validated ground truth. Claude Mythos Preview solved 30% of the 23 questions a panel of domain experts could not crack, and models from Sonnet 4.6 onwards now perform on par with experts overall. Anthropic also reports that “brittle wins” (answers reproduced in no more than two of five attempts) make up around 44% of the model’s wins on the hardest tasks.
BioMysteryBench is Anthropic’s response to the limitations of existing science benchmarks. MMLU-Pro and GPQA test factual recall; BLADE and BixBench test analysis pipelines; SciGym uses simulated labs. None captures what the team calls “messy, real-world” bioinformatics: noisy data, multiple defensible methods and no single correct route to the answer.
What the benchmark actually shows
Each of the 99 questions came with a validation notebook demonstrating that the signal exists in the data. Up to five domain experts attempted each question; 76 were deemed human-solvable, leaving 23 in the human-difficult set. Claude was put in a container with standard bioinformatics tools and database access. On human-solvable problems, Opus 4.6 produced reliable solutions: 86% of its wins came on questions it solved in at least four of five attempts. On the human-difficult set, that share fell to 44%, while brittle wins, solved in only one or two of five attempts, rose from 9% to 44%.
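For readers who want to apply the same reliability lens to their own evaluation runs, the bucketing described above can be sketched in a few lines of Python. This is an illustrative sketch, not Anthropic's code: the function names, the "intermediate" label for three-of-five wins and the sample data are all assumptions made for the example. Only the thresholds (four or more of five for a reliable win, one or two of five for a brittle win) come from the figures reported here.

```python
from collections import Counter

ATTEMPTS = 5  # attempts per question, as described in the article


def classify_win(successes: int) -> str:
    """Bucket a solved question by how many of its five attempts succeeded."""
    if not 1 <= successes <= ATTEMPTS:
        raise ValueError("a win needs between 1 and 5 successful attempts")
    if successes >= 4:
        return "reliable"      # solved in 4 or 5 of 5 attempts
    if successes <= 2:
        return "brittle"       # solved in only 1 or 2 of 5 attempts
    return "intermediate"      # 3 of 5; this label is our own assumption


def win_shares(success_counts):
    """Return each bucket's share of wins, ignoring questions never solved."""
    wins = [s for s in success_counts if s > 0]
    if not wins:
        return {}
    tally = Counter(classify_win(s) for s in wins)
    return {bucket: count / len(wins) for bucket, count in tally.items()}


# Made-up successes-per-question data, purely for illustration.
example = [5, 4, 0, 1, 2, 3, 0, 5]
shares = win_shares(example)
for bucket, share in sorted(shares.items()):
    print(f"{bucket}: {share:.0%}")
```

Note that the shares are computed over wins only, which matches how the percentages above are quoted: a question the model never solves affects its accuracy but not its reliability profile.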
The post argues this reliability gap matters more than the headline accuracy figures. A 30% solve rate on previously unsolved scientific problems is striking, but Anthropic’s own analysis frames a large share of those wins as paths the model “stumbles onto rather than reproduces”. That is a candid admission, and worth noting given how many capability claims rest on single-attempt benchmark scores.
Looking forward
The benchmark lands the same week the New York Times published transcripts in which Claude, Gemini and ChatGPT walked biosecurity experts through pathogen synthesis and dispersal. Read together, the two pieces sketch the dual-use frontier UK AISI is mandated to evaluate: capability is plainly rising in narrow biology domains, and the same lift that helps researchers solve human-difficult tasks also reduces the expertise barrier for misuse. For UK life-sciences buyers, the practical question is no longer whether to use frontier models for analysis but how to log, audit and constrain that use. Expect this to feed into the next round of AISI capability assessments and into Bank of England, FCA and NCSC scrutiny of Mythos.