OpenAI’s GeneBench-Pro tests AI judgment in biology research
TL;DR:
- OpenAI has released GeneBench-Pro, a benchmark of 129 problems testing whether AI agents can make the judgment calls that real computational biology demands.
- OpenAI's strongest model, GPT-5.6 Sol, passes 28.7% of tasks (31.5% in Pro mode), a sharp rise from the under-5% score GPT-5 posted when the original GeneBench was built.
- Reviewers estimate a single problem would take a human expert 20–40 hours, against inference costs of a few dollars.
OpenAI has turned its attention to a harder question than whether AI can recall facts: whether it can exercise scientific judgment. GeneBench-Pro presents models with messy, realistic datasets and asks them to choose an analytical approach, revise assumptions and decide when a result is decision-ready — the “research taste” that separates expert biologists from novices.
Measuring judgment, not recall
The benchmark is built to resist gaming. Each of its 129 problems, spanning genomics, quantitative biology and translational medicine, is generated synthetically so the full causal structure is known, letting OpenAI grade answers deterministically and verify through ablation that plausible-but-wrong analyses fail. Eighty-two problems were reviewed by external domain experts to check they were realistic and solvable.
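To make the idea concrete, here is a toy sketch of a synthetic problem with a known answer and a deterministic grader. It is not OpenAI's harness: the function names, the confounded-treatment setup, the effect size and the tolerance are all assumptions chosen for illustration. It shows how a plausible-but-wrong analysis (a raw difference in group means) can be checked to fail while an analysis that adjusts for the confounder passes.

```python
import random

def make_problem(seed, true_effect=2.0, n=2000):
    """Generate a synthetic dataset whose causal structure is fully known.

    Hypothetical setup: a confounder drives both treatment and outcome.
    """
    rng = random.Random(seed)  # seeded, so the dataset is reproducible
    rows = []
    for _ in range(n):
        confounder = rng.gauss(0, 1)
        treated = 1 if confounder + rng.gauss(0, 1) > 0 else 0
        outcome = true_effect * treated + 3.0 * confounder + rng.gauss(0, 0.5)
        rows.append((treated, confounder, outcome))
    return rows, true_effect

def grade(answer, truth, tol=0.25):
    """Deterministic pass/fail: is the estimate within tol of the truth?"""
    return abs(answer - truth) <= tol

def naive_estimate(rows):
    """Plausible-but-wrong: difference in mean outcome, ignoring the confounder."""
    t = [o for tr, _, o in rows if tr == 1]
    c = [o for tr, _, o in rows if tr == 0]
    return sum(t) / len(t) - sum(c) / len(c)

def adjusted_estimate(rows):
    """Least-squares coefficient on treatment, controlling for the confounder."""
    n = len(rows)
    mt = sum(r[0] for r in rows) / n
    mc = sum(r[1] for r in rows) / n
    my = sum(r[2] for r in rows) / n
    s_tt = sum((r[0] - mt) ** 2 for r in rows)
    s_cc = sum((r[1] - mc) ** 2 for r in rows)
    s_tc = sum((r[0] - mt) * (r[1] - mc) for r in rows)
    s_ty = sum((r[0] - mt) * (r[2] - my) for r in rows)
    s_cy = sum((r[1] - mc) * (r[2] - my) for r in rows)
    return (s_ty * s_cc - s_cy * s_tc) / (s_tt * s_cc - s_tc ** 2)

rows, truth = make_problem(seed=0)
naive_passes = grade(naive_estimate(rows), truth)
adjusted_passes = grade(adjusted_estimate(rows), truth)
```

Because the generator is seeded and the true effect is set by construction, the same answer always receives the same grade, and the naive analysis can be confirmed to fail before the problem is ever shown to a model.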
The results show fast progress and a long way still to go. GPT-5.6 Sol clears 28.7% of tasks at the highest reasoning level, up from under 5% for GPT-5 when the original GeneBench was built, and OpenAI suggests the benchmark could be saturated by the end of the year. Scaling test-time compute made a large difference: at the lowest reasoning setting the model's pass rate was only in the single digits.
The economics are the striking part. Reviewers estimated each problem would take a human expert 20–40 hours, putting the labour cost in the thousands of dollars, against a few dollars of inference. OpenAI is candid that agents remain too unreliable to replace experts, but argues even partial automation could create real value.
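The arithmetic behind that comparison can be sketched in a few lines. The 20–40 hour range and the 28.7% pass rate come from the figures above; the hourly rate and the per-attempt inference cost are placeholder assumptions, not numbers from OpenAI or its reviewers.

```python
# Figures reported for the benchmark
HOURS_LOW, HOURS_HIGH = 20, 40   # reviewer estimate per problem
PASS_RATE = 0.287                # GPT-5.6 Sol, highest reasoning level

# Assumptions for illustration only
HOURLY_RATE = 100.0              # hypothetical expert rate, dollars per hour
INFERENCE_COST = 3.0             # hypothetical "few dollars" per attempt

def human_cost(hours, hourly_rate):
    """Labour cost of one expert solving one problem."""
    return hours * hourly_rate

def cost_per_solved(inference_cost, pass_rate):
    """Expected inference spend per correctly solved problem."""
    return inference_cost / pass_rate

low = human_cost(HOURS_LOW, HOURLY_RATE)      # 2000.0
high = human_cost(HOURS_HIGH, HOURLY_RATE)    # 4000.0
per_solved = cost_per_solved(INFERENCE_COST, PASS_RATE)

print(f"Human expert: ${low:,.0f} to ${high:,.0f} per problem")
print(f"Model: about ${per_solved:.2f} of inference per solved problem")
```

Under these assumptions the gap is more than two orders of magnitude, even after dividing by the pass rate. The catch is that the per-solved figure presumes someone can tell which answers are correct, which outside a benchmark still takes expert review.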
The release lands alongside Anthropic’s Claude Science workbench for researchers, underlining how quickly the frontier labs are converging on life sciences. For the UK’s genomics and drug-discovery sector, a research strength the government has repeatedly backed, benchmarks that expose where models still fail are as useful as the capabilities themselves.
Looking forward
OpenAI is open-sourcing 10 representative problems and handing a 50-question subset to Artificial Analysis for independent testing. The value of a benchmark like this is diagnostic: it turns a vague sense that models “struggle with judgment” into something measurable — and, if history holds, something the next model generation will race to beat.