A general-purpose AI model, with no chemistry-specific training, has matched the specialist software that synthetic chemists have paid to license for decades — and on one task it did something that software still leaves to humans. The result comes from a white paper Anthropic published on 5 June, in which the company tested its models against the standard tools chemists use to interpret nuclear magnetic resonance (NMR) spectra. For UK organisations whose value depends on R&D — pharmaceuticals, materials, agritech, speciality chemicals — the chemistry is less important than the pattern it reveals: frontier models are crossing from generic productivity into the technical core of expert knowledge work, and the economics of the specialist tools that protect that core are starting to move.
The business problem hiding in a spectrum
Every small molecule a company makes — a drug candidate, a pesticide, a dye, a polymer — exists because a chemist first proved what it was. Molecules cannot be seen, so chemists infer structure from spectra: they probe a sample with magnetic fields or light and read the pattern that comes back. NMR is the canonical technique, and interpreting it is one of the most time-consuming steps in synthetic chemistry. For every compound, a chemist matches each peak in the spectrum to an atom in the proposed structure, by hand.
That work is a tax on every R&D programme, and it scales badly. The largest chemistry registry, CAS, already catalogues more than 290 million disclosed substances and grows by roughly 15,000 a day. Translating between the many ways a molecule is represented — a sketch on a whiteboard, an instrument readout, a database query string, the notation buried in a patent — is slow, specialised, and impossible to keep up with at scale.
Strategic Reality: The bottleneck here is not a shortage of cleverness. It is the volume of routine translation, recall and cross-checking that sits between a chemist’s judgement and a confirmed result. That is precisely the kind of work that determines how fast an R&D pipeline moves — and precisely the kind that has been hard to automate.
AI has been described as transformative for chemistry for years, yet adoption has lagged. Machine-learning tools for retrosynthesis and reaction prediction have existed for a decade, but the data they depend on is sparse, inconsistently formatted, and locked behind subscription journals. The telling detail is that the average academic or small-lab chemist still does not use them. The capability existed; the practical fit did not.
What actually happened in the test
Anthropic measured three of its models against two industry-standard packages, ChemDraw and MestReNova, on 20 compounds pulled from chemistry preprints published after the models’ training cut-off — a deliberate guard against the models having memorised the answers. The compounds spanned four structural families, each chosen to pose a different kind of NMR challenge.
On the forward task — predicting where each hydrogen and carbon peak should fall from a known structure — the strongest model, Opus 4.7, came in at an average hydrogen error of ±0.079 ppm, well under half the tolerance a chemist would accept. On carbon it was effectively level with MestReNova (±1.37 against ±1.48 ppm). Where the gap opened up was on the finer features chemists read alongside peak position: the Claude models predicted the spacing between sub-peaks to within half a hertz around 80% of the time, against 26 to 35% for the specialist software.
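The two kinds of figure quoted above are simple to reproduce on any set of paired predictions. The sketch below assumes the shift figure is a mean absolute error and that "within half a hertz" is an inclusive cut-off; the white paper's exact definitions may differ. The function names and sample numbers are invented for illustration and are not data from the study.

```python
from statistics import mean


def mean_abs_error(predicted, observed):
    """Mean absolute difference between paired values (ppm for chemical shifts)."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must be the same length")
    return mean(abs(p - o) for p, o in zip(predicted, observed))


def fraction_within(predicted, observed, tolerance):
    """Share of paired values agreeing to within `tolerance` (Hz for couplings)."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must be the same length")
    hits = sum(1 for p, o in zip(predicted, observed) if abs(p - o) <= tolerance)
    return hits / len(predicted)


# Hypothetical values for illustration only, not taken from the white paper.
shifts_predicted = [7.26, 3.71, 1.25, 2.10]   # ppm
shifts_observed = [7.30, 3.65, 1.30, 2.00]    # ppm
couplings_predicted = [7.1, 1.8, 15.9, 8.0, 2.5]  # Hz
couplings_observed = [7.4, 2.0, 16.0, 9.0, 2.4]   # Hz

shift_error = mean_abs_error(shifts_predicted, shifts_observed)
coupling_hit_rate = fraction_within(couplings_predicted, couplings_observed, 0.5)

print(f"Mean shift error: {shift_error:.4f} ppm")
print(f"Couplings within 0.5 Hz: {coupling_hit_rate:.0%}")
```

An average and a hit rate answer different questions: the first says how far off predictions are in general, while the second says how often a prediction is close enough to use, which is why the article reports both.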
Critical Context: The headline is not that AI matched the tools on their home turf, though it did. It is that a general-purpose model — no chemistry fine-tuning, no curated molecular database — drew level with software purpose-built for this single job over many years. The competitive line between horizontal AI and vertical specialist tools is thinner than it looked.
The more striking result came from running the problem backwards. Structure elucidation — proposing the molecule from its spectrum rather than the spectrum from its molecule — is the harder direction, and the one existing software hands back to the chemist. Given 15 elucidation problems, Opus 4.7 recovered all eight simpler structures on every attempt from the spectra and molecular formula alone. On the seven denser targets, supplied with a single hint about the starting material, it returned the correct structure on all three runs for four of them and on two of three runs for the rest.
| Capability | Specialist software today | What the test showed |
|---|---|---|
| Forward prediction (structure → spectrum) | The established strength | General-purpose model drew level on average |
| Spacing between sub-peaks | 26–35% within half a hertz | Claude models around 80% within half a hertz |
| Inverse elucidation (spectrum → structure) | Needs 2D NMR, licensed tools, specialist training | Done from a 1D peak list and formula, no setup |
| Auditable reasoning | Limited | Step-by-step working a chemist can check |
Why this is more than a chemistry result
The instinct for a non-specialist reader is to file this under “impressive lab demo” and move on. That instinct misses the strategic content, which lives in three details rather than the benchmark scores.
The first is that the data problem was never solved. Anthropic is explicit that the scarcity, inconsistency and paywalling that held chemistry AI back for years still exist. What changed is that today’s frontier models are multimodal and can reason explicitly — so they read a structure straight from a journal figure or a hand sketch, parse a methods section in the form it was actually published, and show their working. None of that fixes the data shortage. It changes which problems are tractable despite it. That is the more durable lesson for any technical field still waiting for clean datasets before it adopts AI: the wait may be the wrong strategy.
Competitive Reality: Dedicated structure-elucidation software has existed for decades, but it typically demands 2D NMR, licensed tools and specialist training. The frontier model did the job from the same high-resolution mass spectrum and 1D peak list a chemist would paste into a chat. When a capability that required a procurement decision and a trained operator collapses into “paste it into a conversation,” the access economics shift underneath the incumbent.
The second detail is the auditability. The model shows its reasoning step by step, which means the chemist can check the output rather than trust it. In regulated R&D — where a structural assignment may end up in a patent or a regulatory filing — that is not a nice-to-have. It is the feature that makes the tool usable at all, and it is worth naming because it points to where AI lands first in serious technical work: as a checkable assistant, not an oracle.
The third is the honesty of the claim. Anthropic frames its own conclusion as “modest”: the model is starting to help with the daily translation, recall and integration work that complements a chemist’s judgement. It does not replace that judgement. The restrained framing is the accurate reading: this is a human-in-the-loop result, and the organisations that read it as “AI does chemistry now” will draw the wrong operational conclusion.
Why it lands on UK organisations
The UK has an unusually large stake in this particular shift. Life sciences, pharmaceuticals, speciality chemicals and advanced materials are among the country’s strongest R&D sectors, spread across large players, university spin-outs and small labs. The adoption gap the paper describes — capable tools that the average small-lab chemist still does not use — is a British problem as much as anyone’s, and it is one general-purpose models are positioned to close, because they remove the setup, the licence and the specialist training that kept the earlier tools on the shelf.
| Stakeholder | What the shift means for them |
|---|---|
| R&D-intensive UK firms | A slice of expert translation work becomes faster and cheaper, but only with chemist oversight built in |
| Small labs and university spin-outs | Capability that needed licensed software and training becomes reachable from a chat interface |
| Specialist software vendors | A horizontal tool now competes on tasks that were a protected vertical niche |
| R&D and innovation leaders | The question moves from “can AI do this” to “which expert tasks do we re-engineer around it, and how do we govern that” |
Reality Check: This is a 20-compound study across four scaffold families — a strong signal, not a settled fact. Anthropic itself flags that it would want several hundred compounds across 20–30 structural classes, plus untested solvents and 2D experiments, before treating the numbers as robust. Read it as a direction of travel that warrants a pilot, not a procurement decision you make this quarter.
What UK R&D leaders should actually do
The response is not to rip out specialist software or to wait for a definitive benchmark that may be years away. It is to treat general-purpose models as a serious candidate for specific, checkable technical tasks, and to build the oversight that makes them safe to rely on.
Foundational — for organisations early in applying AI to technical work:
- Run a scoped pilot on a task you can verify. Pick a recurring, auditable step — spectral interpretation, literature reconciliation, notation translation — and benchmark a frontier model against your current tool on your own compounds, not someone else’s.
- Keep the expert in the loop by design. The result that matters is the auditable one. Build review into the workflow so the chemist checks reasoning, rather than the model issuing answers unchecked.
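A scoped pilot of this kind can be scored with very little tooling. The sketch below is one possible shape, assuming each compound already has a chemist-confirmed reference assignment and a per-compound error for both the current tool and the candidate model. The class name, field names, compound IDs, error values and the `acceptable_ppm` threshold are all hypothetical; the threshold in particular is a judgement for your own chemists to set.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class CompoundResult:
    compound_id: str
    tool_error_ppm: float   # current tool's error against the chemist's assignment
    model_error_ppm: float  # candidate model's error on the same compound


def summarise_pilot(results, acceptable_ppm):
    """Compare the two approaches and list compounds a chemist should review first."""
    if not results:
        raise ValueError("the pilot needs at least one compound")
    return {
        "tool_mean_ppm": mean(r.tool_error_ppm for r in results),
        "model_mean_ppm": mean(r.model_error_ppm for r in results),
        # Compounds where the model was strictly closer than the current tool.
        "model_wins": sum(1 for r in results if r.model_error_ppm < r.tool_error_ppm),
        # Compounds where the model missed the agreed threshold.
        "needs_review": [r.compound_id for r in results
                         if r.model_error_ppm > acceptable_ppm],
    }


# Hypothetical pilot data for illustration only.
pilot = [
    CompoundResult("C-001", tool_error_ppm=0.10, model_error_ppm=0.08),
    CompoundResult("C-002", tool_error_ppm=0.20, model_error_ppm=0.25),
    CompoundResult("C-003", tool_error_ppm=0.15, model_error_ppm=0.10),
]

summary = summarise_pilot(pilot, acceptable_ppm=0.2)
print(summary)
```

Reporting the `needs_review` list alongside the averages keeps the expert in the loop: a good mean can hide a single compound where the model was badly wrong.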
Developing — for organisations with AI already in R&D processes:
- Reassess specialist-tool spend against general-purpose capability. Where a horizontal model is drawing level, the licence renewal is now a question rather than a formality — but answer it with your own evaluation, not a vendor’s benchmark.
- Treat the data-readiness excuse with suspicion. If your team is deferring AI adoption until datasets are clean, test whether multimodal reasoning makes the task tractable now, despite the data gap.
Advanced — for organisations where R&D outputs carry regulatory or IP weight:
- Write provenance into the workflow. If a model contributes to a structural assignment that reaches a patent or filing, capture its reasoning trail as part of the record, not as an afterthought.
- Map where general-purpose AI changes your competitive cost base. If a rival small lab can now reach capability that previously needed your scale and licences, that erosion of advantage is a strategy question, not an IT one.
Implementation Note: Most of the value here is unglamorous workflow engineering — choosing the right checkable task, wiring in human review, capturing the reasoning trail. The organisations that benefit will not be the ones with the boldest AI narrative. They will be the ones that re-engineered three specific R&D steps and measured the result.
The challenges that will not be obvious
The benchmark-as-verdict trap. A clean result on 20 compounds is easy to over-read. The model’s accuracy on your scaffolds, your solvents and your edge cases is unknown until you test it. Mitigation: validate on your own chemistry before changing any process or spend.
Confident wrong answers. A model that shows fluent reasoning can be persuasive when it is wrong, and the weakest model in the test scattered its guesses badly on a single hard proton. Mitigation: design the workflow so the chemist checks the working, and treat the auditability as the control, not a courtesy.
Quiet capability drift. General-purpose models change between versions, and a result that holds on one release may shift on the next. Mitigation: re-run your validation on model updates, and avoid hard-wiring a process around behaviour you have not re-confirmed.
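Re-running validation on each model update can be reduced to a comparison against a stored baseline. The sketch below is a minimal illustration, assuming every metric is an error where lower is better and that you have agreed in advance how much worsening is tolerable. The metric names, values and allowances are hypothetical.

```python
def drift_report(baseline, current, max_regression):
    """Return the metrics that worsened by more than their allowance.

    All metrics are assumed to be errors, so a higher value means worse.
    A metric with no entry in `max_regression` is allowed no worsening at all.
    """
    regressions = {}
    for name, base_value in baseline.items():
        if name not in current:
            raise KeyError(f"metric {name!r} is missing from the new validation run")
        delta = current[name] - base_value
        if delta > max_regression.get(name, 0.0):
            regressions[name] = delta
    return regressions


# Hypothetical validation results for two model releases.
baseline_run = {"h_shift_error_ppm": 0.08, "c_shift_error_ppm": 1.4}
updated_run = {"h_shift_error_ppm": 0.09, "c_shift_error_ppm": 1.9}
allowance = {"h_shift_error_ppm": 0.02, "c_shift_error_ppm": 0.2}

flagged = drift_report(baseline_run, updated_run, allowance)
for metric, worsening in flagged.items():
    print(f"{metric} worsened by {worsening:.2f}; re-validate before relying on it")
```

An empty result means the new release stayed within the agreed allowances on your own compounds; anything flagged is a reason to pause before the process depends on the updated model.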
The adoption mirage. It is tempting to assume that because the barrier fell, the capability will be used. The earlier tools show otherwise — they existed and went unused for a decade. Mitigation: invest in the workflow and the training that turn an available capability into an adopted one, not just the tool.
The takeaway for decision-makers
The lasting signal from this work is not that one model is good at NMR. It is that a general-purpose frontier model, with none of the specialism the task supposedly required, has drawn level with decades-old dedicated software and reached past it on the harder, inverse problem — without solving the data scarcity that was meant to be the precondition. For UK organisations whose competitiveness runs through R&D, that is a notice that the line between generic AI and specialist technical tooling is moving, and moving in a direction that rewards early, careful experimentation.
Three things make that manageable. First, specificity: pilot on a single checkable task and measure it on your own chemistry, not a published benchmark. Second, oversight: keep the expert checking the reasoning, because the auditable answer is the usable one. Third, realism: read Anthropic’s own “modest” framing as the accurate one. This is AI that complements expert judgement, and the organisations that treat it that way will get further than the ones that oversell it.
Take Action: Identify one recurring, verifiable technical task in your R&D workflow — the kind your specialists describe as necessary but tedious — and commission a four-week pilot benchmarking a frontier model against your current tool on your own data. The point is not to prove AI works. It is to learn, on tasks you can check, where it earns a place and where it does not.
The word that matters in this paper is “starting.” Frontier AI is starting to do serious technical work, under expert supervision, on problems that used to need specialist software and training. The organisations that prosper will be the ones that find out where that is already true for them — before their competitors do.
Source and attribution
This analysis draws on “Making Claude a chemist,” published by Anthropic on 5 June 2026, and the accompanying white paper on NMR spectral analysis. The original work reports the benchmark methodology and results; the strategic implications for UK R&D organisations are Resultsense’s own. Model performance figures are as published by Anthropic.
Resultsense provides UK-focused analysis of artificial intelligence developments for professionals and businesses. We translate technical and research shifts into the practical decisions that boards, leaders and teams actually face.