AI accuracy is a procurement question: a diligence framework for UK enterprise buyers

Perplexity published a rare technical essay last week describing how it post-trains frontier models to behave inside its own product. Most readers will have skimmed it as vendor marketing. UK enterprise buyers should read it as a procurement script. The piece quietly lists the engineering choices that separate a working answer engine from what Perplexity itself calls “a wrapper setup with model and web access”, and these distinctions are exactly the ones procurement teams are not asking suppliers to evidence before signing AI contracts. That gap, between accuracy as a claim and accuracy as a verifiable engineering practice, is where most disappointing UK enterprise pilots end up.

The accuracy claim every vendor makes — and almost no procurement form tests

Every AI vendor selling into UK enterprise tells a similar story. The model is grounded. It cites sources. It hallucinates less than the alternative. Hardly any of these claims arrive with the kind of evidence that a compliance team would recognise from any other category of regulated software. There is no equivalent of the SOC 2 report, the ICO data-protection impact assessment, or the OWASP application security checklist for an AI accuracy claim, at least not yet. So the buyer takes the demo on trust, and the procurement form moves on to commercial terms.

Strategic Reality: When Perplexity’s own engineers describe what makes their product work, they distinguish between a “wrapper setup with model and web access” and a system post-trained for evidence gathering, multi-source synthesis, format-stable accuracy, rubric-based open-ended evaluation, and search discipline. Most enterprise AI tools currently being sold into UK organisations are wrappers, sold using the language of the post-trained product. Buyers who do not know the difference end up paying premium prices for the cheaper engineering.

Critical numbers — the diligence gap

Procurement question	Estimated frequency in UK enterprise AI RFPs
Cite three test sets you measure model accuracy against in production	Under 10%
Demonstrate behaviour on multi-source synthesis questions, not lookups	Under 5%
Show rubric-based evaluation methodology for open-ended outputs	Under 5%
Provide accuracy degradation rates across response formats	Under 5%
Describe your search-discipline policy and when the model stops searching	Almost never
Show your accuracy regression process for model swaps and version updates	Under 15%

Estimates are drawn from Resultsense observations across UK enterprise AI procurement engagements through 2025 and early 2026. Even allowing generous error bars, the central observation is robust: the most consequential accuracy questions are the ones procurement is not asking.

What Perplexity’s two-stage training reveals about wrapper economics

Buried in the Perplexity essay is a specific architectural claim. The team trains its underlying models in two stages. First, the model learns how to behave inside the product, including how to follow instructions, stay consistent, and use tools properly. Then it is trained on harder search tasks so it gets better at finding evidence, using it well, and answering more efficiently. The sequencing is presented matter-of-factly. It is also the difference between Perplexity and a competitor that bolts a model behind a web-search API and ships.

The economic point matters for UK procurement. Post-training a frontier model takes a research team, an evaluation infrastructure, and weeks of compute. A wrapper takes a long weekend. Both can be marketed as “search-augmented AI”. One has compounded the engineering investment that produces consistent accuracy; the other has not. In a properly mature enterprise software market, the price difference between those two products would be a factor of ten. In the current AI market, it is often smaller than the difference between annual and monthly billing.

Three things follow from this for UK buyers.

First, accuracy is a discipline, not a feature. Perplexity’s piece lists at least five ongoing engineering practices: staged post-training, multi-source synthesis training, rubric-based evaluation, format-stable training, and search discipline. Any vendor who talks about accuracy as a property the model “has” rather than as something the team continuously builds is, almost by definition, doing wrapper economics. The right diligence question is not “how accurate is your model” but “what is your team currently measuring, and what changed in the last release”.

Second, accuracy generalises poorly across output formats. Perplexity is explicit that they train across paragraph, list, and table outputs to keep accuracy stable. Most enterprise users will only see one or two formats during a vendor demo. A model that hits 94% accuracy on paragraph answers may hit 71% on tables. UK buyers running pilots should request the same task evaluated across the formats they will actually use in production.

Critical Context: A vendor demo that only shows paragraph responses has not demonstrated accuracy on the formats your finance team, your operations team, or your compliance team will actually use. Insist on the same input, evaluated in the format your team will consume, before commercial terms are agreed.
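One way a pilot team might tabulate this is sketched below. It is a minimal illustration, not a vendor's or Perplexity's method: the function names, the format labels, and the graded results are all hypothetical, and it assumes each answer has already been marked correct or incorrect by a reviewer.

```python
from collections import defaultdict


def accuracy_by_format(results):
    """Compute accuracy per output format.

    `results` is an iterable of (output_format, is_correct) pairs,
    one pair per graded answer.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for output_format, is_correct in results:
        totals[output_format] += 1
        correct[output_format] += int(is_correct)
    return {fmt: correct[fmt] / totals[fmt] for fmt in totals}


def deltas_from_baseline(accuracy, baseline_format="paragraph"):
    """Percentage-point gap between each format and the baseline format."""
    base = accuracy[baseline_format]
    return {fmt: round((acc - base) * 100, 1) for fmt, acc in accuracy.items()}


# Hypothetical graded results: the same four questions answered in each format.
graded = [
    ("paragraph", True), ("paragraph", True), ("paragraph", True), ("paragraph", True),
    ("table", True), ("table", True), ("table", False), ("table", True),
    ("json", True), ("json", False), ("json", False), ("json", True),
]

acc = accuracy_by_format(graded)
gaps = deltas_from_baseline(acc)
print(acc)   # accuracy per format
print(gaps)  # gap in percentage points versus paragraph answers
```

The useful output is the gap column rather than any single accuracy figure: a large negative gap on the format your team actually consumes is the finding to raise with the vendor.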

Third, accuracy and helpfulness are not the same thing. The most striking single line in the Perplexity essay is “an answer has to be correct before it gets credit for being more helpful or better written”. This describes a deliberate engineering choice: gate the helpfulness reward on factual correctness, or the model optimises for “answers that sound better without actually being better”. This failure mode is the one most likely to slip past a procurement evaluation. Human reviewers grading vendor demos will often prefer the polished-but-wrong answer to the awkward-but-right one. Without an accuracy gate inside its own training pipeline, a vendor will tend to optimise for the demo and disappoint in production.
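The gating idea can be shown with a toy scoring function. This is a sketch under stated assumptions, not Perplexity's actual reward design, which the essay does not specify: the equal weighting, the function names, and the example scores are all hypothetical.

```python
def ungated_score(is_correct: bool, helpfulness: float, style: float) -> float:
    """Naive score: correctness is just one of three equally weighted signals."""
    return (float(is_correct) + helpfulness + style) / 3


def gated_score(is_correct: bool, helpfulness: float, style: float) -> float:
    """Gated score: helpfulness and style earn nothing unless the answer is correct."""
    if not is_correct:
        return 0.0
    return (1.0 + helpfulness + style) / 3


# Hypothetical reviewer scores in the range 0 to 1.
polished_but_wrong = dict(is_correct=False, helpfulness=0.9, style=0.95)
awkward_but_right = dict(is_correct=True, helpfulness=0.4, style=0.3)

# Without the gate, the polished wrong answer outscores the correct one.
print(ungated_score(**polished_but_wrong) > ungated_score(**awkward_but_right))

# With the gate, the correct answer always outranks the incorrect one.
print(gated_score(**awkward_but_right) > gated_score(**polished_but_wrong))
```

The same logic applies to a buyer's own demo scorecard: score correctness first, and only compare polish among answers that passed.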

Who owns AI accuracy inside your organisation?

The reason these diligence questions go unasked is rarely technical. It is structural. AI tool procurement in most UK enterprises sits across a gap that no single function fully owns.

Stakeholder	What they currently test	What they don’t test	Result
Procurement	Commercial terms, contract length, SLAs	Accuracy methodology, evaluation rubrics	Vendor signed before accuracy is verified
Information security	Data residency, encryption, access controls	Output reliability, factual correctness	Compliant pipe carrying unreliable content
Data and analytics	Model architecture, integration, latency	User-facing answer quality across formats	Working system, disappointing output
Business sponsor	Demo experience, headline use case	Edge cases, failure modes, format stability	Excited buy, confused rollout
Compliance and legal	Liability terms, audit rights	Whether the audit rights are exercisable	Right to audit something nobody can verify

The success criterion is not adding a new function. It is making one existing function accountable for the accuracy diligence step before contract signature. In Resultsense’s experience the most workable owner is the data and analytics function, because they already operate evaluation infrastructure for internal models. Asking them to apply the same eval discipline to a vendor candidate is a small extension. Asking procurement or legal to do it is a category error.

Implementation Note: The accuracy diligence step should be a named, time-boxed deliverable in the procurement workflow, not an additional column on an existing scorecard. Allocate two to four weeks. Budget for the data and analytics team’s time. Treat it as equivalent in importance to the security review.

Six diligence questions that separate post-trained AI from a wrapper

Here is the framework UK enterprise buyers can apply to any AI vendor making accuracy claims, drawn directly from the engineering distinctions Perplexity made public. Each question has a follow-up that exposes the difference between a real answer and a marketing answer.

1. What is your accuracy methodology, and what changed in your last model update? A real answer names the evaluation framework, references a shareable accuracy regression report, and points to a specific change-log entry. A wrapper answer offers an aggregate accuracy figure with no methodology and no version history.

2. How do you handle multi-source synthesis versus single-source lookup? A real answer presents separate evaluation tracks for each, with measurable performance on both. A wrapper answer asserts that the model “uses multiple sources when needed” without any breakdown of how that is measured.

3. What is your rubric-based evaluation methodology for open-ended outputs? A real answer shares rubrics for the major task types your organisation will use, such as summarise, draft, plan, and explain, with measured rubric-pass rates. A wrapper answer refers to user thumbs-up rates or NPS, which measure preference rather than correctness.

4. Show me your accuracy stability across output formats: paragraph, list, table, structured JSON. A real answer is a four-format evaluation on the same task with deltas published. A wrapper answer is a paragraph-only demo and an offer to follow up.

5. What is your search-discipline policy? A real answer describes when the model stops searching and includes a measured cost-per-correct-answer figure. A wrapper answer is silence, or an indirect reference to “comprehensive search”.

6. What is your accuracy regression process when you swap or upgrade the underlying model? A real answer is a documented regression suite that runs on every model swap, with a no-regression deployment gate. A wrapper answer is “we test the new model before deploying” without further specification.
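For question 6, a no-regression deployment gate can be as simple as a per-track comparison between the current model and the candidate. The sketch below is illustrative only: the track names, pass rates, and zero-tolerance default are hypothetical, and a real suite would sit on top of whatever evaluation harness the vendor already runs.

```python
def regression_gate(baseline, candidate, tolerance=0.0):
    """Compare per-track pass rates for the current and candidate models.

    Returns (ok, regressions), where `regressions` maps each failing track
    to its (baseline_rate, candidate_rate) pair. A track missing from the
    candidate results counts as a failure.
    """
    regressions = {}
    for track, base_rate in baseline.items():
        new_rate = candidate.get(track)
        if new_rate is None or new_rate < base_rate - tolerance:
            regressions[track] = (base_rate, new_rate)
    return (not regressions, regressions)


# Hypothetical pass rates for the model in production and the proposed swap.
baseline = {
    "single_source_lookup": 0.92,
    "multi_source_synthesis": 0.81,
    "table_output": 0.78,
}
candidate = {
    "single_source_lookup": 0.94,
    "multi_source_synthesis": 0.76,
    "table_output": 0.78,
}

ok, regressions = regression_gate(baseline, candidate)
print(ok)           # whether the swap may ship
print(regressions)  # which tracks got worse, and by how much
```

A vendor giving a real answer should be able to show something equivalent: the tracks it measures, the rates before and after the last swap, and the rule that blocks a release.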

By organisational maturity

For UK organisations early in AI adoption, with no existing evaluation capability: pick two of the six questions. Questions 1 and 4 typically yield the highest information per minute spent. Insist on written answers, not slide demos. If a vendor cannot answer either in writing, treat it as a procurement signal in the same category as a missing SOC 2 report.

For UK organisations with mature data and analytics capability: run all six questions, plus a parallel evaluation on your own held-out test set drawn from production data. Make commercial close conditional on the parallel evaluation passing a defined threshold. The threshold itself matters less than agreeing it before the vendor sees the test set.

For UK regulated sectors (financial services, healthcare, public sector): add a seventh question — how do your accuracy guarantees survive a regulator audit? The current answer for almost every vendor is “they don’t, because we don’t make accuracy guarantees in the contract”. That itself is the diligence finding.

Take Action: Before your next AI vendor renewal or signature, ask question 1 in writing and require a written response. Use the response quality as the gating signal for whether to ask the other five.

Four ways the diligence framework still fails

Even buyers who run all six questions will hit four predictable obstacles. These are worth flagging in advance because the failure modes can look like vendor non-cooperation when they are usually structural.

Vendors don’t have answers because they’re not the model author. Most enterprise AI vendors are themselves wrappers around OpenAI, Anthropic, or Google. They genuinely do not know the post-training methodology, because they did not do the training. The honest version of this answer is “we rely on the underlying model provider’s accuracy work, and here is how we measure end-to-end accuracy on top of that”. The dishonest version conflates the two. Buyers should treat the honest version as acceptable and price it accordingly, and treat the dishonest version as disqualifying.

Mitigation: add a sub-question. Are you the trainer, the post-trainer, or the integrator of the model used in your product? The answer determines which of the six questions the vendor can plausibly answer at depth.

The test set leaks. Held-out evaluations only work if the vendor cannot see the test data. Once a vendor knows your test cases, future model updates will be tuned against them. UK buyers running parallel evaluations should rotate test sets across vendors and across review cycles, and treat the test set as confidential procurement data with the same handling as commercial terms.

Mitigation: maintain at least three independent test sets per use case, share at most one with any given vendor, and rotate quarterly.

The accuracy answer arrives in language buyers cannot audit. Vendors will reasonably respond in the technical vocabulary of the field: retrieval precision, k-shot evaluation, chain-of-thought consistency, RAG architecture. UK procurement teams without AI literacy will either approve answers they cannot evaluate or reject answers they do not understand. Both are wrong outcomes.

Mitigation: have one person in the data and analytics function review every vendor’s accuracy response and translate it to a procurement-readable risk rating. This is a one-day job per vendor. Skipping it produces decisions made on vocabulary, not on substance.

Accuracy degrades silently after deployment. A vendor who passes all six questions on signature day can still ship a model update three months later that quietly degrades accuracy on your specific use case. UK buyers should treat accuracy as something to monitor in production, not something to verify once.

Mitigation: build a small monthly accuracy regression check into the vendor management process. It can be as light as 50 sample queries, scored by a rubric, run by the same person who reviewed the original procurement answer. The cost is minimal. The signal it provides is the difference between catching a regression at week three and catching it at month nine.
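A monthly check of this kind might look like the following sketch. The rubric criteria, the 0.90 signature-day baseline, and the five-point alert threshold are hypothetical placeholders; the reviewer would substitute the rubric and baseline agreed during procurement.

```python
from dataclasses import dataclass


@dataclass
class RubricResult:
    """One reviewer-scored sample query. The three criteria are illustrative."""
    query_id: str
    factually_correct: bool
    cites_sources: bool
    answers_question: bool

    def passes(self) -> bool:
        # An answer passes only if every rubric criterion holds.
        return self.factually_correct and self.cites_sources and self.answers_question


def pass_rate(results):
    """Share of sampled answers that pass the rubric."""
    return sum(r.passes() for r in results) / len(results)


def check_month(results, baseline_rate, max_drop=0.05):
    """Flag the month if the pass rate falls more than `max_drop` below baseline."""
    rate = pass_rate(results)
    return {"pass_rate": rate, "alert": rate < baseline_rate - max_drop}


# Hypothetical month: 50 sampled queries, of which the first 9 fail on correctness.
results = [RubricResult(f"q{i}", i >= 9, True, True) for i in range(50)]

report = check_month(results, baseline_rate=0.90)
print(report)
```

With only 50 samples, each query is worth two percentage points, so small month-to-month movements are noise. The alert threshold should be wide enough to ignore those and narrow enough to catch a sustained drop.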

Hidden Cost: The cost of an undetected accuracy regression is rarely the AI tool itself. It is the downstream rework when teams discover that decisions, drafts, or analyses produced over the past quarter were based on quietly degraded output. UK buyers should size that downstream rework cost into their accuracy monitoring budget.

Treat accuracy like security: a continuous practice, not a tick-box

The single transferable lesson from Perplexity’s essay is that accuracy in production AI is not a model property. It is a practice, a set of disciplined choices about what to train, what to evaluate, and when to stop searching. UK enterprise buyers who internalise that framing will procure AI better than buyers who continue to treat accuracy as a feature label on a slide.

Three success factors for UK organisations putting this into practice:

Make one function accountable. The data and analytics team, or whichever function owns evaluation infrastructure, should own the AI accuracy diligence step inside procurement. Without a named owner, the questions go unasked.

Treat the diligence answers as evidence, not collateral. The vendor’s responses to the six questions are the procurement record of accuracy claims at the moment of signature. Keep them. Re-test against them. Cite them in any subsequent dispute.

Monitor in production, not only in evaluation. A vendor who passed the diligence questions on signature day is a vendor whose claims need re-verification each quarter against the same questions. Build the monitoring into vendor management as routine, not as exception.

Next-steps checklist for UK enterprise buyers

Add the six diligence questions to your AI procurement scorecard before your next vendor decision
Identify the named owner for AI accuracy diligence inside your organisation
Allocate two to four weeks for the diligence step, separate from security and commercial review
Build at least three independent held-out test sets per AI use case
Schedule monthly accuracy regression checks for every deployed AI tool, with a fuller re-verification of vendor claims each quarter
Re-run the diligence questions at every vendor renewal and after every model upgrade

Source citation and attribution

This analysis draws on Perplexity’s published technical essay, How Perplexity Builds Accuracy into Frontier AI, by the Perplexity Team, 22 April 2026, available at perplexity.ai/hub/blog/how-perplexity-builds-accuracy-into-frontier-ai. The procurement diligence framework is original Resultsense work derived from those engineering distinctions.

For UK enterprise leaders, Resultsense provides ongoing analysis of AI vendor capability, procurement diligence, and operational risk. Visit insights for further articles, news for daily UK AI coverage, or contact for direct engagement.
