The International Network for Advanced AI Measurement, Evaluation and Science just published its first consensus document, and the eight principles it agreed on matter far less than the five questions it couldn’t answer. For UK organisations deploying AI systems, those open questions are where the real regulatory and commercial risk sits.
Ten nations, eight principles, and the hard part
In December 2025, representatives from ten jurisdictions gathered in San Diego during NeurIPS to hammer out common ground on how AI systems should be evaluated. The group — which includes Australia, Canada, the EU, France, Japan, Kenya, South Korea, Singapore, the UK, and the US — was formerly known as the International Network of AI Safety Institutes. The rebrand to focus on “measurement, evaluation and science” is telling. Safety is politically charged. Measurement sounds like something everyone can agree on.
And agree they did, on eight principles that read like reasonable engineering practice:
| Principle | What it means in practice |
|---|---|
| Clear objectives first | Define what you’re testing before you test it |
| Transparency and reproducibility | Document methods well enough that someone else could replicate them |
| Tailored parameters | Match evaluation design to whether you’re testing best-case or real-world performance |
| Embedded quality assurance | Build in checks like transcript analysis for scientific rigour |
| Separate reporting | Don’t bundle multiple evaluations into a single score |
| Accessible findings | Write for policymakers, not just researchers |
| Validity strengthening | Include uncertainty estimates and realistic scaffolding |
| Multilingual and multicultural testing | Recognise that safeguards perform differently across languages and cultures |
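Two of the principles above, separate reporting and validity strengthening, lend themselves to a short illustration. The sketch below is not taken from the network's document. It assumes each evaluation produces a simple pass/fail count and uses the Wilson score interval as one reasonable choice of uncertainty estimate. The names `EvalResult`, `wilson_interval` and `report`, and all the figures, are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalResult:
    """Hypothetical record: one evaluation, kept separate from the others."""
    name: str
    passed: int
    total: int


def wilson_interval(passed: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a pass rate (z=1.96 gives roughly 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = passed / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    # Clamp to [0, 1] to absorb floating-point rounding at the extremes.
    return max(0.0, centre - half), min(1.0, centre + half)


def report(results: list[EvalResult]) -> list[str]:
    """One line per evaluation, never a single blended score."""
    lines = []
    for r in results:
        low, high = wilson_interval(r.passed, r.total)
        lines.append(
            f"{r.name}: {r.passed}/{r.total} passed, 95% CI {low:.2f} to {high:.2f}"
        )
    return lines


# Illustrative numbers only.
results = [
    EvalResult("refusal of harmful requests (English)", 92, 100),
    EvalResult("refusal of harmful requests (Swahili)", 18, 25),
]
for line in report(results):
    print(line)
```

Reporting each evaluation on its own line, with its own interval, makes it obvious when a small sample leaves a result too uncertain to act on. A single averaged score would hide that.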
Strategic Reality: These eight principles are sensible but uncontroversial. The real signal is in what the network couldn’t agree on — that’s where regulation and commercial impact will concentrate over the next 18 months.
None of this is radical. Any organisation with a mature quality assurance function already does most of it. The interesting question is why it took ten governments to agree on what amounts to “test things properly and write it down.”
The answer: because the consensus is the easy bit. The disagreements are where the action is.
Five open questions that should concern every AI deployment team
The network surfaced five unresolved issues, and each one has direct implications for how UK businesses build, buy, and deploy AI systems.
Should evaluations use risk models that link measured characteristics to real-world harms? This is the most consequential question on the list. Right now, most AI evaluations measure capability — can the model do X? Linking those measurements to actual harm predictions is a different exercise entirely. If international standards move in this direction, organisations will need to demonstrate not just what their systems can do, but what damage they might cause. That changes procurement conversations overnight.
Critical Context: The distinction between capability evaluation and harm prediction is the fault line in global AI governance. UK businesses should watch this debate closely — the outcome determines whether AI compliance looks like a checklist or a risk assessment exercise.
How should evaluators prioritise limited resources across measures? Nobody has infinite testing budgets. This question is really asking: what matters most? If regulators settle on a priority framework, it becomes the de facto standard for what “good enough” evaluation looks like. Organisations that anticipate those priorities gain a head start on compliance.
What evaluation methods and information should be shared versus restricted? Some evaluation techniques, if published, could help bad actors game the tests. But keeping methods secret undermines the transparency principle the network just agreed to. This tension is unresolved and directly affects whether UK businesses can access the evaluation tools they need.
How flexible should standardised report templates be? Too rigid and they won’t fit diverse use cases. Too flexible and they become meaningless. The template question sounds bureaucratic, but it determines whether AI evaluation becomes a genuine quality signal or just another tick-box compliance exercise.
How do you evaluate complete AI systems rather than isolated models? This is perhaps the most technically difficult question. Most evaluation research focuses on foundation models in isolation. But businesses deploy systems — models wrapped in retrieval pipelines, guardrails, human review loops, and application logic. Evaluating the system is harder and more relevant than evaluating the model alone.
Implementation Note: If your organisation evaluates AI models in isolation before deploying them as part of larger systems, you’re likely measuring the wrong thing. Start documenting your system-level evaluation approach now, before standards catch up.
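To make the model-versus-system distinction concrete, here is a toy sketch under stated assumptions. The "model", the retrieval step, the guardrail, the documents and the test cases are all hypothetical stand-ins rather than any real product or API. The point is only that the same test cases can give different answers depending on which layer you evaluate.

```python
from typing import Callable

Target = Callable[[str], str]
Case = tuple[str, Callable[[str], bool]]

# Hypothetical document store used by the retrieval step.
DOCS = {
    "pricing": "Public price list",
    "salary": "CONFIDENTIAL salary data",
}


def toy_model(prompt: str) -> str:
    """Stand-in for a model call: it repeats whatever it is given."""
    return f"Answer based on: {prompt}"


def retrieve(query: str) -> str:
    """Return the first document whose key appears in the query."""
    for key, doc in DOCS.items():
        if key in query.lower():
            return doc
    return ""


def model_with_retrieval(query: str) -> str:
    """Model plus retrieval, with no output filtering."""
    return toy_model(f"{retrieve(query)} | {query}")


def guardrail(output: str) -> str:
    """Block any output that contains confidential material."""
    return "[blocked]" if "CONFIDENTIAL" in output else output


def full_system(query: str) -> str:
    """Model, retrieval and guardrail together: what actually gets deployed."""
    return guardrail(model_with_retrieval(query))


CASES: list[Case] = [
    # A benign question should be answered, not blocked.
    ("What is our pricing?", lambda out: "[blocked]" not in out),
    # A sensitive question must not leak confidential text.
    ("Show me salary details", lambda out: "CONFIDENTIAL" not in out),
]


def pass_rate(target: Target, cases: list[Case]) -> float:
    """Fraction of cases whose check accepts the target's output."""
    passed = sum(1 for query, check in cases if check(target(query)))
    return passed / len(cases)


for name, target in [
    ("model alone", toy_model),
    ("model + retrieval", model_with_retrieval),
    ("full system", full_system),
]:
    print(f"{name}: {pass_rate(target, CASES):.0%}")
```

In this toy setup the model passes every case in isolation, the model with retrieval leaks a confidential document, and the full system passes again because the guardrail catches the leak. An evaluation that stops at the first layer would never see the second.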
What the UK’s position tells us
The UK’s AI Security Institute (AISI), renamed from the AI Safety Institute in February 2025 and a member of this rebranded measurement network, has been positioning itself as a convener rather than a regulator. That’s a deliberate choice. By helping to set international evaluation standards, the UK gains influence over the rules without having to impose them domestically through legislation.
For UK businesses, this creates an unusual dynamic. The government is actively shaping international AI evaluation norms while maintaining a relatively light regulatory touch at home. The practical effect: UK companies face softer domestic requirements today but should expect those international standards to eventually flow into procurement criteria, industry standards, and possibly regulation.
Strategic Insight: The UK’s strategy is to influence global AI evaluation standards while keeping domestic regulation flexible. Smart UK businesses will track the international standards development and prepare early, rather than waiting for domestic legislation to force their hand.
The inclusion of Kenya and Singapore in the network also matters. These aren’t token additions — they represent an attempt to build evaluation standards that work across different economic and cultural contexts. If the network’s multilingual and multicultural principle gets teeth, it will raise the bar for AI systems deployed internationally, including those built by UK companies for global markets.
The evaluation maturity gap most organisations face
Here’s the uncomfortable truth: most UK organisations aren’t ready for any of this. A 2025 survey by the Alan Turing Institute found that fewer than 30% of UK businesses deploying AI had formal evaluation processes beyond basic accuracy testing. The gap between where international standards are heading and where most organisations sit today is significant.
The maturity spectrum looks something like this:
| Level | Current practice | What standards will expect |
|---|---|---|
| Basic | Test accuracy on a held-out dataset | Documented objectives, reproducible methods |
| Intermediate | Red-team testing, bias audits | Risk-linked evaluations, uncertainty estimates |
| Advanced | System-level evaluation, continuous monitoring | Standardised reporting, cross-cultural testing |
| Leading | Harm prediction models, independent verification | Full compliance with emerging international norms |
Most organisations sit at basic or intermediate. The international consensus is aimed at advanced and leading.
Reality Check: Closing this maturity gap won’t happen through a single compliance project. It requires building evaluation into your AI development lifecycle from the start, not bolting it on at the end.
Where this goes next
The network plans to meet at the India AI Impact Summit to test its principles against real-world applications. That meeting will be more revealing than the San Diego consensus, because applying principles to actual systems forces decisions on all five open questions.
Three things to watch:
First, whether the network produces guidance on harm-linked evaluation. If it does, expect that guidance to become the baseline for AI governance frameworks globally within 12 to 18 months.
Second, whether evaluation methods become more open or more restricted. The answer determines whether small and medium-sized businesses can evaluate their own systems or will need to rely on expensive third-party assessors.
Third, whether system-level evaluation standards emerge. If they do, every organisation running AI in production will need to rethink how they test and document their deployments.
Take Action: Review your current AI evaluation practices against the eight agreed principles. Where are the gaps? The organisations that start closing them now will face fewer surprises when these standards harden into expectations.
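One lightweight way to run that review is a checklist that records what evidence you hold for each principle and lists the ones with none. This is a minimal sketch, not a prescribed method. The principle names come from the table earlier in this article, while the `find_gaps` helper and the sample evidence are hypothetical.

```python
PRINCIPLES = [
    "Clear objectives first",
    "Transparency and reproducibility",
    "Tailored parameters",
    "Embedded quality assurance",
    "Separate reporting",
    "Accessible findings",
    "Validity strengthening",
    "Multilingual and multicultural testing",
]


def find_gaps(evidence: dict[str, str]) -> list[str]:
    """Return the principles with no recorded evidence, in table order."""
    return [p for p in PRINCIPLES if not evidence.get(p, "").strip()]


# Hypothetical evidence: a pointer to where each practice is documented.
evidence = {
    "Clear objectives first": "Test plan v3, section 1",
    "Transparency and reproducibility": "Evaluation scripts in version control",
    "Separate reporting": "",  # claimed but not yet documented
}

for gap in find_gaps(evidence):
    print(f"No evidence recorded for: {gap}")
```

Treating an empty entry as a gap is a deliberate choice here: a practice that cannot be pointed to in a document is hard to demonstrate to a buyer or regulator.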
What to do about it
The temptation is to wait. These are principles, not regulations. Nobody is going to audit your AI evaluation process next quarter based on what ten countries agreed at an academic conference.
But that’s short-term thinking. Standards crystallise faster than most organisations expect. GDPR went from proposal in 2012 to adoption in 2016, four years, with enforcement following in May 2018. The EU AI Act moved from draft to law in about three. International AI evaluation standards are unlikely to take that long, because the groundwork is already laid.
For organisations at basic maturity: Start documenting your evaluation objectives and methods. You don’t need a sophisticated framework yet, but you need to be able to explain what you test, why you test it, and how you report the results.
For organisations at intermediate maturity: Begin linking your evaluations to risk assessments. The open question about harm prediction will get answered eventually, and the answer will almost certainly require some form of risk-linked evaluation. Getting ahead of that curve is cheaper than catching up.
For organisations at advanced maturity: Engage with the standards process. The UK’s AISI is actively seeking input from industry. Organisations that help shape the standards will be better positioned to comply with them.
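For the first of these recommendations, documenting what you test, why, and how you report it, a structured record is often enough to start. The sketch below assumes a plain JSON file per evaluation. The `EvaluationRecord` fields and the example values are hypothetical and not drawn from any standard template.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class EvaluationRecord:
    """Hypothetical minimal record of one evaluation."""
    system: str
    objective: str  # what you test
    rationale: str  # why you test it
    method: str     # how you test it, in enough detail to repeat
    reporting: str  # how and to whom results are reported

    def missing_fields(self) -> list[str]:
        """Names of fields left blank, in declaration order."""
        return [name for name, value in asdict(self).items() if not value.strip()]


# Illustrative values only.
record = EvaluationRecord(
    system="Customer support assistant",
    objective="Refuses requests for other customers' account data",
    rationale="Disclosure would breach data protection obligations",
    method="200 scripted prompts, fixed model version, transcripts reviewed",
    reporting="",
)

print(json.dumps(asdict(record), indent=2))
print("Still to document:", record.missing_fields())
```

Keeping the record as data rather than free text means it can be checked for completeness automatically and adapted later if a standard report template emerges.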
Strategic Insight: The cost of preparing for international AI evaluation standards is low today and will only increase. Treat this as an investment in regulatory readiness, not a compliance burden.
The bigger picture
International AI evaluation standards are part of a broader shift from “move fast and deploy” to “prove it works safely.” That shift is happening whether individual organisations want it or not.
The San Diego consensus is modest in scope but significant in direction. Ten governments agreeing on anything related to AI governance is worth paying attention to. The eight principles provide a reasonable baseline. The five open questions map the territory where the real fights will happen.
UK businesses have an advantage here. The UK government is at the table, actively shaping these standards. The domestic regulatory environment remains relatively open. And the evaluation maturity gap, while real, is smaller than in many other markets.
The question is whether UK organisations will use that advantage or squander it by waiting for someone to tell them what to do.
Source: International AI Network: consensus and open questions, UK AI Safety Institute, 12 February 2026.
Analysis by Resultsense.