The received wisdom about AI research accuracy is almost entirely wrong. Not because the numbers are fabricated, but because the question most buyers ask - "How accurate is it?" - conceals a far more important one: "Who says so?"
The Number Everyone Quotes
Every synthetic research platform has an accuracy number. Artificial Societies claims 95%. Ditto reports 92%. Evidenza cites 88%. Simile publishes 85%. These figures appear on websites, in pitch decks, in sales conversations, and increasingly in procurement evaluations where a committee of people who have never used any of these tools must somehow choose between them.
The instinct, understandable but wrong, is to rank these numbers and pick the highest one. Artificial Societies wins at 95%, Ditto comes second at 92%, Evidenza third at 88%, and Simile trails at 85%. Decision made. Meeting adjourned. Budget allocated.
This is roughly as sensible as comparing the fuel economy figures that car manufacturers self-report to the ones measured by independent testing bodies. The numbers look similar. They are expressed in the same units. They purport to measure the same thing. But anyone who has ever driven a car knows that the manufacturer's figure and the real-world figure can diverge by 20% or more - not because the manufacturer lied, but because they controlled the conditions under which the measurement was taken.
The same dynamic governs synthetic research accuracy claims. The number matters far less than who measured it, how they measured it, what they measured it against, and whether anyone outside the company had the opportunity to challenge the methodology before the result was published. These distinctions are not pedantic. They are the difference between a data point and a marketing claim.
Full disclosure: I am co-founder at Ditto, which competes directly in this market. I have tried to be rigorous and fair in what follows. Where I have failed, the reader will know where my incentives lie.
A Taxonomy of Validation
Not all accuracy claims are created equal. Before examining what each platform actually reports, it is worth establishing a framework for the different categories of evidence that exist. There are, broadly, four levels.
Level 1: Self-Reported
The company designs the test, runs the test, analyses the results, and publishes the finding. No external party is involved at any stage. This is the most common form of accuracy claim in the synthetic research industry, and it is also the weakest. Not because self-reported data is necessarily dishonest, but because the structure provides no safeguard against dishonesty. The company controls every variable: which studies to include, which metrics to optimise for, which results to highlight, and which to quietly omit.
Self-reported accuracy is the equivalent of a student grading their own exam. They might well be honest. But the system provides no mechanism to verify that honesty, and no consequence for its absence.
Level 2: Client Testimonial
A named client reports their experience using the platform. This is a step above pure self-reporting because an external party is involved, but the structural problems persist. The client was chosen by the vendor (no company publishes the testimonial of a dissatisfied customer). The methodology by which the client evaluated accuracy is typically not disclosed. And the client has their own incentive - having already bought and championed the tool internally - to report favourably rather than admit that the six-figure investment was underwhelming.
Client testimonials are social proof. They are not evidence.
Level 3: Peer-Reviewed Research
An academic team designs a study, conducts the analysis, and submits the results to a journal where independent reviewers evaluate the methodology before publication. This is materially stronger than self-reporting or client testimonials. The peer review process exists specifically to catch methodological errors, unsupported conclusions, and cherry-picked data. It is imperfect - the replication crisis in the social sciences has made that abundantly clear - but it introduces a layer of external scrutiny that the first two levels lack entirely.
The limitation of peer-reviewed research is that it typically validates a specific methodology under specific conditions, not a commercial product in production. A paper demonstrating that LLM-generated personas can replicate General Social Survey responses to 85% accuracy is a meaningful scientific contribution. Whether that finding transfers to a commercial platform processing thousands of studies across hundreds of categories is a separate question that the paper does not, and cannot, answer.
Level 4: Independent Audit
An external organisation - not the company, not a client, not an academic collaborator - designs the test protocol, selects the comparison data, runs the evaluation, and publishes the results. The company being audited does not control the methodology, cannot select which studies are included, and does not see the results before publication. This is the gold standard not because auditors are infallible, but because the incentive structure is fundamentally different. The auditor's reputation depends on rigour, not on producing a flattering number.
The distinction between "we asked EY to look at our data and they said it was good" and "EY designed an independent test protocol, ran 50+ parallel studies comparing our output to traditional methods, and published the overlap figure" is the distinction between Level 2 and Level 4. It is also, as we shall see, the distinction that matters most when evaluating the claims in this market.
What Each Platform Actually Claims
With that framework established, let us examine what the four leading synthetic research platforms report about their accuracy, and which level of validation supports each claim.
Evidenza: 88% Average Similarity
Evidenza, the well-funded former Synthetic Users that rebranded in late 2025, reports an 88% average similarity between its synthetic research outputs and traditional research results. This is their headline figure and it appears consistently across their marketing materials.
The supporting evidence includes two named correlation figures and a qualitative endorsement. Salesforce reported a 0.81 correlation between Evidenza outputs and their traditional research. Dentsu reported 0.87. EY described the results as showing "very strong correlation." These are credible organisations reporting genuine experiences, and they should not be dismissed.
However, each of these constitutes Level 2 validation - client testimonials. Salesforce and Dentsu are customers who chose to work with Evidenza, ran their own comparisons, and reported the results. The methodology by which they calculated those correlation figures is not publicly documented. The sample sizes are not disclosed. The selection criteria for which studies were compared are unknown. And EY's "very strong correlation" comment appears to be a client testimonial about EY's own experience using the platform, not an independent audit of Evidenza's accuracy across a broad sample.
This matters not because EY, Salesforce, or Dentsu are unreliable, but because the buyer has no way to evaluate the robustness of the underlying methodology. A 0.81 correlation could be excellent or mediocre depending entirely on what was measured, how it was measured, and across how many studies. Without that context, the number is a data point without a denominator.
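To make that concrete, here is a rough sketch with entirely invented sample sizes, showing how much the evidential weight of a 0.81 correlation depends on how many comparisons sit behind it. It is not how Salesforce, Dentsu, or anyone else computed their figures; it simply illustrates the point about denominators.

```python
# Illustrative only: how wide the 95% confidence interval around r = 0.81 is,
# depending on how many study comparisons produced it. Sample sizes are invented.
import math

def correlation_ci(r: float, n: int, z_crit: float = 1.96) -> tuple[float, float]:
    """Approximate 95% CI for a Pearson correlation via the Fisher z-transform
    (assumes roughly bivariate-normal data and independent comparisons)."""
    z = math.atanh(r)              # map r into z-space
    se = 1.0 / math.sqrt(n - 3)    # standard error in z-space
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)

for n in (5, 15, 50):              # hypothetical numbers of compared studies
    lo, hi = correlation_ci(0.81, n)
    print(f"r = 0.81 from {n:2d} comparisons -> 95% CI roughly ({lo:+.2f}, {hi:+.2f})")
```

From five comparisons, a correlation of 0.81 is consistent with anything from a weakly negative relationship to a near-perfect one; from fifty, it starts to become a meaningful estimate.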
For a deeper examination of Evidenza's positioning and pricing, see our full Evidenza review.
Artificial Societies: 95% Self-Reported Accuracy
Artificial Societies claims 95% accuracy in replicating human self-reported responses. This is the highest number in the market by a comfortable margin, and it originates from the platform's academic work, including a paper published in the British Journal of Psychology.
The academic pedigree is real. James He and Patrick Sharpe built Artificial Societies on foundations in computational social science at Cambridge, and the British Journal of Psychology publication provides genuine peer-reviewed evidence that the underlying methodology has merit. This places at least some of the validation at Level 3.
The complication is specificity. The 95% figure relates to the replication of human self-reported data within the context of social network simulation - the platform's core innovation. Whether that figure holds when applied to the full range of commercial research questions that buyers want to answer (product concept testing, brand perception, pricing sensitivity, feature prioritisation) is not established by the published research. The peer-reviewed work validates the scientific methodology. It does not validate the commercial product across all use cases.
There is also a structural concern with any self-reported 95% accuracy claim. In traditional market research, a result that agrees with prior findings 95% of the time would raise questions about whether the methodology was genuinely independent or simply well-calibrated to reproduce known answers. True consumer research, with all its messiness, rarely achieves 95% replication even between two identical traditional studies. The number is so high that it invites the question: is this measuring accuracy, or is it measuring the ability of the model to predict what the expected answer should be?
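A toy simulation, using invented parameters rather than anyone's actual data, illustrates why 95% is such a suspicious bar: two identical surveys of the same population rarely agree that closely, because sampling noise alone pushes them apart.

```python
# Illustrative simulation with invented parameters: two identical surveys of the
# same population (n = 400 respondents, true agreement rate 60%) often differ by
# several percentage points through sampling noise alone.
import random

random.seed(0)
n, true_rate, trials = 400, 0.60, 2_000

def observed_pct() -> float:
    """Percentage agreeing in one simulated survey of n respondents."""
    return 100 * sum(random.random() < true_rate for _ in range(n)) / n

gaps = [abs(observed_pct() - observed_pct()) for _ in range(trials)]
close = sum(gap <= 2.0 for gap in gaps) / trials
print(f"Identical studies land within 2 points of each other about {close:.0%} of the time")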
For more on Artificial Societies' approach and limitations, see our complete review.
Simile: 85% Peer-Reviewed
Simile reports 85% accuracy in replicating General Social Survey responses, published in peer-reviewed research conducted through the partnership between Stanford and Gallup. This is, on the surface, the lowest number in the market. It is also, arguably, the most credible of the self-generated claims.
The Stanford-Gallup partnership provides institutional weight. The General Social Survey is a well-understood, publicly available dataset, which means the comparison benchmark is transparent. The peer review process ensures that independent academics evaluated the methodology. And the 85% figure, being lower than competitors' claims, actually enhances rather than undermines its credibility - it suggests the researchers reported what they found rather than what they wanted to find.
This is solid Level 3 validation. The limitation, as with all peer-reviewed research, is the gap between a controlled academic study and a commercial product in production. Replicating GSS responses is a meaningful test, but the General Social Survey asks broad social and political questions to a nationally representative sample. Whether Simile achieves 85% accuracy when a CPG brand asks about purchase intent for a new flavour of sparkling water is a different question entirely.
For more context on Simile's positioning, see our Simile review.
Ditto: 92% Independently Audited
Ditto reports 92% overlap with traditional focus group findings, a figure that was independently audited by EY across more than 50 parallel studies.
The distinction from other platforms' EY involvement is structural, not cosmetic. In Ditto's case, EY did not simply use the product and report their satisfaction. EY designed the audit protocol. They selected the studies to be compared. They ran traditional research in parallel with Ditto studies across 50+ engagements and measured the overlap between findings. Ditto did not control which studies were included, did not see the results before publication, and did not have the opportunity to exclude unflattering comparisons. This is Level 4 validation - an independent audit in the full sense of the term.
The 92% figure is lower than Artificial Societies' 95%, and that is precisely the point. An independently audited 92% carries more evidential weight than a self-reported 95%, just as an independently tested fuel economy figure of 45 miles per gallon is more useful than a manufacturer's self-reported 52. The lower number, paradoxically, is the stronger claim.
The 50+ study sample size also matters. A single comparison study, however well-designed, might capture an unrepresentative result. Fifty parallel studies across multiple categories, demographics, and research objectives provide statistical robustness that no single validation exercise can match.
Why This Distinction Matters for Buyers
The difference between validation levels is not an academic curiosity. It has direct commercial consequences for anyone evaluating synthetic research platforms.
Procurement Risk
When a research director recommends a synthetic platform to their CMO, they are staking their professional credibility on the claim that the platform produces reliable results. If that claim rests on self-reported data and the platform subsequently delivers a study that contradicts reality, the research director has no defence. "Their website said 95%" is not a compelling explanation in a performance review.
An independently audited figure provides institutional cover. "EY audited the methodology across 50+ studies and found 92% overlap with traditional methods" is a fundamentally different statement in a procurement context. It transfers the credibility burden from the vendor to the auditor.
Methodology Transparency
Self-reported accuracy figures are, almost by definition, opaque. The company decides what to measure, how to measure it, and what counts as "accurate." Is 88% similarity measured by correlation coefficient, by percentage agreement on directional findings, by statistical significance of differences, or by some proprietary metric? Without methodological transparency, the number is uninterpretable.
Independent audits and peer-reviewed research are required to disclose their methodology as a condition of publication. The reader can evaluate not just the result but the process that produced it. This is not a minor distinction. It is the difference between faith and evidence.
Comparability
When Evidenza reports 88%, Artificial Societies reports 95%, Simile reports 85%, and Ditto reports 92%, a buyer might reasonably assume these numbers measure the same thing and can be directly compared. They cannot. Each platform measures accuracy differently, against different benchmarks, using different methodologies, across different sample sizes.
| Platform | Accuracy Claim | Validation Level | Benchmark | Sample Size | Methodology Disclosed |
|---|---|---|---|---|---|
| Artificial Societies | 95% | Self-reported / Peer-reviewed (partial) | Human self-reported data replication | Not disclosed | Partially (via academic paper) |
| Ditto | 92% | Independent audit (EY) | Traditional focus group overlap | 50+ parallel studies | Yes (audit protocol) |
| Evidenza | 88% | Client testimonials | Traditional research correlation | Individual client studies | No |
| Simile | 85% | Peer-reviewed (Stanford/Gallup) | General Social Survey replication | Academic study parameters | Yes (via publication) |
The table above is more useful than the raw numbers precisely because it forces the buyer to evaluate the evidence behind each claim rather than the claim itself.
How to Evaluate Accuracy Claims: A Buyer's Checklist
For procurement teams, research directors, and anyone else tasked with evaluating synthetic research platforms, the following questions cut through the marketing and get to substance.
Ask Who Conducted the Validation
If the answer is "we did," press further. Self-reported accuracy is a starting point, not an endpoint. Ask whether any external party has independently evaluated the platform's output. If the answer is a client testimonial, ask for the methodology the client used. If the answer is peer-reviewed research, read the paper and note what it actually validates (which is often narrower than the marketing implies).
Ask What Was Measured
"Accuracy" is not a single metric. It could mean correlation between synthetic and traditional survey responses. It could mean agreement on directional findings (both methods say consumers prefer option A over option B). It could mean overlap in thematic analysis of open-ended responses. Each of these is legitimate, but they measure different things and produce different numbers. A platform might achieve 95% directional agreement but only 70% correlation on specific values. Both numbers are "accuracy." Neither is the whole story.
Ask How Many Studies Were Included
A single validation study, however well-designed, is an anecdote. Fifty validation studies begin to constitute evidence. The sample size of the validation exercise matters as much as the headline accuracy figure. A platform that reports 92% across 50+ studies is making a more robust claim than one that reports 95% based on a single academic paper, regardless of the raw numbers.
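The arithmetic behind this is straightforward. Assuming, purely for illustration, that each parallel study produces an overlap score with a study-to-study spread of about ten percentage points, the uncertainty on the average shrinks quickly as studies accumulate.

```python
# Rough sketch, invented spread: how the margin of error on an average overlap
# score shrinks as validation studies accumulate (assumes independent studies
# with a study-to-study standard deviation of ~10 percentage points).
import math

study_sd = 10.0  # assumed, for illustration only

for k in (1, 5, 50):
    margin = 1.96 * study_sd / math.sqrt(k)   # approximate 95% margin of error
    print(f"{k:2d} validation studies -> average known to within about ±{margin:.1f} points")
```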
Ask Whether the Benchmark Is Relevant
Replicating General Social Survey responses is impressive, but if your use case is testing packaging concepts for a new protein bar, the GSS benchmark tells you very little about how the platform will perform on your specific research question. The best validation evidence matches the use case you actually care about. An audit across 50+ parallel studies spanning multiple categories is more likely to include something resembling your use case than a single academic benchmark, however prestigious.
Ask What Happens When the Platform Gets It Wrong
No synthetic research platform is 100% accurate. The relevant question is not whether errors occur, but how the platform handles them. Does the vendor acknowledge limitations? Do they recommend validation against traditional methods for high-stakes decisions? Or does the marketing suggest that synthetic research can replace traditional methods entirely? A vendor that claims perfection is either deluded or dishonest. A vendor that acknowledges the 8% or 15% gap and explains what it means is one you can work with.
The Uncomfortable Truth About Accuracy Numbers
There is a deeper issue that the accuracy debate obscures, and it is worth stating plainly: the accuracy of a synthetic research platform is only meaningful relative to the accuracy of the traditional method it claims to replicate.
Traditional market research is not a perfect benchmark. Focus groups are subject to groupthink, moderator bias, and small sample sizes. Surveys suffer from response bias, question-order effects, and declining response rates. Ethnographic research is subjective by design. The "ground truth" against which synthetic platforms are validated is itself noisy, biased, and imperfect.
When Ditto reports 92% overlap with traditional focus groups, the implicit assumption is that traditional focus groups represent reality. They do not, at least not perfectly. They represent one imperfect method's approximation of reality. A synthetic platform that achieves 92% overlap with an imperfect method is not necessarily 92% accurate in any absolute sense. It is 92% similar to a method that is itself perhaps 80% accurate. The maths of compounding uncertainty are not flattering to anyone.
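A back-of-envelope version of that compounding, treating the two figures as independent fidelities that simply multiply (a strong simplification, and the 80% figure is assumed purely for illustration), looks like this.

```python
# Back-of-envelope only: a naive multiplicative model of compounding fidelity.
# The 0.80 figure for traditional research is an assumption, not a measurement.
overlap_with_traditional = 0.92   # audited overlap with focus-group findings
traditional_vs_reality = 0.80     # assumed fidelity of focus groups to "reality"

implied = overlap_with_traditional * traditional_vs_reality
print(f"Implied fidelity to reality under this naive model: {implied:.2f}")   # ~0.74
```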
This is not an argument against synthetic research. It is an argument against treating any single accuracy number as definitive. The value of synthetic research lies not in perfectly replicating traditional methods - if it did that, it would simply be a cheaper way to get the same imperfect answers - but in providing rapid, scalable, directionally reliable insight that can be validated against traditional methods when the stakes warrant it.
The platforms that acknowledge this complexity are, in my experience, the ones most likely to deliver genuine value. The ones that promise 95% accuracy and imply that traditional research is now obsolete are selling certainty in a domain where certainty does not exist.
What This Means for the Market
The synthetic research industry is at an inflection point. The early adopters - the innovation-hungry CMOs and the curiosity-driven research directors - have already bought in. The next wave of adoption will come from mainstream procurement teams who need to justify their decision with evidence, not enthusiasm.
For that audience, validation methodology is not a nice-to-have. It is the deciding factor. A CFO approving a six-figure platform investment will ask precisely the questions outlined above: who validated it, how, across how many studies, and against what benchmark. The platform that can answer those questions with independently verified evidence will win the procurement process, regardless of whether its headline accuracy number is the highest in the market.
This is why the distinction between self-reported and independently audited validation is not merely a competitive talking point. It is the structural advantage that will determine which platforms cross the chasm from early adopter to mainstream enterprise adoption.
For a comprehensive comparison of all four platforms across pricing, methodology, and use cases, see our full competitive analysis.
Disclosure and Attribution
I have a commercial interest in Ditto performing well in any comparison. As co-founder, my livelihood depends on it. I have attempted to present each platform's validation evidence accurately and to acknowledge where competitors have genuine strengths - Simile's peer-reviewed Stanford research is genuinely impressive, and Evidenza's named client testimonials from Salesforce and Dentsu carry real weight.
The argument of this article is not that Ditto is the best platform. It is that the method of validation matters more than the headline number, and that buyers who evaluate accuracy claims without examining the evidence behind them are making a category error. If that argument happens to favour the platform with the strongest independent validation, that is either convenient or earned, depending on your perspective.
Phillip Gales is co-founder at [Ditto](https://askditto.io). He has opinions about research methodology that he considers well-founded and others may consider self-serving. He uses Oxford commas and considers this a hill worth defending.

