The cleanest data point on AI trust failure is from a study where the model never failed. BCG, Harvard and Wharton handed 758 consultants a task with a deliberately poisoned brief. Without GPT-4, 84% got it right. With GPT-4, 60 to 70% got it right. Same model, worse outcome, because the surface gave them no reason to slow down on the one task where they should have (Mollick, 2023).
That gap is the whole post. The model didn't lie. The interface didn't make doubt easy. Most trust failures we see on client audits are not model failures. They are UX failures dressed up as model failures. Four design moves close the gap on every system we ship into an established workflow:
- Citations on every claim.
- Calibrated confidence that predicts behaviour.
- Escape hatches by default.
- Drift monitors the user can see.
The surface owns the user's calibration, not the model
Ethan Mollick's framing is the cleanest available: AI capability sits inside a "jagged frontier". For some tasks the model is superhuman. For tasks one step over, it is confidently wrong. The user cannot tell from the output which side they are on.
That is a surface problem, not a model problem. Whatever the weights do, the interface decides whether the user develops calibrated trust or learned helplessness. Google Research ran the empirical version in 2024 with 76 software engineers using Bard. Two findings should haunt anyone shipping AI into a daily workflow:
- Automation complacency grew with time on task. Engineers became more deferential to the AI the longer they used it, not more critical.
- Self-reported satisfaction diverged from measured productivity. Users said the tool was great while it was making them slower.
The lesson is uncomfortable for product teams. The trust signal you collect from a survey lies. The trust signal you collect from behaviour (accept, reject, edit) is real. And the longer your surface goes without designed friction, the further your users drift from calibrated use. We covered the production-side instrumentation in our checklist for AI in production; this post is about what the user sees that produces those events in the first place.
Move one: citations on every claim, not as a footnote
The first move is the one most teams think they have already shipped. They have not. Most "cited" AI outputs are a numbered footnote pointing to a document, not a sentence. The user cannot get from the claim to the span in under ten seconds, so they do not bother, so the citation is decorative.
Anthropic shipped the Citations API in January 2025 as a fix for exactly this. Developers add source documents to the context; Claude returns a response where each claim is tagged with the exact sentences in the source it came from. The cited spans are guaranteed pointers into the supplied documents, not regenerated text the model could have made up.
Simon Willison's framing of why this matters is the one we use with clients:
"Citations are a form of fact-checking: the user can confirm that the quoted text did indeed come from those documents." (Simon Willison, on Anthropic's Citations API, 2025)
Trust comes from the user's ability to verify, not from the model being right more often. A surface that hands the user a verification path on every claim teaches them where the frontier is. A surface that does not, does not.
The product evidence is in Anthropic's launch announcement. Endex, an Anthropic customer, took source-hallucination and formatting errors from around 10% to 0% after wiring up Citations. The model did not get smarter. The surface did.
If your "citation" cannot move the user from claim to span in one click, you have not shipped a citation. You have shipped a footnote.
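The one-click test can be sketched as a span check. This is a minimal illustration, not Anthropic's API: the `Citation` fields are hypothetical names loosely modelled on character-offset citations, and `resolve_span` is our own helper. Check the current API reference for the real response shape before wiring anything up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Citation:
    """One pointer from a claim into a supplied source document (hypothetical shape)."""
    document_index: int
    start_char: int
    end_char: int
    cited_text: str


def resolve_span(citation: Citation, documents: list[str]) -> tuple[int, int]:
    """Return the (start, end) offsets the UI should scroll to.

    Raises ValueError when the quoted text does not sit at those offsets,
    because a pointer the user cannot verify is a footnote, not a citation.
    """
    doc = documents[citation.document_index]
    span = doc[citation.start_char:citation.end_char]
    if span != citation.cited_text:
        raise ValueError("cited text does not match the source span")
    return citation.start_char, citation.end_char


# Illustrative data only.
documents = ["Revenue grew 12% in Q3. Churn was flat."]
quote = "Revenue grew 12% in Q3."
good = Citation(document_index=0, start_char=0, end_char=len(quote), cited_text=quote)
print(resolve_span(good, documents))
```

The offsets returned are what a source chip would use to scroll and highlight. A citation that fails the check should not render as a chip at all.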
Move two: calibrated confidence that means something
Most confidence scores in production AI surfaces are theatre. A softmax exported as a percentage, a five-star widget, a coloured dot. Users learn to ignore them within a week because the number is not predictive of anything they care about.
A confidence signal "means something" when it predicts how often the user accepts, rejects or fixes the response. If responses tagged 90% confident get accepted at the same rate as responses tagged 70% confident, the score is decoration. The Google Research study supplies the test: log take-it / leave-it / fix-it per response, then bucket by the confidence you reported. If the curve is flat, your confidence label is lying.
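A minimal sketch of that bucketing, assuming each logged event is a pair of the confidence percentage shown to the user and a "take", "leave" or "fix" outcome. The function names and the 10-point `min_gap` threshold are our assumptions, not figures from the study.

```python
from collections import defaultdict


def accept_rate_by_bucket(events, width=10):
    """Accept rate per reported-confidence bucket.

    events: iterable of (confidence_pct, outcome), where confidence_pct is the
    0-100 integer shown to the user and outcome is "take", "leave" or "fix".
    Returns {bucket_floor: accept_rate}, ordered by bucket.
    """
    shown = defaultdict(int)
    taken = defaultdict(int)
    for confidence_pct, outcome in events:
        bucket = min(confidence_pct, 99) // width * width  # 100 joins the top bucket
        shown[bucket] += 1
        taken[bucket] += outcome == "take"
    return {b: taken[b] / shown[b] for b in sorted(shown)}


def is_flat(rates, min_gap=0.10):
    """True when the top bucket is not accepted meaningfully more than the bottom one."""
    values = list(rates.values())
    return len(values) < 2 or values[-1] - values[0] < min_gap


# Illustrative events: both buckets are accepted two times out of three.
events = [(92, "take"), (95, "take"), (90, "fix"),
          (71, "take"), (74, "take"), (78, "leave")]
rates = accept_rate_by_bucket(events)
print(rates, "flat" if is_flat(rates) else "sloped")
```

Comparing only the lowest and highest buckets is a crude slope test. With real traffic you would want enough events per bucket for the rates to be stable before reading anything into the gap.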
Two design rules fall out of this.
First, calibrate by class of failure, not by global score. A retrieval-grounded answer with a span citation is structurally high confidence: you can show your work. The same model answering from training data with no retrieval is structurally low confidence regardless of how fluent it sounds. These two states should not produce the same number. They should produce different surfaces. Span-grounded answers render with a green source chip. Ungrounded answers render with a visible "no source, model knowledge only" badge and a different background colour. The user learns the difference in the first session.
Second, surface confidence per claim, not per response. The jagged frontier lives at the sentence level. A four-paragraph response can be three paragraphs of well-grounded summary and one paragraph of confident invention. A single response-level score hides exactly the thing the user needs to see (Mollick, 2023). Per-claim confidence is more work to compute and worth it on any surface where the wrong answer is expensive.
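Both rules can be sketched as a per-claim surface state rather than a score. The `Claim` type and `badge_for` helper below are hypothetical. The only thing taken from the text is the split between span-grounded claims and claims with no source.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Claim:
    text: str
    # (start, end) character offsets into a source document; None when ungrounded.
    source_span: Optional[tuple[int, int]] = None


def badge_for(claim: Claim) -> str:
    """Map grounding state to a surface state, not to a number."""
    if claim.source_span is not None:
        return "source"  # rendered as the green source chip
    return "no source, model knowledge only"


def render(claims: list[Claim]) -> list[str]:
    """One labelled line per claim, so a single response can mix both states."""
    return [f"[{badge_for(c)}] {c.text}" for c in claims]


# Illustrative response: one grounded sentence, one ungrounded sentence.
response = [
    Claim("Revenue grew 12% in Q3.", source_span=(0, 23)),
    Claim("Growth will likely continue next year."),
]
for line in render(response):
    print(line)
```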
Move three: escape hatches by default
Anthropic's own documentation on hallucination reduction has a primary recommendation that almost no production prompt follows: explicitly permit the model to say "I don't know". Anthropic reports that this permission "drastically" reduces false information, and treats admissions of uncertainty as a target output category, not a failure mode.
Most prompts we audit do the opposite. They instruct the model to "always provide a helpful answer", because the product manager treated "I don't know" as a missed conversion. That instruction is a hallucination factory. It tells the model that fabrication is preferred to abstention. Users then absorb the fabrication at a rate determined by the surface, not the model.
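As a rough sketch of the alternative, the prompt can name abstention as an allowed output and the logging layer can record it as its own category. The wording of `SYSTEM_PROMPT` and the `classify_output` helper are our illustration, not text taken from Anthropic's documentation.

```python
ABSTAIN_TEXT = "I don't know"

# Hypothetical wording: the point is that abstention is explicitly permitted.
SYSTEM_PROMPT = (
    "Answer only from the documents provided. "
    f"If they do not contain the answer, reply exactly: {ABSTAIN_TEXT}. "
    "An honest abstention is a good answer; a guess is not."
)


def classify_output(text: str) -> str:
    """Label a model output so abstentions are logged as a target category."""
    normalised = text.strip().rstrip(".").lower()
    return "abstain" if normalised == ABSTAIN_TEXT.lower() else "answer"


print(classify_output("I don't know."))
print(classify_output("The contract renews on 1 March."))
```

Exact-match detection is deliberately strict. A model that hedges in free text would need a looser rule, which is a design choice rather than something the source prescribes.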
The surface needs the same affordance the prompt does. Every response carries:
- A one-click "this looks wrong" path.
- A one-click "hand off to a human" path.
- Both unpenalised in telemetry.
If your dashboards count handoffs as failures, your team will quietly remove the button. We have seen this happen on three of our last five audits.
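One way to keep the button safe is to take escapes out of the headline metric entirely. The sketch below assumes hypothetical outcome names ("take", "leave", "fix", "flag_wrong", "handoff") and reports escapes on their own line instead of folding them into the accept rate.

```python
from collections import Counter

# Hypothetical event names for the two escape-hatch buttons.
ESCAPE_OUTCOMES = {"flag_wrong", "handoff"}


def quality_summary(outcomes):
    """Accept rate over AI-handled responses, with escapes reported separately.

    Escapes are excluded from the accept-rate denominator, so pressing the
    button cannot drag the headline KPI down.
    """
    counts = Counter(outcomes)
    total = sum(counts.values())
    escapes = sum(counts[o] for o in ESCAPE_OUTCOMES)
    handled = total - escapes
    return {
        "accept_rate": counts["take"] / handled if handled else None,
        "escape_rate": escapes / total if total else None,
    }


# Illustrative log: three AI-handled responses and two escapes.
summary = quality_summary(["take", "take", "fix", "handoff", "flag_wrong"])
print(summary)
```

Whether "this looks wrong" should stay out of the denominator is a judgment call. The sketch follows the rule above that both paths go unpenalised, and the escape rate is still visible for anyone who wants to watch it.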
This is where the Google Research data bites hardest. Automation complacency grew with time on task. Without a designed friction surface, users stop reading the output. The escape hatch is not a fallback for the day the model breaks. It is the daily friction that keeps the user verifying. "I don't know" is the single most pro-trust thing the surface can say. Ship the button that lets the system say it.
Move four: drift monitors users can see
The fourth move is the one teams reliably postpone. They build the model, ship the surface, set up an on-call channel, and call it done. Drift detection lives on a Grafana board no user will ever see. Three months later, accept rates have slid 12 points and nobody noticed until a customer complained.
Anthropic publishes how they handle this at the scale of the deployed Claude surface. Real-time classifiers (themselves fine-tuned Claude models) monitor conversations for policy violations and steer responses live. Hierarchical summarisation surfaces account-level concerns that no single message reveals. Privacy-preserving clustering of traffic patterns refines protections as usage shifts. Crucially, the backend monitoring is paired with public transparency surfaces: system cards, threat reports, a bug bounty.
The use case there is abuse monitoring, not accuracy drift, but the pattern generalises. Ship the drift dashboard to the user, not just to on-call: for example, a small weekly accept-rate panel rendered in the corner of the AI surface.
That panel is a trust surface. It tells the user three things at once: the system is being watched, the system is improving (or it is not), and the user's corrections are doing work. None of that requires new infrastructure if you already log take-it / leave-it / fix-it. The accept rate plotted per workflow over time is your drift telemetry (Qian and Wexler, IUI 2024). When it drops, the user is the canary, and they should be the first to know.
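A minimal sketch of the aggregation behind such a panel, assuming events are already logged as (ISO week, workflow, outcome) tuples. The event shape, the workflow name and the panel wording are all hypothetical, and the numbers are made up for illustration.

```python
from collections import defaultdict


def weekly_accept_rates(events):
    """Accept rate per workflow per week.

    events: iterable of (iso_week, workflow, outcome), e.g. ("2025-W14", "contracts", "take").
    Returns {workflow: [(iso_week, accept_rate), ...]} sorted by week.
    """
    shown = defaultdict(int)
    taken = defaultdict(int)
    for week, workflow, outcome in events:
        shown[(workflow, week)] += 1
        taken[(workflow, week)] += outcome == "take"
    series = defaultdict(list)
    for workflow, week in sorted(shown):
        series[workflow].append((week, taken[(workflow, week)] / shown[(workflow, week)]))
    return dict(series)


def panel_line(workflow, series):
    """One line for the in-app panel: latest accept rate and change since the previous week."""
    week, latest = series[-1]
    if len(series) < 2:
        return f"{workflow} {week}: {latest:.0%} accepted"
    delta = (latest - series[-2][1]) * 100
    return f"{workflow} {week}: {latest:.0%} accepted ({delta:+.0f} pts vs last week)"


# Illustrative events only.
events = [
    ("2025-W13", "contracts", "take"), ("2025-W13", "contracts", "take"),
    ("2025-W13", "contracts", "take"), ("2025-W13", "contracts", "fix"),
    ("2025-W14", "contracts", "take"), ("2025-W14", "contracts", "fix"),
]
rates = weekly_accept_rates(events)
print(panel_line("contracts", rates["contracts"]))
```

Tagging each event with the model version, as the summary table suggests, would let the same series be split whenever the underlying model changes.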
The four moves line up cleanly with the UX element that carries them and the metric that proves they work:
| Design move | UX element | Metric that proves it |
|---|---|---|
| Citations on every claim | Inline source chip, one-click span scroll | Source-hallucination rate (Endex: ~10% → 0%) |
| Calibrated confidence | Per-claim badge, grounded vs ungrounded backgrounds | Accept rate by confidence bucket (curve must slope) |
| Escape hatches | "this looks wrong" and "hand off to a human" buttons | Handoff and regenerate rate, untouched by KPI weighting |
| Drift monitors | In-app weekly accept-rate panel | Accept rate trend per workflow, model_version tagged |
This pairs directly with the discipline we wrote up in "agents pass the benchmark, then fail in production" and the production-grade telemetry baseline in the production AI checklist. Drift you can see is the consumer-facing layer on top of the audit log discipline in "audit logs that survive AI traffic". Same pipeline, different consumer. The on-call engineer reads the log. The user reads the dashboard. Both prevent the same failure.
Ship the surface
Pick one AI surface you shipped last quarter. Pull the last 50 outputs. For how many can a non-expert user, in under 10 seconds, do all four of the following:
- Trace a load-bearing claim to a source sentence.
- Tell whether the system was confident or guessing on each claim.
- Take a non-AI path without a penalty in their workflow or in your telemetry.
- See whether the system is improving or degrading on their work this week.
If the answer is fewer than 40, you do not have a model problem. You have a surface problem. The good news is that surfaces ship in days. Pick the weakest of the four moves and put it in front of users next sprint. Trust will follow the surface, not the weights.
References
- Anthropic — Introducing Citations on the Anthropic API
- Anthropic — Reduce hallucinations
- Anthropic — Building safeguards for Claude
- Simon Willison — Anthropic's new Citations API
- Google Research — Take it, Leave it, or Fix it: Measuring Productivity and Trust in Human-AI Collaboration
- Ethan Mollick — Centaurs and Cyborgs on the Jagged Frontier