The POC convinced the CFO. The on-call rotation has not been written yet. Somewhere between those two facts, an engineering leader has to answer a different question than the one the demo answered: is this thing ready for real users, real money, and a 3am page? Most AI projects die in that gap.
Not because the model is wrong. Because the operating discipline around it was never built, and the cost of that absence is concrete: an audit demand you cannot answer, a silent regression you cannot roll back, an incident with no trace, a deprecated model that still serves 4% of traffic six months after retirement.
This is the checklist DAD applies to every enterprise AI build before we call it shippable. It is organised against the four functions of the NIST AI Risk Management Framework (Govern, Map, Measure, Manage), not because NIST is mandatory for most readers, but because the internal audit team you will eventually sit across the table from already speaks that vocabulary. The four columns are the spine. The artefacts inside them are the work.
A checklist of features is not a checklist. An operating discipline is.
The version of this checklist we shipped in 2024 had four bullets: security, audit, monitoring, maintenance. It was right in spirit and a generation behind in detail. Agentic systems and generative models exposed risk classes the original framing never named.
- Prompt injection that survives one model swap and not the next.
- Tool-call traces that no logging system was designed to ingest.
- Prompt regressions that look exactly like model regressions on the dashboard and require an entirely different rollback.
The NIST AI RMF gives the canonical structure to replace the old four-bullet list: four functions (Govern, Map, Measure, Manage), each backed by a Playbook of suggested actions and related guidance. For generative systems specifically, NIST AI 600-1 enumerates 12 risks novel to or exacerbated by GAI and pairs them with more than 200 suggested actions. Provenance, confabulation, data leakage, harmful content, CBRN and cyber misuse: first-class risk categories, not afterthoughts you address in the security review the week before launch.
Use NIST as scaffolding, not as the topic. Engineering leaders do not need a compliance lecture. They need to know which artefact to build first, who owns it, and how an auditor will read it. The next four sections are that map.
Govern: the artefacts an auditor will ask for on day one.
The first question an auditor asks is not technical. It is "show me the inventory." If you cannot produce a current list of every model deployed in production, with owner, version, training data lineage, evaluation results, and last review date, the conversation is already going badly.
Uber's engineering team published their version of this artefact: a centralised Model Catalog with auto-populated Model Cards per deployed model, with feature attribution computed post-training and automatically linked into the card. The Model Card is the one-page answer to the four questions every reviewer asks: what is this, who owns it, what does it do, when did it last change. Auto-populated is the operative word. A Model Card maintained by hand decays inside a quarter. A Model Card generated by the same pipeline that ships the model stays current because it has to.
Uber's other move is the one most teams under-invest in. They integrated governance into the ML life cycle through a shift-left approach, bringing governance checks into the earliest planning stages rather than the release process. Governance at the release gate is theatre. The PM and the engineer have already agreed on the launch date. The compliance review is a forced choice between a slip and a waiver, and the waiver wins every time. Move the checks into planning and the PR template, and you stop fighting the same fight every quarter.
The audit trail sits inside Govern because the question it answers is a governance question: can you reconstruct what the system did, for whom, and on what version when an auditor or a regulator asks? Most teams log. Few can query. Fewer can reconstruct a specific incident inside an hour. For an agentic system the tool-call trace is the audit log:
- Inputs (or input hashes when the raw text is sensitive).
- Model and prompt versions pinned per call.
- Tool sequence with arguments and responses.
- Confidence scores and user context.
- Timestamps and recovery paths taken when a tool failed.
Schema it early. Retrofit it and you discover halfway through a regulator's request that the trace dropped the prompt_version field on a refactor two quarters ago.
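A minimal sketch of what "schema it early" can look like, assuming a Python service and standard-library dataclasses. Every name here (`InferenceTrace`, `ToolCall`, `hash_input`, the individual field names) is hypothetical and chosen for illustration; the point is that required fields such as `prompt_version` are enforced at write time, so a refactor that drops one fails immediately rather than two quarters later.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import Any, Optional


def hash_input(raw_text: str) -> str:
    """Return a SHA-256 hex digest so sensitive raw text never enters the log."""
    return hashlib.sha256(raw_text.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class ToolCall:
    name: str
    arguments: dict[str, Any]
    response_summary: str
    succeeded: bool
    recovery_path: Optional[str] = None  # e.g. "retry" or "fallback"; None if nothing failed


@dataclass(frozen=True)
class InferenceTrace:
    trace_id: str
    input_sha256: str
    model_version: str
    prompt_version: str
    user_context: dict[str, str]
    tool_calls: tuple[ToolCall, ...]
    confidence: Optional[float]
    timestamp_utc: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def __post_init__(self) -> None:
        # Reject incomplete traces when they are written, not when an auditor asks.
        for name in ("trace_id", "input_sha256", "model_version", "prompt_version"):
            if not getattr(self, name):
                raise ValueError(f"trace field {name!r} must be non-empty")

    def to_json(self) -> str:
        """Serialise with sorted keys so stored records are stable and diffable."""
        return json.dumps(asdict(self), sort_keys=True)
```

The sketch covers the record shape only. Append-only storage, retention, and query indexing are separate concerns it does not attempt to address.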
The Govern column on the checklist:
- Model Catalog with auto-populated Model Cards, linked to feature attribution and the eval suite that gates promotion.
- RBAC mirrored to org structure, IdP integration, secrets rotation, scoped API keys per service. The 2024 checklist was right about this; keep it.
- Decision rights and RACI for model approval, retraining, and deprecation. Who can promote a model from staging? Who signs off on a prompt change? Who decides a model is retired? Write the names down.
- Shift-left controls in the PR template, not the release gate. Compliance checks that run on every PR cost almost nothing per PR and almost everything when skipped.
- Structured trace per inference with append-only storage, documented retention aligned to the regulatory regime in scope, and indexing for query rather than archival. The success metric is mean time to reconstruct an incident, not bytes stored per day.
- PII handling documented at the trace schema level. Hashes, redactions, and key-rotation policy in the schema itself, not in a wiki page next to it.
Map: name the risks before the demo, not after the incident.
Map is the column most engineering teams skip and most regulators ask about first. The Map function is where you write down what the system is actually for, who it affects, what it is allowed to do, and which failure modes you have decided to accept.
For an agentic build, the Map column has to enumerate at minimum:
- Tools the agent is permitted to call.
- Data scopes those tools can read or write.
- User categories the system is built for, and the categories it is not built for.
- Regulatory regimes in scope.
- Threat models you have decided to defend against versus accept.
For a generative system, the NIST GAI Profile risk categories are the cleanest starting taxonomy: data provenance and integrity, confabulation, harmful content, IP exposure, human-AI interaction failures, value-chain transparency, CBRN and cyber misuse. The deliverable is a one-page risk register, not a compliance binder. Walk down the list with the team that built the system and document where each risk lives, who watches it, and what the threshold is. If a category does not apply, write why. If it does, write the eval that proves you can detect it.
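One way to keep the register honest is to make its completeness checkable. The sketch below is illustrative only: the category labels paraphrase the list above rather than quoting NIST's wording, and `RiskRow` and `register_gaps` are hypothetical names. It encodes the two rules from the paragraph: a category that does not apply needs a written reason, and a category that does apply needs an owner, a detecting eval, and a threshold.

```python
from dataclasses import dataclass
from typing import Optional

# Labels for this sketch, paraphrased from the text; not official NIST names.
CATEGORIES = (
    "data provenance and integrity",
    "confabulation",
    "harmful content",
    "IP exposure",
    "human-AI interaction failures",
    "value-chain transparency",
    "CBRN and cyber misuse",
)


@dataclass(frozen=True)
class RiskRow:
    category: str
    applies: bool
    owner: Optional[str] = None
    detection_eval: Optional[str] = None
    acceptance_threshold: Optional[str] = None
    rationale_if_not_applicable: Optional[str] = None


def register_gaps(rows: list[RiskRow]) -> list[str]:
    """Return human-readable gaps; an empty list means the register is complete."""
    gaps: list[str] = []
    by_category = {row.category: row for row in rows}
    for category in CATEGORIES:
        row = by_category.get(category)
        if row is None:
            gaps.append(f"{category}: no row")
        elif not row.applies:
            # "If a category does not apply, write why."
            if not row.rationale_if_not_applicable:
                gaps.append(f"{category}: marked not applicable without a reason")
        else:
            # "If it does, write the eval that proves you can detect it."
            for field_name in ("owner", "detection_eval", "acceptance_threshold"):
                if not getattr(row, field_name):
                    gaps.append(f"{category}: missing {field_name}")
    return gaps
```

Running a check like this in CI would turn the one-page register into something that cannot silently go stale, though the judgement about thresholds stays with the team.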
The Map column on the checklist:
- Use-case scoping document with permitted and prohibited uses.
- Threat model covering prompt injection, data exfiltration via tool calls, jailbreaks, training-data poisoning, supply-chain risks on third-party models.
- Risk register against the GAI Profile categories with owner, detection mechanism, and acceptance threshold per row.
- Data lineage: every dataset used to train, fine-tune, or retrieve is documented with provenance and licence, not just a row in a Notion table.
Measure: evals before features, monitoring beyond latency.
Measure is where the 2024 checklist was thinnest. "Model performance metrics over time" is not a column heading. It is a placeholder.
The eval suite is the dev loop, not a launch gate. Anthropic's engineering team puts it directly:
"Good evaluations help teams ship AI agents more confidently. Without them, it's easy to get stuck in reactive loops, catching issues only in production." (Anthropic engineering, Demystifying evals for AI agents)
The operational shape is consistent: combine code-based, model-based, and human graders; seed eval sets from real production failures; treat the suite as a maintained system that grows weekly, not a snapshot taken before launch.
Score the system, not the model. AWS's account of evaluating agents at Amazon names the shift directly: agentic AI systems require a fundamental shift in evaluation methodologies that assesses not only the underlying model performance but also the emergent behaviors of the complete system. For agents, that means tool-selection accuracy, multi-step reasoning coherence, and error-recovery rate are first-class signals on the same dashboard as latency and cost. Not nice-to-haves.
Then track reliability, not just capability. pass@1 is a coin flip dressed up as an SLA. pass^k, the probability that the agent succeeds on all k repeated runs of the same task, is the number that survives contact with users. We make the full argument in /blog/2026-05-12-your-agent-passes-the-benchmark-it-will-fail-in-production, including the diagnostic dashboard (Reliability Decay Curve, Variance Amplification Factor, Graceful Degradation Score, Meltdown Onset Point) and the fault-injection harness that produces them. Wire those four into the Measure column from day one.
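To make the gap between the two numbers concrete, here is a small sketch of estimating pass^k from repeated runs of a single task. It assumes you ran the task `trials` times and observed `successes` passes, and it uses the combinatorial estimator C(successes, k) / C(trials, k): the fraction of size-k subsets of your runs in which every run passed. The function name is hypothetical, and the choice of estimator is an assumption of this sketch, not something the linked post prescribes.

```python
from math import comb


def pass_hat_k(successes: int, trials: int, k: int) -> float:
    """Estimate the probability that all k independent runs of a task succeed.

    Uses C(successes, k) / C(trials, k), i.e. the share of k-run subsets
    drawn from the observed trials in which every run passed.
    """
    if not 0 < k <= trials:
        raise ValueError("k must be between 1 and the number of trials")
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")
    # math.comb returns 0 when k > successes, so the estimate is 0.0 in that case.
    return comb(successes, k) / comb(trials, k)


# A task that passes 9 times out of 10 looks like 90% on a single run...
single_run = pass_hat_k(successes=9, trials=10, k=1)   # 0.9
# ...but only half of the 5-run subsets are failure-free.
five_runs = pass_hat_k(successes=9, trials=10, k=5)    # 0.5
```

With k set to 1 the estimator reduces to the plain success rate, which is why the same function can feed both lines on the dashboard.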
GAI-specific evals belong in the same suite: provenance verification rates, confabulation rates against a ground-truth set, leakage probes for PII and prompt-content, prompt-injection survival rates against a maintained adversarial corpus. The GAI Profile suggested actions are the catalogue; the eval cases are yours to write.
Cost-per-inference and budget alarms sit on the same dashboard as accuracy. We treat a 3x cost regression as a reliability regression in disguise: it usually means a retry loop you did not write or a prompt that grew by 4,000 tokens during a refactor nobody flagged.
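A hedged sketch of the cost check described above, assuming you can pull per-inference cost samples for a baseline window and a current window. The function names are hypothetical, and using the median rather than the mean is a design choice of this sketch (it keeps one runaway request from triggering the alarm); the 3x default comes from the paragraph above.

```python
from statistics import median
from typing import Sequence


def cost_regression_ratio(
    baseline_costs: Sequence[float], current_costs: Sequence[float]
) -> float:
    """Ratio of median current cost-per-inference to median baseline cost."""
    if not baseline_costs or not current_costs:
        raise ValueError("both windows need at least one cost sample")
    baseline = median(baseline_costs)
    if baseline <= 0:
        raise ValueError("baseline median cost must be positive")
    return median(current_costs) / baseline


def is_reliability_alarm(
    baseline_costs: Sequence[float],
    current_costs: Sequence[float],
    threshold: float = 3.0,
) -> bool:
    """Treat a cost regression at or above the threshold as a reliability alarm."""
    return cost_regression_ratio(baseline_costs, current_costs) >= threshold
```

Routing the result to the same on-call as accuracy alarms is what makes it a reliability signal rather than a finance report.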
The Measure column on the checklist:
- Eval suite in CI on every PR. Code-based, model-based, and human graders.
- pass^k tracked alongside pass@1. A drop in pass^k of more than 5 points fails the check even when pass@1 ticks up.
- System-level signals: tool-selection accuracy, multi-step coherence, recovery rate, not just final-answer accuracy.
- GAI risk evals: provenance, confabulation, leakage, harmful-content rate, prompt-injection survival, run nightly against the adversarial corpus.
- Production-seeded eval set. Every incident produces at least one new eval case within a week.
- Cost and accuracy on the same chart. Budget alarms route to the same on-call as accuracy alarms.
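The reliability rule in the list above can be expressed as a small CI gate. This is a sketch under stated assumptions: both metrics are reported as percentages from 0 to 100, `EvalSummary` and `reliability_gate` are hypothetical names, and the 5-point limit is the one quoted in the checklist. Note that the gate deliberately ignores pass@1, so an improvement there cannot mask a reliability drop.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalSummary:
    pass_at_1: float   # percentage, 0-100
    pass_hat_k: float  # percentage, 0-100


def reliability_gate(
    baseline: EvalSummary,
    candidate: EvalSummary,
    max_drop_points: float = 5.0,
) -> tuple[bool, str]:
    """Fail when pass^k drops by more than max_drop_points, whatever pass@1 does."""
    drop = baseline.pass_hat_k - candidate.pass_hat_k
    if drop > max_drop_points:
        return False, (
            f"pass^k dropped {drop:.1f} points (limit {max_drop_points:.1f}); "
            f"pass@1 moved from {baseline.pass_at_1:.1f} to {candidate.pass_at_1:.1f}"
        )
    return True, "pass^k within tolerance"
```

In a pipeline, the boolean would decide the exit code and the message would go into the PR comment.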
Manage: rollback, incident response, deprecation.
Rollback is the artefact most teams discover they do not have at the moment they need it. The first version of the runbook is always written during the incident, which is always the worst time to write it.
Version the triple, not just the model weights. A prompt change is a model change. An eval-set change is a model change. Storing model weights in S3 with a hash but tracking prompts in a Google Doc and eval cases in someone's local fork is the configuration of a team that has not yet had to roll back at 3am. Versioned model, versioned prompt, versioned eval set, deployed together as one immutable artefact.
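A minimal sketch of "version the triple", assuming each of the three components can be reduced to bytes and hashed. `ReleaseManifest`, `digest`, and `release_id` are hypothetical names. The release identifier is derived from all three digests, so changing the prompt or the eval set produces a new release just as changing the weights does.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


def digest(content: bytes) -> str:
    """SHA-256 hex digest of one component (weights file, prompt text, eval set)."""
    return hashlib.sha256(content).hexdigest()


@dataclass(frozen=True)
class ReleaseManifest:
    model_sha256: str
    prompt_sha256: str
    eval_set_sha256: str

    @property
    def release_id(self) -> str:
        # Hash the canonical JSON of all three digests: any change yields a new id.
        canonical = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(canonical).hexdigest()[:16]


v1 = ReleaseManifest(digest(b"weights"), digest(b"prompt v1"), digest(b"evals v1"))
v2 = ReleaseManifest(digest(b"weights"), digest(b"prompt v2"), digest(b"evals v1"))
prompt_change_is_a_release = v1.release_id != v2.release_id
```

Rolling back then means redeploying a previous `release_id`, which brings the matching prompt and eval set with it.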
Make canary and shadow the default. A new model serves 1-5% of traffic with shadow scoring against the incumbent until the eval delta is positive and stable. Promotion is a configuration change, not a deploy. Rollback is the same configuration change in reverse. The runbook for rollback should fit on one page and have an SLO on time-to-revert. If you do not know how long it takes to revert, you are not ready to ship.
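The "promotion is a configuration change" idea can be sketched as a routing config plus a deterministic traffic split. Everything here is an assumption for illustration: `RoutingConfig`, `route`, `promote`, and `rollback` are hypothetical names, and hashing a request key into 10,000 buckets is one common way to split traffic, not the only one. Shadow scoring against the incumbent is out of scope for this sketch.

```python
import hashlib
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class RoutingConfig:
    incumbent: str         # release id currently serving most traffic
    candidate: str         # release id under canary
    canary_percent: float  # 0-100, share of traffic sent to the candidate


def route(request_key: str, config: RoutingConfig) -> str:
    """Deterministically pick a release so the same key always gets the same one."""
    hashed = hashlib.sha256(request_key.encode("utf-8")).digest()
    bucket = int.from_bytes(hashed[:8], "big") % 10_000
    if bucket < config.canary_percent * 100:
        return config.candidate
    return config.incumbent


def promote(config: RoutingConfig) -> RoutingConfig:
    """Send all traffic to the candidate. No deploy, just a new config value."""
    return replace(config, canary_percent=100.0)


def rollback(config: RoutingConfig) -> RoutingConfig:
    """The same change in reverse: send all traffic back to the incumbent."""
    return replace(config, canary_percent=0.0)
```

Because both directions are a single field change, time-to-revert is bounded by how fast the config propagates, which is the number the rollback SLO should measure.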
Incident response closes the loop back into evals. Anthropic's framing is direct: "catching issues only in production" is the failure mode evals exist to prevent. Every incident produces at least one new eval case, ideally within a week. The eval suite at month six does not look like the eval suite at launch. That is the point. The suite is a written record of everything that has ever surprised you, and it gets longer.
Deprecation is the artefact most teams skip and most auditors check. NIST Manage explicitly covers retirement. A deployed model with no documented retirement plan is a liability with no end date. Write the plan when you write the deployment plan:
| Artefact | What it captures | Failure mode it prevents |
|---|---|---|
| Retirement triggers | Drift, cost, capability replacement thresholds | Models that linger past their useful life |
| Migration path | How traffic moves to the successor model | The 4%-of-traffic ghost deployment |
| Trace retention | Data-handling rules for the historical log | An audit request you cannot fulfil |
| Sunset timeline | Dates with named owners | Indefinite "we'll get to it" |
The Manage column on the checklist:
- Versioned model, prompt, and eval set deployed as one immutable artefact.
- Canary plus shadow as the default promotion path, with an explicit eval-delta gate.
- Rollback runbook with an SLO on time-to-revert and an annual drill against the SLO.
- Incident-to-eval pipeline so the suite grows from production reality, not internal speculation.
- Deprecation plan per deployed model, owned by the same RACI as promotion.
What to do this week.
Pick the column where your current production system is weakest. Most leaders, reading this list, will pick Govern or Manage; almost no one picks Map, because the risk register is usually the work that nobody scheduled. Whichever column you pick, write the missing artefact this week. One page. Owner named. A real example walked through end to end.
Ship-readiness is a four-column promise, not a single milestone. The Model Catalog, the trace schema, the risk register, the eval suite, the rollback runbook, the deprecation plan. Six artefacts spread across four columns. If you can produce all six on demand for any deployed model, your next audit is a conversation, not a fire drill. If you cannot, the gap is the work.