Execution and Delivery
The AI-native Definition of Ready, delivery in the builder era, and sprint ceremonies for agentic products.
TL;DR
- AI-native features require a far more rigorous Definition of Ready, including an evals framework, failure mode UX, routing strategy, and cost ceiling.
- When PMs prototype with AI coding tools, delivery starts with a working prototype, not a spec. Engineering's job shifts to hardening and scaling.
- Sprint reviews for AI products must include agent trace analysis and model performance alongside the UI demo.
Discovery ensures you build the right thing. Execution ensures you build it right.
Core delivery principles
1. Small, frequent, uncoupled releases
Avoid heroic, monolithic releases. Ship a steady drumbeat of value. This lets you respond to customer needs quickly and detect problems early.
2. Instrumentation
Instrument products from the outset. Capture data to understand user behaviour, validate hypotheses, and measure impact.
3. Monitoring
Use strong alerting to quickly detect issues. Monitor not just feature success but reliability, accuracy, and performance. For AI features, this extends to model quality metrics, agent completion rates, and cost-per-query trends.
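As a sketch of what such alerting might look like in code (the metric names and threshold values below are illustrative assumptions, not prescribed standards):

```python
from dataclasses import dataclass

@dataclass
class WindowMetrics:
    """Aggregated metrics for one monitoring window (e.g. the last hour)."""
    eval_pass_rate: float         # fraction of sampled outputs passing the eval set
    agent_completion_rate: float  # fraction of agent runs reaching a terminal success state
    cost_per_query_usd: float     # mean inference cost per user query

# Illustrative alert thresholds -- real values come from your DoR and budget.
THRESHOLDS = {
    "eval_pass_rate": 0.90,         # alert if quality drops below the eval benchmark
    "agent_completion_rate": 0.95,  # alert if agents fail to finish too often
    "cost_per_query_usd": 0.05,     # alert if cost exceeds the agreed ceiling
}

def check_alerts(m: WindowMetrics) -> list[str]:
    """Return an alert message for every breached threshold."""
    alerts = []
    if m.eval_pass_rate < THRESHOLDS["eval_pass_rate"]:
        alerts.append(f"eval pass rate {m.eval_pass_rate:.2f} below benchmark")
    if m.agent_completion_rate < THRESHOLDS["agent_completion_rate"]:
        alerts.append(f"agent completion rate {m.agent_completion_rate:.2f} below threshold")
    if m.cost_per_query_usd > THRESHOLDS["cost_per_query_usd"]:
        alerts.append(f"cost per query ${m.cost_per_query_usd:.3f} over ceiling")
    return alerts
```

In practice these checks run inside your observability platform rather than in application code, but the shape is the same: every AI-specific metric gets an explicit threshold and a paging rule.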
Delivery in the builder era
When PMs prototype with AI coding tools, the boundary between discovery and delivery compresses. A PM can walk into sprint planning with a functioning prototype that demonstrates value, tests assumptions, and reveals edge cases, all built in a day or two.
This changes the handoff. The sprint may start with a working prototype instead of a spec. The engineering team's job shifts from "build this from scratch" to "harden, scale, and production-ready this."
What that looks like in practice:
The PM brings a prototype, not a PRD. The prototype is throwaway code, but it's real. Engineers can interact with it, stress-test it, and identify the gaps between demo and production. Conversations become concrete, not abstract.
Engineering estimates change. "Build this feature" becomes "take this working prototype and make it reliable at scale." The estimation conversation shifts from "how long to build?" to "what's the gap between this prototype and production?" Often, the answer is smaller than starting from zero.
The DoR evolves. The Definition of Ready may include a link to the working prototype alongside (or instead of) design mocks. The prototype becomes a shared reference point for the entire team.
Quality gates remain unchanged. A prototype that works is not a product that ships. All the rigour of testing, security review, accessibility, performance benchmarking, and code review still applies. The builder era compresses the front of the cycle, not the back.
Delivery practices
1. Backlog management
A well-managed backlog is the single source of truth. Key practices:
Write clear, actionable user stories. Focus on the who, what, and why. Format: "As a [user type], I want to [take an action] so that I can [achieve a goal]."
Integrate all work. For AI features, the backlog must include new work item types (prompt engineering tasks, eval maintenance items, model retraining tickets) to make this work visible and trackable.
Embrace the Definition of Ready. Use a shared set of criteria to ensure work is properly scoped before development starts. For AI-native features, this is non-negotiable.
Practise continuous refinement. Collaboratively review and break down upcoming items weekly to reduce friction and improve sprint quality.
The AI-native Definition of Ready
For traditional software, the Definition of Ready is simple. For AI-native features, the work is probabilistic and the DoR must be far more rigorous.
An AI-tagged backlog item is not "ready" until it meets all of the following:
| Standard criteria | AI-native criteria (additive) |
|---|---|
| Story is clear | Data provenance verified – data feasibility report is attached and approved |
| Acceptance criteria defined | Evals framework attached – the eval set is included; it serves as the acceptance criteria |
| No blocking dependencies | Acceptance benchmark defined – the AC is a statistical target (e.g., "model must pass the eval set with >90% score") |
| Design mocks attached | Initial prompts attached – the PM has provided baseline prompts to be engineered. This is the "requirement". |
| | Failure mode UX defined – design mocks for model failure are included (how to handle hallucinations, low-confidence answers, user feedback) |
| | Ethical/bias review complete – the red team report is attached, with work to mitigate highest-priority risks |
| | Routing/orchestration strategy defined – for multi-step or agentic features, the workflow graph is documented: which model handles which step, what triggers fallback, how errors propagate |
| | Cost ceiling per query established – the maximum acceptable inference cost per user interaction is agreed, with a plan for staying under it at projected volume |
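To make the statistical acceptance benchmark concrete, here is a minimal sketch of the check. The grading of individual cases is abstracted away (real eval frameworks may use exact-match scoring, rubric graders, or LLM-as-judge); the eval-set size and 90% target are illustrative.

```python
def passes_acceptance(results: list[bool], benchmark: float = 0.90) -> tuple[bool, float]:
    """Given per-case pass/fail results from running the eval set,
    return whether the model meets the statistical acceptance target,
    plus the observed score."""
    score = sum(results) / len(results)
    return score >= benchmark, score

# Each entry is the graded outcome of one eval case (prompt + expected behaviour).
eval_results = [True] * 46 + [False] * 4   # 46 of 50 cases pass
ok, score = passes_acceptance(eval_results)
# 46/50 = 0.92, so this run clears a ">90% score" benchmark
```

The point is that "done" is a number, not a demo: a prompt change that drops the score below the benchmark fails the acceptance criteria even if the happy path still looks fine.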
2. Sprint and release cadence
The cadence is the predictable heartbeat of delivery. Key ceremonies, adapted for AI-native and agentic products:
Sprint planning. Must include estimation for AI-specific tasks (prompt engineering, eval creation) and prioritisation of model retraining items against new features. When a PM brings a working prototype, planning focuses on the delta between prototype and production.
Daily stand-up. Expect new blocker types beyond the usual dependency and environment issues:
- "Agent workflow reliability is below threshold on step 3"
- "Cost per task is exceeding the budget ceiling"
- "Eval regression detected after the latest prompt change"
- "The model is hallucinating on edge case X in staging"
These are first-class blockers, not footnotes.
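The cost blocker above is easy to quantify in a stand-up. A sketch of the arithmetic, using made-up per-token prices (substitute your provider's actual rates and your team's agreed ceiling):

```python
# Hypothetical per-token prices in USD -- check your provider's current pricing.
PRICE_PER_INPUT_TOKEN = 3.00 / 1_000_000    # $3 per million input tokens
PRICE_PER_OUTPUT_TOKEN = 15.00 / 1_000_000  # $15 per million output tokens

def task_cost_usd(steps: list[tuple[int, int]]) -> float:
    """Sum inference cost over a multi-step agent task.
    Each step is (input_tokens, output_tokens)."""
    return sum(i * PRICE_PER_INPUT_TOKEN + o * PRICE_PER_OUTPUT_TOKEN
               for i, o in steps)

# A three-step agent run: retrieval, reasoning, summarisation.
steps = [(8_000, 500), (12_000, 2_000), (4_000, 800)]
cost = task_cost_usd(steps)            # 24k input + 3.3k output tokens -> $0.1215
over_budget = cost > 0.10              # compare against the agreed ceiling per task
```

A run like this, over an assumed $0.10 ceiling, is exactly the kind of blocker that should surface in stand-up before it surfaces on the invoice.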
Sprint review. For AI features, the review has two parts:
- UI/functionality demo – same as traditional
- Model and agent performance review – present a dashboard showing model quality vs the evals benchmark, hallucination and bias metrics, cost-per-query, and agent trace analysis for multi-step workflows. A demo of a UI that triggers a low-quality model is a failed sprint. A working agent that costs 10x the budget ceiling is also a failed sprint.
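Agent trace analysis can start very simply: aggregate logged runs to find where multi-step workflows break down. A sketch, with an illustrative trace schema (real traces would come from your observability tooling):

```python
from collections import Counter

# Each trace is the logged outcome of one agent run: the ordered steps
# it executed and, if it failed, the step at which it failed.
traces = [
    {"steps": ["plan", "search", "synthesise"], "failed_at": None},
    {"steps": ["plan", "search", "synthesise"], "failed_at": "synthesise"},
    {"steps": ["plan", "search", "synthesise"], "failed_at": "search"},
    {"steps": ["plan", "search", "synthesise"], "failed_at": "synthesise"},
]

def failure_breakdown(traces: list[dict]) -> tuple[float, Counter]:
    """Return (completion rate, failure count per step) for a batch of traces."""
    failures = Counter(t["failed_at"] for t in traces if t["failed_at"])
    completion_rate = sum(t["failed_at"] is None for t in traces) / len(traces)
    return completion_rate, failures

rate, failures = failure_breakdown(traces)
# rate == 0.25; "synthesise" accounts for 2 of the 3 failures
```

Even this crude breakdown turns the sprint review from "the demo worked" into "75% of runs fail, mostly at the synthesis step", which is a conversation the team can act on.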
Retrospective. Add new questions: "Was our eval set good enough?" "How can we improve our prompt engineering loop?" "Did the prototype-to-production handoff work smoothly?"
3. Cross-functional alignment
Product execution is a team sport. Your role is system integrator, ensuring all partners work toward the same goal.
Smaller, more autonomous teams. AI-native products favour small teams (three to five people) with broad skills. When a PM can prototype and an engineer can evaluate model quality, the team needs fewer handoffs and can move faster. Optimise for autonomy over coordination.
Proactive dependency management. Identify what your team needs from others and what others need from you, early. In smaller teams, unresolved dependencies have an outsized impact.
Single source of truth. Keep the roadmap and sprint boards always up to date. This builds trust and reduces ad-hoc questions.
Lead with a shared narrative. Provide a clear story about the problem being solved. This gives every team member the context for smart, autonomous decisions.
RACI framework. For complex initiatives spanning multiple teams, clarify who is Responsible, Accountable, Consulted, and Informed. This eliminates confusion about who owns what.