Execution and Delivery
The AI-native Definition of Ready, delivery in the builder era, and sprint ceremonies for agentic products.
TL;DR
- AI-native features require a far more rigorous Definition of Ready, including an evals framework, failure mode UX, routing strategy, and cost ceiling.
- When PMs prototype with AI coding tools, delivery starts with a working prototype, not a spec. Engineering's job shifts to hardening and scaling.
- Sprint reviews for AI products must include agent trace analysis and model performance alongside the UI demo.
Discovery ensures you build the right thing. Execution ensures you build it right.
Core delivery principles
1. Small, frequent, uncoupled releases
Avoid heroic, monolithic releases. Ship a steady drumbeat of value. This lets you respond to customer needs quickly and detect problems early.
2. Instrumentation
Instrument products from the outset. Capture data to understand user behaviour, validate hypotheses, and measure impact.
3. Monitoring
Use strong alerting to quickly detect issues. Monitor not just feature success but reliability, accuracy, and performance. For AI features, this extends to model quality metrics, agent completion rates, and cost-per-query trends.
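As a sketch of what such alerting might look like in code (the metric names and threshold values below are illustrative assumptions, not prescribed standards):

```python
from dataclasses import dataclass

@dataclass
class WindowMetrics:
    """Aggregated metrics for one monitoring window (e.g. the last hour)."""
    eval_pass_rate: float         # fraction of sampled outputs passing the eval set
    agent_completion_rate: float  # fraction of agent runs reaching a terminal success state
    cost_per_query_usd: float     # mean inference cost per user query

# Illustrative alert thresholds -- real values come from your DoR and budget.
THRESHOLDS = {
    "eval_pass_rate": 0.90,         # alert if quality drops below the eval benchmark
    "agent_completion_rate": 0.95,  # alert if agents fail to finish too often
    "cost_per_query_usd": 0.05,     # alert if cost exceeds the agreed ceiling
}

def check_alerts(m: WindowMetrics) -> list[str]:
    """Return an alert message for every breached threshold."""
    alerts = []
    if m.eval_pass_rate < THRESHOLDS["eval_pass_rate"]:
        alerts.append(f"eval pass rate {m.eval_pass_rate:.2f} below benchmark")
    if m.agent_completion_rate < THRESHOLDS["agent_completion_rate"]:
        alerts.append(f"agent completion rate {m.agent_completion_rate:.2f} below threshold")
    if m.cost_per_query_usd > THRESHOLDS["cost_per_query_usd"]:
        alerts.append(f"cost per query ${m.cost_per_query_usd:.3f} over ceiling")
    return alerts
```

In practice these checks run inside your observability platform rather than in application code, but the shape is the same: every AI-specific metric gets an explicit threshold and a paging rule.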
Delivery in the builder era
When PMs prototype with AI coding tools, the boundary between discovery and delivery compresses. A PM can walk into sprint planning with a functioning prototype that demonstrates value, tests assumptions, and reveals edge cases, all built in a day or two.
This changes the handoff. The sprint may start with a working prototype instead of a spec. The engineering team's job shifts from "build this from scratch" to "harden, scale, and production-ready this."
What that looks like in practice:
The PM brings a prototype, not a PRD. The prototype is throwaway code, but it's real. Engineers can interact with it, stress-test it, and identify the gaps between demo and production. Conversations become concrete, not abstract.
Engineering estimates change. "Build this feature" becomes "take this working prototype and make it reliable at scale." The estimation conversation shifts from "how long to build?" to "what's the gap between this prototype and production?" Often, the answer is smaller than starting from zero.
The DoR evolves. The Definition of Ready may include a link to the working prototype alongside (or instead of) design mocks. The prototype becomes a shared reference point for the entire team.
Quality gates remain unchanged. A prototype that works is not a product that ships. All the rigour of testing, security review, accessibility, performance benchmarking, and code review still applies. The builder era compresses the front of the cycle, not the back.
Delivery practices
1. Backlog management
A well-managed backlog is the single source of truth. Key practices:
Write clear, actionable user stories. Focus on the who, what, and why. Format: "As a [user type], I want to [take an action] so that I can [achieve a goal]."
Integrate all work. For AI features, the backlog must include new work item types (prompt engineering tasks, eval maintenance items, model retraining tickets) to make this work visible and trackable.
Embrace the Definition of Ready. Use a shared set of criteria to ensure work is properly scoped before development starts. For AI-native features, this is non-negotiable.
Practise continuous refinement. Collaboratively review and break down upcoming items weekly to reduce friction and improve sprint quality.
The AI-native Definition of Ready
For traditional software, the Definition of Ready is simple. For AI-native features, the work is probabilistic and the DoR must be far more rigorous.
An AI-tagged backlog item is not "ready" until it meets all of the following:
| Standard criteria | AI-native criteria (additive) |
|---|---|
| Story is clear | Data provenance verified – data feasibility report is attached and approved |
| Acceptance criteria defined | Evals framework attached – the eval set is included; it serves as the acceptance criteria |
| No blocking dependencies | Acceptance benchmark defined – the AC is a statistical target (e.g., "model must pass the eval set with >90% score") |
| Design mocks attached | Initial prompts attached – the PM has provided baseline prompts to be engineered. This is the "requirement". |
| | Failure mode UX defined – design mocks for model failure are included (how to handle hallucinations, low-confidence answers, user feedback) |
| | Ethical/bias review complete – the red team report is attached, with work to mitigate highest-priority risks |
| | Routing/orchestration strategy defined – for multi-step or agentic features, the workflow graph is documented: which model handles which step, what triggers fallback, how errors propagate |
| | Cost ceiling per query established – the maximum acceptable inference cost per user interaction is agreed, with a plan for staying under it at projected volume |
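To make the statistical acceptance benchmark concrete, here is a minimal sketch of the check. The grading of individual cases is abstracted away (real eval frameworks may use exact-match scoring, rubric graders, or LLM-as-judge); the eval-set size and 90% target are illustrative.

```python
def passes_acceptance(results: list[bool], benchmark: float = 0.90) -> tuple[bool, float]:
    """Given per-case pass/fail results from running the eval set,
    return whether the model meets the statistical acceptance target,
    plus the observed score."""
    score = sum(results) / len(results)
    return score >= benchmark, score

# Each entry is the graded outcome of one eval case (prompt + expected behaviour).
eval_results = [True] * 46 + [False] * 4   # 46 of 50 cases pass
ok, score = passes_acceptance(eval_results)
# 46/50 = 0.92, so this run clears a ">90% score" benchmark
```

The point is that "done" is a number, not a demo: a prompt change that drops the score below the benchmark fails the acceptance criteria even if the happy path still looks fine.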
2. Sprint and release cadence
The cadence is the predictable heartbeat of delivery. Key ceremonies, adapted for AI-native and agentic products:
Sprint planning. Must include estimation for AI-specific tasks (prompt engineering, eval creation) and prioritisation of model retraining items against new features. When a PM brings a working prototype, planning focuses on the delta between prototype and production.
Daily stand-up. Expect new blocker types beyond the usual dependency and environment issues:
- "Agent workflow reliability is below threshold on step 3"
- "Cost per task is exceeding the budget ceiling"
- "Eval regression detected after the latest prompt change"
- "The model is hallucinating on edge case X in staging"
These are first-class blockers, not footnotes.
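The cost blocker above is easy to quantify in a stand-up. A sketch of the arithmetic, using made-up per-token prices (substitute your provider's actual rates and your team's agreed ceiling):

```python
# Hypothetical per-token prices in USD -- check your provider's current pricing.
PRICE_PER_INPUT_TOKEN = 3.00 / 1_000_000    # $3 per million input tokens
PRICE_PER_OUTPUT_TOKEN = 15.00 / 1_000_000  # $15 per million output tokens

def task_cost_usd(steps: list[tuple[int, int]]) -> float:
    """Sum inference cost over a multi-step agent task.
    Each step is (input_tokens, output_tokens)."""
    return sum(i * PRICE_PER_INPUT_TOKEN + o * PRICE_PER_OUTPUT_TOKEN
               for i, o in steps)

# A three-step agent run: retrieval, reasoning, summarisation.
steps = [(8_000, 500), (12_000, 2_000), (4_000, 800)]
cost = task_cost_usd(steps)            # 24k input + 3.3k output tokens -> $0.1215
over_budget = cost > 0.10              # compare against the agreed ceiling per task
```

A run like this, over an assumed $0.10 ceiling, is exactly the kind of blocker that should surface in stand-up before it surfaces on the invoice.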
Sprint review. For AI features, the review has two parts:
- UI/functionality demo – same as traditional
- Model and agent performance review – present a dashboard showing model quality vs the evals benchmark, hallucination and bias metrics, cost-per-query, and agent trace analysis for multi-step workflows. A demo of a UI that triggers a low-quality model is a failed sprint. A working agent that costs 10x the budget ceiling is also a failed sprint.
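Agent trace analysis can start very simply: aggregate logged runs to find where multi-step workflows break down. A sketch, with an illustrative trace schema (real traces would come from your observability tooling):

```python
from collections import Counter

# Each trace is the logged outcome of one agent run: the ordered steps
# it executed and, if it failed, the step at which it failed.
traces = [
    {"steps": ["plan", "search", "synthesise"], "failed_at": None},
    {"steps": ["plan", "search", "synthesise"], "failed_at": "synthesise"},
    {"steps": ["plan", "search", "synthesise"], "failed_at": "search"},
    {"steps": ["plan", "search", "synthesise"], "failed_at": "synthesise"},
]

def failure_breakdown(traces: list[dict]) -> tuple[float, Counter]:
    """Return (completion rate, failure count per step) for a batch of traces."""
    failures = Counter(t["failed_at"] for t in traces if t["failed_at"])
    completion_rate = sum(t["failed_at"] is None for t in traces) / len(traces)
    return completion_rate, failures

rate, failures = failure_breakdown(traces)
# rate == 0.25; "synthesise" accounts for 2 of the 3 failures
```

Even this crude breakdown turns the sprint review from "the demo worked" into "75% of runs fail, mostly at the synthesis step", which is a conversation the team can act on.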
Retrospective. Add new questions: "Was our eval set good enough?" "How can we improve our prompt engineering loop?" "Did the prototype-to-production handoff work smoothly?"
3. Cross-functional alignment
Product execution is a team sport. Your role is system integrator, ensuring all partners work toward the same goal.
Smaller, more autonomous teams. AI-native products favour small teams (three to five people) with broad skills. When a PM can prototype and an engineer can evaluate model quality, the team needs fewer handoffs and can move faster. Optimise for autonomy over coordination.
Proactive dependency management. Identify what your team needs from others and what others need from you, early. In smaller teams, unresolved dependencies have an outsized impact.
Single source of truth. Keep the roadmap and sprint boards always up to date. This builds trust and reduces ad-hoc questions.
Lead with a shared narrative. Provide a clear story about the problem being solved. This gives every team member the context for smart, autonomous decisions.
RACI framework. For complex initiatives spanning multiple teams, clarify who is Responsible, Accountable, Consulted, and Informed. This eliminates confusion about who owns what.