AI-First, Human-Centred
What AI-first actually means in production, why optional AI is a gimmick, and how to build products where AI is the medium, not a feature.
TL;DR
- If removing the AI means the product still works, the AI is a gimmick. AI-first means the product is designed from the ground up with AI as the delivery mechanism, not a bolt-on.
- The design decision isn't "should humans be in the loop?" It's where on the copilot-to-autopilot spectrum each feature sits, calibrated to the risk level.
- You can't lead AI products from a slide deck. AI-first strategy starts in the playground, with leaders who've felt the latency and debugged the hallucinations.
AI-first is not "we added a chatbot." It's not a copilot sidebar. It's not an assistant that drafts emails 20% faster.
AI-first means the product was designed assuming AI is the core delivery mechanism. Remove the AI, and the product doesn't function. The workflow doesn't exist without inference. The value proposition is impossible without generative capability.
That's the litmus test. If the AI is optional, you've built a feature. If the product breaks without it, you've built something real.
What AI-first actually means
Most "AI-first" products are AI-bolted-on. The distinction matters because it determines your architecture, your cost structure, your hiring, and your competitive position.
AI-bolted-on takes an existing product and adds intelligence to existing workflows. An expense platform that auto-categorises receipts. A CRM that suggests next actions. A project tool with a summarisation button. The user still navigates the same screens, fills out the same forms, follows the same multi-step process. You've added inference cost without removing workflow cost. That's a margin trap.
AI-first eliminates the workflow entirely. An agent monitors your transactions, categorises expenses against company policy, assembles the report, and routes it for approval. The user never logs in. The form never exists. Same outcome, different product, different architecture, different economics.
The uncomfortable implication: some of your most complex, most differentiated features are liabilities in an AI-first world. The complexity you're proud of is exactly the thing an agent can route around.
Multi-model orchestration as default architecture
No single model wins every task. If you're hard-coding your product to one model, you're accumulating technical debt that compounds with every release cycle.
AI-first architecture assumes multi-model orchestration. Route reasoning tasks to one model, summarisation to a cheaper one, code generation to a third. Make those decisions at runtime based on cost, latency, and quality requirements.
Your routing layer and your eval framework are your competitive advantage. Your model provider can release a better model tomorrow. They can't release a better understanding of your users' needs.
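A minimal sketch of what runtime routing can look like. Everything here is illustrative: the model names, prices, and latencies are invented, and `eval_scores` stands in for whatever your eval framework produces.

```python
from dataclasses import dataclass

# Hypothetical model catalogue: names, per-1K-token costs, and latencies
# are illustrative assumptions, not real provider pricing.
MODELS = {
    "reasoning": {"name": "frontier-large", "cost_per_1k": 0.015, "latency_ms": 2400},
    "summarise": {"name": "fast-small", "cost_per_1k": 0.0004, "latency_ms": 300},
    "codegen":   {"name": "code-tuned", "cost_per_1k": 0.004, "latency_ms": 900},
}

@dataclass
class Task:
    kind: str             # "reasoning" | "summarise" | "codegen"
    max_latency_ms: int
    quality_floor: float  # minimum eval score required, 0..1

def route(task: Task, eval_scores: dict[str, float]) -> str:
    """Pick the cheapest model that meets the task's latency and quality bars."""
    candidates = [
        (spec["cost_per_1k"], spec["name"])
        for spec in MODELS.values()
        if spec["latency_ms"] <= task.max_latency_ms
        and eval_scores.get(spec["name"], 0.0) >= task.quality_floor
    ]
    if not candidates:
        # No model clears both bars: fall back to the task's default
        # specialist rather than failing the request.
        return MODELS[task.kind]["name"]
    return min(candidates)[1]
```

The design choice worth noting: the routing decision is driven by your own eval scores, not the vendor's benchmarks, which is exactly why the routing layer plus eval framework is the defensible asset.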
Inference economics that surprise you at scale
AI features introduce high, variable COGS. A feature that costs $0.002 per call in a demo might require four or more inference calls per user interaction in production: one to understand the query, one to retrieve context, one to generate the response, and one to check safety. The per-interaction cost can be 10x to 20x what the slide deck projected.
Model your COGS before pricing. Understand prompt caching (a 90% cost reduction changes the entire business model). Budget for the audit tax if you're running multi-agent workflows. Leaders who haven't watched an inference bill scale with real usage will get the economics wrong.
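A back-of-envelope COGS model makes the point concrete. All token counts and per-million-token prices below are illustrative assumptions, not any provider's real rates; the shape of the calculation is what matters.

```python
# Back-of-envelope inference COGS model. Prices and token counts are
# illustrative, not real provider pricing.

def cost_per_interaction(calls, cached_fraction=0.0, cache_discount=0.9):
    """calls: list of (input_tokens, output_tokens, input_price, output_price),
    prices per 1M tokens. Prompt caching discounts the cached share of input."""
    total = 0.0
    for in_tok, out_tok, in_price, out_price in calls:
        input_cost = in_tok / 1e6 * in_price
        input_cost *= 1 - cached_fraction * cache_discount
        total += input_cost + out_tok / 1e6 * out_price
    return total

# One user interaction = four calls: understand, retrieve, generate, safety-check.
calls = [
    (1_500, 100, 3.00, 15.00),  # understand the query
    (4_000, 200, 3.00, 15.00),  # retrieve and rerank context
    (6_000, 800, 3.00, 15.00),  # generate the response
    (2_000, 50, 0.25, 1.25),    # cheap safety check
]

uncached = cost_per_interaction(calls)                      # ~ $0.052
cached = cost_per_interaction(calls, cached_fraction=0.8)   # ~ $0.026
```

Note how the "$0.002 demo" becomes roughly $0.05 per interaction once all four calls are counted, and how caching most of the input tokens roughly halves it. Run this at your projected volume before you set a price.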
Building for trust and control
"AI-first" without "human-centred" produces products nobody trusts, nobody adopts, and nobody wants to be accountable for. The fix isn't blanket human oversight on everything. It's calibrating the level of control to the risk level of each feature.
The copilot-to-autopilot spectrum
Not every feature needs the same level of autonomy. The product decision is where on the spectrum each feature sits:
| Level | User role | Agent role | Example |
|---|---|---|---|
| Copilot | Decides and acts. Agent suggests. | Draft, recommend, surface options. | Email draft suggestions, code completions. |
| Co-driver | Reviews and approves. Agent proposes. | Plan and propose. Human confirms. | Proposed schedules, suggested review comments. |
| Supervised autopilot | Monitors. Steps in on exceptions. | Execute within bounds. Escalate when uncertain. | Automated ticket triage with human review of escalations. |
| Full autopilot | Sets goals. Reviews outcomes periodically. | Execute end-to-end autonomously. | Background data processing, monitoring alerts. |
Features should graduate up this spectrum as reliability improves and trust builds. Don't launch at full autopilot. Launch at copilot, measure, and promote.
The single biggest product mistake in agentic AI: starting at the wrong autonomy level. Too much autonomy with unproven reliability destroys user trust. Too little autonomy with proven reliability wastes the capability.
The 95% trap
A 5-step agentic workflow at 95% accuracy per step delivers 77% system reliability. At 10 steps, 60%. At 20 steps, 36%.
Enterprise buyers expect 99%+. Compounding accuracy math is invisible from a strategy deck and obvious from a terminal. It is the primary reason agentic products stall at prototype stage.
The fix is narrower workflows, not better models. Build SOPs wrapped in code: each step validated independently to 99%+ before chaining. Target patience-heavy tasks (processing 500 invoices, scanning 40 documents), not judgment-heavy ones (choosing a vendor, deciding product strategy). Constraint is clarity. Every degree of freedom you remove from the agent's decision space is a failure mode you've eliminated.
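The compounding math above is two lines of code, which is part of why it's "obvious from a terminal":

```python
def system_reliability(per_step_accuracy: float, steps: int) -> float:
    """Independent steps compound multiplicatively."""
    return per_step_accuracy ** steps

def required_step_accuracy(target: float, steps: int) -> float:
    """Per-step accuracy needed to hit a system-level target."""
    return target ** (1 / steps)

print(round(system_reliability(0.95, 5), 2))    # 0.77
print(round(system_reliability(0.95, 10), 2))   # 0.6
print(round(system_reliability(0.95, 20), 2))   # 0.36

# To deliver 99% over a 10-step workflow, every step needs ~99.9%:
print(round(required_step_accuracy(0.99, 10), 4))  # 0.999
```

The inverse calculation is the sobering one: hitting an enterprise-grade 99% over ten chained steps demands roughly 99.9% per step, which is why each step must be validated independently before chaining.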
Design for imperfection
AI models will produce wrong outputs. Your architecture needs to account for this from day one, not as an afterthought.
Build confidence scoring into every output. Define human escalation thresholds explicitly. Create graceful degradation paths: when the model is uncertain, the product should behave predictably, not confidently wrong. Spot-check architectures, which route only low-confidence outputs to expensive review, keep costs sustainable while maintaining quality.
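A minimal confidence gate, as a sketch. The thresholds here are invented for illustration; in practice they come from your eval data, and the confidence score itself comes from whatever signal you trust (log-probs, a judge model, a verifier).

```python
# Hypothetical confidence-gated output handler. Thresholds are
# illustrative; real values come from eval data.
AUTO_ACCEPT_THRESHOLD = 0.90
HUMAN_REVIEW_THRESHOLD = 0.60

def handle_output(text: str, confidence: float) -> tuple[str, str]:
    """Return (route, payload) so low-confidence outputs degrade predictably."""
    if confidence >= AUTO_ACCEPT_THRESHOLD:
        return ("auto", text)
    if confidence >= HUMAN_REVIEW_THRESHOLD:
        # Only this middle band pays the cost of human spot-checking.
        return ("review", text)
    # Graceful degradation: a predictable fallback, never a confident guess.
    return ("fallback", "I'm not confident in this answer. Escalating to a specialist.")
```

The economics live in the middle branch: only the uncertain band pays for expensive review, which is what keeps the spot-check architecture sustainable.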
The builder-leader requirement
You cannot direct what you don't understand at an intuitive level.
AI has a property that makes it uniquely dangerous to lead from a distance: the failure modes are non-obvious. A hallucination doesn't throw an error. Latency spikes don't show up in quarterly reviews. Prompt drift happens slowly. Cost overruns compound quietly. You only develop intuition for these things by building systems that encounter them.
This isn't a suggestion. It's a requirement for AI product leaders.
Build something real. Not a hackathon demo. A system with real users, real costs, and real failure modes.
Feel the latency. Deploy an AI feature and watch real users wait. It changes how you prioritise.
Debug a hallucination in production. Understand why your carefully crafted prompt produced nonsense for one specific input that worked perfectly for every other.
Manage inference costs. Watch the bill scale and figure out how to make the unit economics work.
Leaders who've done this ask better questions, make better tradeoffs, and can smell when a vendor demo is hiding complexity. Leaders who haven't are operating on borrowed intuition, setting direction for a material they've never touched.
Speed to learning, not speed to shipping
Every AI product initiative is an experiment. The models are probabilistic. The failure modes compound. User behaviour with AI is poorly understood. The goal is measurable learning about user value and model quality, not features shipped.
Three practices make this concrete.
Eval-driven development
Evals are day-one infrastructure, not a post-launch monitoring phase. Before you write the first prompt, define what "good output" looks like. Write those criteria as test cases. Start with 20 examples drawn from real scenarios, not 2,000 from imagination.
Grade the execution path, not just the final output. Did the agent loop unnecessarily? Call the expensive model when the cheap one would do? Take 30 steps when 5 would suffice?
Run the suite on every change that could affect agent behaviour. Teams without evals face weeks of manual testing when a new model drops. Teams with evals run the suite in hours and make a data-driven decision.
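A minimal eval harness capturing both ideas, grading the output and the execution path. Everything here is a sketch: `run_agent` is a stand-in for your real agent entry point, and the case shown is a hypothetical example of the "20 real scenarios" starting set.

```python
# Minimal eval harness sketch. `run_agent` is a stand-in for your real
# agent entry point; the case and step budget are illustrative.

def run_eval(run_agent, cases, max_steps=5):
    """Grade output correctness AND path efficiency for each case."""
    results = []
    for case in cases:
        trace = run_agent(case["input"])  # expected: {"output": str, "steps": int}
        results.append({
            "id": case["id"],
            "correct": case["check"](trace["output"]),
            "efficient": trace["steps"] <= max_steps,  # did it loop or overspend?
        })
    passed = sum(r["correct"] and r["efficient"] for r in results)
    return passed / len(results), results

# Start with ~20 cases drawn from real scenarios. One hypothetical example:
cases = [
    {"id": "refund-policy",
     "input": "Can I return this after 40 days?",
     "check": lambda out: "30 day" in out.lower()},
]
```

When a new model drops, the decision becomes `run_eval` against the candidate versus the incumbent: a pass rate and a diff of failing case IDs, in hours instead of weeks.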
Prototype with AI coding tools
The fastest path to learning is building. AI coding tools (Cursor, Claude Code, Copilot) compress the time from hypothesis to working prototype from weeks to hours. Use them. A PM who can prototype a workflow in an afternoon generates more strategic insight than a PM who writes a spec and waits three sprints for engineering to build it.
This isn't about replacing engineers. It's about closing the feedback loop between product thinking and production reality.
Continuous model monitoring
Models degrade. Prompt performance drifts as input distributions change. New model versions introduce subtle behavioural shifts. The accuracy you measured at launch is not the accuracy you have six months later.
Monitor per-step accuracy in production, not just system-level metrics. Track cost per interaction over time. Set automated alerts for quality regressions. Treat model monitoring with the same rigour you'd apply to infrastructure monitoring.
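A rolling per-step monitor with a regression alert can be sketched in a few lines. Window size, baseline, and tolerance below are illustrative assumptions; you would wire `regressed()` into the same alerting pipeline as your infrastructure monitors.

```python
from collections import deque

class StepMonitor:
    """Rolling accuracy for one workflow step, with a regression alert.
    Baseline, window, and tolerance values are illustrative."""

    def __init__(self, baseline: float, window: int = 500, tolerance: float = 0.03):
        self.baseline = baseline          # accuracy measured at launch
        self.results = deque(maxlen=window)
        self.tolerance = tolerance        # drift allowed before alerting

    def record(self, success: bool) -> None:
        self.results.append(success)

    def current_accuracy(self) -> float:
        return sum(self.results) / len(self.results) if self.results else 1.0

    def regressed(self) -> bool:
        # Alert only once the window is full enough to be meaningful.
        return (len(self.results) == self.results.maxlen
                and self.current_accuracy() < self.baseline - self.tolerance)
```

One monitor per step, not one per workflow: a system-level metric can look flat while a single step quietly decays and the downstream steps absorb the damage.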
What AI-first PMs look like
| Behaviour | In practice |
|---|---|
| Builds to learn | Prototypes AI workflows directly. Has debugged a hallucination, felt latency, and managed an inference bill. Strategy is grounded in production experience. |
| Designs the spectrum | Explicitly decides where each feature sits on the copilot-to-autopilot scale. Doesn't default to "human in the loop" or "full autonomy" without evidence. |
| Knows the math | Can calculate compounding accuracy for multi-step workflows. Won't ship below the reliability threshold. Models inference COGS before setting a price. |
| Evals from day one | Defines success criteria before writing the first prompt. Runs automated eval suites on every change. Treats every production failure as a new test case. |
| Orchestrates, not commits | Builds for multi-model routing. Treats model selection as a continuous optimisation problem, not a one-time vendor decision. |
Anti-patterns
The chatbot sticker. Adding a chat interface to an existing product and calling it AI-first. If the underlying workflow is unchanged, you've added cost without removing cost.
The everything agent. An agent with access to all tools, all data, and a system prompt the length of a novel. Demos beautifully, hallucinates in production, costs a fortune. Decompose into narrow, single-purpose workflows.
The accuracy promise. Committing to deterministic accuracy from a probabilistic engine on a roadmap. "The AI will correctly classify 100% of tickets" is a promise made by someone who has never tried to get a language model to do something consistently across thousands of varied inputs.
The vibe eval. Evaluating agent quality through manual spot checks and gut feel. "Yeah, that looks right" is not a quality bar. It's a leading reason agents stall at prototype stage.
The single-model bet. Hard-coding to one model provider because their last benchmark looked good. By the time you ship, the leaderboard will have flipped. Build the routing layer.
The cost surprise. Pricing AI features without modelling inference COGS at production volume. A feature that looks profitable at 100 users can destroy margins at 10,000 because inference costs scale linearly while revenue might not.