6 articles tagged Evals

AI software quality is a production discipline. Code got cheap, but review, evals, rollback, and observability did not.

When AI writes the code, green CI isn't enough. The new discipline is understanding and defending the choices the model made — not just the ones you made.

DAU, time-in-app, and NPS were built for a world where humans do the work. AI products need different metrics. A framework for what to measure and why.

Most teams evaluate agents with manual chats and gut feel. A practical framework for eval suites that let you ship, starting with 20 examples, not 20,000.

A 5-step agent at 95% accuracy per step is only about 77% reliable end to end (0.95^5 ≈ 0.77, assuming steps fail independently). The path forward isn't better agents; it's narrower ones. Three rules for workflows that ship.
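
The 77% figure follows from multiplying per-step success rates. Below is a minimal sketch of that arithmetic, assuming each step succeeds independently and that any single failed step fails the whole run. The function name `end_to_end_reliability` is hypothetical and used only for illustration.

```python
def end_to_end_reliability(per_step_accuracy: float, steps: int) -> float:
    """Probability that every step succeeds.

    Assumption: steps succeed or fail independently, and one failed
    step fails the whole workflow.
    """
    if not 0.0 <= per_step_accuracy <= 1.0:
        raise ValueError("per_step_accuracy must be between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return per_step_accuracy ** steps


if __name__ == "__main__":
    # Show how reliability decays as the workflow gets longer.
    for steps in (1, 3, 5, 10):
        reliability = end_to_end_reliability(0.95, steps)
        print(f"{steps:>2} steps at 95% per step: {reliability:.1%}")
    # The 5-step line prints 77.4%
```

Under the same assumptions, cutting the workflow from 5 steps to 3 raises end-to-end reliability to roughly 86% without improving any individual step, which is the arithmetic behind preferring narrower agents.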

Building for a single model is technical debt with a short shelf life. The winning strategy is orchestration, evals, and governance, not leaderboard loyalty.