
Hello Hello…

A year ago, if you asked an AI team what their product spec looked like, you would get a Notion page. Features. User stories. Acceptance criteria. A PRD.

Now ask the same question. The answer is different.

You get a folder of eval files. Test cases. Rubrics. Pass-rate dashboards. The PRD still exists — but it is no longer the thing the team builds to. The evals are.

This shift is real and it is reshaping how AI products get built. "Evals are the new PRD" is not a clever phrase. It is a description of what is actually happening inside the best AI product teams right now.

If you wrote a PRD last year and the engineering team shipped something that did not match, you had a clear conversation about what went wrong. With AI, the same prompt can produce different outputs every time. A static document cannot define "done" for something non-deterministic. You need a test that runs continuously and tells you, in real numbers, whether the system is behaving the way you specified.

That is what an eval is. And that is why evals — not PRDs — are becoming the primary artifact PMs ship.

If you are new to evals entirely, the original eval guide covers the foundations.

Today’s article assumes you know the basics. What has changed is how sophisticated evals have become — and how much more product work they now absorb.

Why PRDs Break For AI Products

A PRD works for deterministic software. Click "Save" → it saves. Engineer implements, QA tests, ship.

AI breaks every assumption in that sentence. The same input produces different outputs. A prompt change that makes one behaviour better makes three others worse. Models update and your product behaves differently — without anyone on your team touching the code.

A PRD cannot keep up with this. By the time you have edited the document, the product has already drifted.

Evals solve this because they are continuously running specifications. Every change — a new prompt, a new model, a new tool — runs through the eval suite before shipping. If the pass rate drops, you do not ship. If it holds, you ship with confidence. The spec is never out of date because the spec is executable.

This is why the best AI product teams have quietly stopped treating evals as a QA function. Evals are how the product is defined, not how it is tested after the fact.


Example: SupportBot

To make this concrete, I am going to use one product throughout this article and build real evals for it at every level.

SupportBot: an AI customer support agent for a SaaS project management tool called Flowdesk. SupportBot can answer questions about the product, look up a customer's account data, process refunds, and escalate to a human agent when needed.

Here are three behaviours the PM has written into the spec:

  1. If a customer mentions a billing dispute, hand off to a human within two turns.

  2. Never guess account data — always use the account lookup tool before answering questions about a customer's plan, billing, or usage.

  3. If a customer asks for a refund, verify identity, check the refund policy, process the refund through the billing tool, then confirm to the customer.

These are reasonable requirements. The problem is that a PRD cannot verify them. Let's build evals that can.

The Three Types of Evals You Need in 2026

When I first wrote about evals, most AI products were single-turn. A user asks a question, the model answers. Evaluate the answer. Simple.

That world is mostly gone. AI products now hold multi-turn conversations, use tools, take sequences of actions, and work inside agent loops. Your eval approach has to match.

Here are the three levels — and what SupportBot's evals look like at each one.

Three levels of Evals - each helping measure quality of the output


1. Single-Turn Evals — The Foundation

One input. One output. One rule.

For SupportBot, behaviour #1 from the spec becomes this eval:

Eval #1: Billing dispute → must hand off

  • Rule: If the user mentions a billing dispute in their message, the response must include a clear handoff to a human agent. TRUE/FALSE.

  • Examples: 50 real customer messages that mention billing disputes, pulled from support logs.

  • Pass target: 48/50 or above.

That is it. You run this against every version of SupportBot before shipping. If a prompt change drops it to 40/50, you do not ship.
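
As a sketch, here is Eval #1 as a runnable gate. The keyword rule and the toy responses are placeholders; in production the TRUE/FALSE judgment usually comes from an LLM judge, but the ship/no-ship logic is the same:

```python
def handoff_rule(response):
    """TRUE iff the response clearly offers a human handoff (placeholder heuristic)."""
    lowered = response.lower()
    return "human" in lowered or "connect you" in lowered

def gate(responses, rule, passes_needed):
    """Run the rule over every response; ship only if enough pass."""
    passed = sum(1 for r in responses if rule(r))
    return passed, passed >= passes_needed

# Toy stand-ins for SupportBot's responses to 50 billing-dispute messages:
responses = ["Let me connect you with a human agent."] * 49 + \
            ["Your invoice total is $42."]

passed, ship = gate(responses, handoff_rule, passes_needed=48)
print(passed, ship)  # 49 True
```

Swap in real responses sampled from your support logs and this becomes the pre-ship check described above.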

Eval #2: No guessing on account data

  • Rule: If the user asks about their current plan, usage, or billing details, the response must not state any specific account data unless the account lookup tool was called first. TRUE/FALSE.

  • Examples: 40 messages asking account-specific questions.

  • Pass target: 40/40. This one is binary — not a single guessed answer is acceptable.

Single-turn evals are fast, cheap, and easy to reason about. They catch obvious regressions. Every team needs them.

But they miss almost everything that actually breaks in production. Because most AI products are not single-turn anymore.

2. Multi-Turn Evals — Where Real Products Live

Real users do not ask one question and leave. They have conversations. They clarify. They change their minds. They ask follow-ups that depend on what was said three turns ago.

Multi-turn evals test behaviour across a conversation, not within a single response.

Things single-turn evals cannot catch but multi-turn evals can:

  • Context drift — the model starts responding as a different persona halfway through the conversation

  • Knowledge retention — the model forgets something the user said two turns ago and asks for it again

  • Role adherence — the model stays in character for five turns, then breaks character when something unexpected happens

  • Conversation completeness — the model ends the conversation without actually solving the user's problem

For SupportBot, here is what a multi-turn eval looks like in practice:

Eval: Billing dispute handoff within two turns

  • Rule: Within two assistant turns of the user first mentioning a billing dispute, the assistant must offer to connect them with a human. A conversation where this does not happen by turn 6 is a FAIL regardless of what happens after.

  • Examples: 30 simulated conversations where billing disputes come up at different points — turn 1, turn 3, turn 5. Some also include the user changing topics before the dispute, to test whether SupportBot still catches the signal.

  • Pass target: 28/30.

You can run this two ways: evaluate the entire conversation holistically, or evaluate turn-by-turn using a sliding window of the last few turns as context. Most production teams do both — holistic for overall experience, turn-by-turn for finding exactly where SupportBot lost the thread.

The examples in your test set are no longer single messages. They are entire transcripts — either simulated or sampled from real logs — with the rule applied at the end or at each turn.
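
The turn-by-turn version of the billing-dispute rule can be sketched as a function over one of those transcripts. The transcript format and the keyword heuristics are illustrative; a production version would typically hand each turn to an LLM judge instead:

```python
# Illustrative rule: within two assistant turns of the first
# billing-dispute mention, the assistant must offer a human handoff.
DISPUTE_WORDS = ("dispute", "charged twice", "wrong charge")
HANDOFF_WORDS = ("human", "agent", "connect you")

def dispute_handoff_rule(transcript):
    """transcript: ordered list of (role, text) tuples."""
    dispute_seen = False
    assistant_turns_since = 0
    for role, text in transcript:
        lowered = text.lower()
        if role == "user" and any(w in lowered for w in DISPUTE_WORDS):
            dispute_seen = True
        elif role == "assistant" and dispute_seen:
            assistant_turns_since += 1
            if any(w in lowered for w in HANDOFF_WORDS):
                return True   # handoff offered in time
            if assistant_turns_since >= 2:
                return False  # two assistant turns used, no handoff
    return not dispute_seen   # vacuously TRUE if no dispute came up

convo_pass = [
    ("user", "I think I was charged twice this month"),
    ("assistant", "I'm sorry about that, let me check."),
    ("assistant", "I'll connect you with a human agent now."),
]
convo_fail = [
    ("user", "I want to dispute a charge"),
    ("assistant", "Sorry to hear that."),
    ("assistant", "Anything else I can help with?"),
]
print(dispute_handoff_rule(convo_pass), dispute_handoff_rule(convo_fail))  # True False
```

The same function works turn-by-turn or over the full transcript, which is why teams can run both modes off one test set.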

This is meaningfully harder to build than single-turn evals. But it is where your product actually lives.

3. Agent Evals — Did It Do The Right Thing Or Just Produce The Right Output?

Agent products add a third dimension: the sequence of actions the agent takes, not just what it ends up saying.

Here is where behaviour #3 — the refund flow — becomes the most important eval in SupportBot's suite.

The scenario: A customer says: "I want a refund for my last invoice."

  1. A single-turn eval asks: did the final response confirm the refund?

  2. A multi-turn eval asks: did the whole conversation resolve the issue?

  3. An agent eval asks: did SupportBot take the right sequence of actions to get there?

Eval: Refund trajectory

  1. Ideal trajectory: lookup_account → verify_identity → check_refund_policy → process_refund → confirm to user

  2. Rule: The agent must call all four tools in a logical order before sending a confirmation message.

    1. Confirming a refund without calling process_refund is a FAIL.

    2. Calling process_refund without first calling verify_identity is a FAIL.

  3. Examples: 20 refund conversations with varying customer messages — some polite, some frustrated, some ambiguous about which invoice.

  4. Pass target: 18/20.

This catches a specific failure mode the other eval types miss entirely: SupportBot getting the right answer for the wrong reason. Say it responds "your refund has been processed" without ever calling the billing tool, or confirms the customer's account details by guessing instead of looking them up.
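
A trajectory check for the refund flow can be sketched as a function over the agent's ordered tool calls. The tool names come from the ideal trajectory above; the checker itself is an illustrative sketch:

```python
def refund_trajectory_rule(tool_calls, confirmed_to_user):
    """tool_calls: ordered list of tool names the agent invoked."""
    def called_before(a, b):
        return a in tool_calls and b in tool_calls and \
               tool_calls.index(a) < tool_calls.index(b)

    if confirmed_to_user and "process_refund" not in tool_calls:
        return False  # confirmed a refund it never processed
    if "process_refund" in tool_calls and \
            not called_before("verify_identity", "process_refund"):
        return False  # refunded without verifying identity first
    required = {"lookup_account", "verify_identity",
                "check_refund_policy", "process_refund"}
    return required.issubset(tool_calls)

good = ["lookup_account", "verify_identity",
        "check_refund_policy", "process_refund"]
bad = ["lookup_account", "process_refund"]  # skipped the identity check
print(refund_trajectory_rule(good, confirmed_to_user=True))  # True
print(refund_trajectory_rule(bad, confirmed_to_user=True))   # False
```

Note that the rule encodes ordering constraints, not just set membership: calling the right tools in the wrong order still fails.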

Agent evals come in three flavours:

  • Final response evals — check only the end result (single-turn logic applied to agents)

  • Trajectory evals — check the full sequence of actions the agent took

  • Single-step evals — check each individual tool call in isolation

Trajectory evals are the most powerful and the most expensive to build. You need to define what the ideal action sequence looks like for each test case, then compare the agent's actual sequence against it. For high-stakes agents — anything taking real actions on real systems — trajectory evals are the only way to know if they are actually doing the right thing, not just saying the right thing.

LLM-As-Judge — And Why You Have To Calibrate It

Hand-labelling 50 conversations after every prompt change is not sustainable. So most teams use an LLM as a judge — give another model the rule, the input, and the response, ask it to output TRUE or FALSE.

For the billing dispute eval, the judge instruction looks like this:


Here is a customer support conversation.
Here is the rule: within two assistant turns of the user mentioning a billing dispute,
the assistant must offer to connect them with a human.

Conversation: [transcript]

Did the assistant follow the rule? Output only TRUE or FALSE.

Fast. Scalable. Useful.

Here is what nobody tells you: LLM judges carry consistent biases. One study found that simply swapping the order of two candidate answers caused GPT-4's judgment to flip — and that a weaker model could beat ChatGPT on 82% of test cases just by appearing first in the prompt. The biases are predictable:

  1. Position bias — judges favour responses presented first

  2. Length bias — judges rate longer responses higher regardless of quality

  3. Agreeableness bias — judges over-accept outputs without challenging them

You can combat all three. Randomise order. Penalise length in the rubric. Force the judge to produce evidence before scoring. But the single most important thing is calibration.
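
The order-randomisation fix can be made mechanical for pairwise judgments: ask the judge twice with the candidates swapped, and keep only verdicts that survive the swap. A sketch, with a stub standing in for the real judge call:

```python
def biased_stub_judge(first, second):
    """Stand-in for an LLM judge that always prefers whichever answer came first."""
    return "first"

def debiased_compare(judge, a, b):
    v1 = judge(a, b)            # A shown first
    v2 = judge(b, a)            # B shown first
    if v1 == "first" and v2 == "second":
        return "A"              # A wins regardless of position
    if v1 == "second" and v2 == "first":
        return "B"              # B wins regardless of position
    return "tie_or_unreliable"  # verdict flipped with order: discard it

print(debiased_compare(biased_stub_judge, "answer A", "answer B"))
# tie_or_unreliable (the stub's verdict flips with order, so it is discarded)
```

A position-biased judge never produces a confident winner under this scheme, which is exactly the behaviour you want.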

Take 100 examples. Label them by hand. Run the judge on the same 100. Check agreement. If the judge agrees with your labels less than roughly 90% of the time, rewrite the judge prompt and recalibrate before trusting it at scale.
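
The agreement check itself is a few lines. The labels below are toy data; in practice you would load your hand-labelled examples and the judge's verdicts on the same set:

```python
# Toy calibration data: human labels vs LLM-judge verdicts on the same examples.
human = [True, True, False, True, False, True, False, True]
judge = [True, True, True,  True, False, True, False, False]

agree = sum(h == j for h, j in zip(human, judge))
agreement = agree / len(human)

# Directional disagreements tell you HOW the judge is wrong:
false_accepts = sum(j and not h for h, j in zip(human, judge))  # judge too lenient
false_rejects = sum(h and not j for h, j in zip(human, judge))  # judge too strict

print(round(agreement, 2), false_accepts, false_rejects)  # 0.75 1 1
```

Splitting disagreement by direction matters: a lenient judge inflates your pass rate, while a strict one blocks ships that were actually fine.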

The PM's New Job

The skill that separates great AI PMs from average ones in 2026 is not prompt engineering. It is eval engineering.

Defining, in testable terms, what good behaviour looks like. Writing rubrics a machine can apply consistently. Calibrating judges against human judgement. Deciding which failures matter and which are acceptable. Running the suite on every change and making ship decisions from the numbers.

This is product work — harder than writing a spec for a web form. But the AI PMs who do it well become indispensable, because they are the only people on the team who can say, with real data, whether the product is better today than it was yesterday.

Engineers cannot do this alone. They can hill-climb the evals, but someone has to write them. Someone has to decide that "the agent must hand off if the customer mentions legal action" matters more than "the agent should sound friendly." That decision requires product judgement.

Every AI product team I respect has a PM who owns the evals. Sometimes it is their entire job. Always it is the most important part of their job.

Where To Start

If your team does not have evals yet, do not try to build everything at once.

Go back to the three SupportBot behaviours from earlier. Pick the one that would most embarrass your company if it failed publicly. For most products, that is the refusal or escalation behaviour — the thing the agent must never get wrong.

Write one strict TRUE/FALSE rule for it. Pull 30 real examples from your logs. Label them by hand. Get a baseline.

That is your MVP. A day of work. One eval set.

Then expand. Add multi-turn cases as you see conversational failures in production. Add trajectory evals when you add agentic capabilities. Add an LLM judge when hand-labelling becomes a bottleneck, and calibrate it before you trust it.

The product that has evals and the product that does not are not competing in the same league anymore. The eval-driven team ships faster, regresses less, and makes better decisions about what to build next. The team without evals is shipping on vibes.

Evals are not homework. They are the spec.

The PMs who win in AI will not be the ones who write the best prompts. They will be the ones who define what "good" looks like more precisely than anyone else on the team — and then measure it on every single ship.

That’s it for today
—Sid
