Hey hey,

Every product team has a review process. A product manager writes a PRD and circulates it to design, engineering, legal, operations, science, and leadership.

The point is to catch problems early and reduce risk. In practice, those reviews often turn into detective work.

A reviewer questions a growth assumption. Someone points out a success metric with no guardrail. A science partner mentions that another team already tested this idea in an experiment nobody remembered.

So the meeting that is supposed to be a judgement session becomes a reconstruction exercise: surfacing adjacent impacts, digging up prior context, and asking questions that would have been far more useful weeks earlier.

Lakshmi Ashok, a product lead at Uber, named the problem clearly:

It isn't that PMs lack rigour. It's that good product decisions need a 360-degree view that is almost impossible to assemble by hand. Adjacent impacts, partner concerns, prior experiments, hidden dependencies, and the questions a senior reviewer is likely to ask. All of it sits scattered across documents, decks, dashboards, and people's memories.

So in May 2026, Uber shipped a tool to assemble that view for you: the PRD Evaluator, an AI that reviews your PRD before any human does.

Why a First-Pass Reviewer, Not a New Process

Uber runs product development through a structured checkpoint process.

It's a series of gated reviews that give leadership visibility and keep execution consistent. The team behind the Evaluator deliberately left that process untouched.

Their reasoning is the kind of thing PMs will recognise.

A checkpoint is only as good as the material entering it.

If a PRD arrives weak, the checkpoint burns its energy fixing the document instead of making the call. Rather than redesign the forum, they decided to improve the input.

That led to a simple question: what if every PM had a fast, contextual first-pass reviewer before a PRD ever reached the expensive review rooms?

From Draft to Scorecard: How It Works

The Evaluator runs in four steps.

Step One: Build a Knowledge Base Around the PRD.

A weaker tool would just read your document and grade the writing. The Evaluator treats the PRD as a starting point. It then looks across linked documents, related decks and meeting notes, prior experiments, and cross-functional artefacts.

On top of that, it carries a preloaded layer of Uber-specific context, such as the company's core principles, metric definitions, and key jobs to be done.

This is an applied version of a technique called retrieval-augmented generation, or RAG.

Instead of relying only on what an AI model already learned during training, you first retrieve the specific, current documents that matter, then hand them to the model alongside the question.

Think of it as a new reviewer who, before writing a single comment, reads every related doc and past experiment overnight and remembers all of them.
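To make the retrieve-then-ask idea concrete, here is a minimal sketch in Python. It is not Uber's implementation: the `Doc` type, the word-overlap scoring (a crude stand-in for the embedding search a real system would likely use), and the sample documents are all assumptions made up for illustration.

```python
import re
from dataclasses import dataclass


@dataclass
class Doc:
    title: str
    text: str


def tokens(text: str) -> set[str]:
    """Lowercased word set; a crude stand-in for an embedding."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(prd: str, corpus: list[Doc], k: int = 2) -> list[Doc]:
    """Rank documents by word overlap with the PRD and keep the top k."""
    prd_tokens = tokens(prd)
    scored = [(len(prd_tokens & tokens(d.text)), d) for d in corpus]
    scored = [pair for pair in scored if pair[0] > 0]  # drop unrelated docs
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [d for _, d in scored[:k]]


def build_prompt(prd: str, context: list[Doc]) -> str:
    """Hand the retrieved documents to the model alongside the question."""
    sources = "\n".join(f"[{d.title}] {d.text}" for d in context)
    return f"Context:\n{sources}\n\nPRD:\n{prd}\n\nReview this PRD using the context."


# Hypothetical corpus, invented for this example.
corpus = [
    Doc("Past experiment", "Prior experiment on rider tipping prompts showed no lift"),
    Doc("Metric glossary", "Definition of completed trips and rider retention"),
    Doc("Office move", "Seating plan in the new building"),
]
prd = "Add a tipping prompt for riders after completed trips"
context = retrieve(prd, corpus)
prompt = build_prompt(prd, context)
```

The unrelated "Office move" note never reaches the prompt, which is the whole point: the model only sees the material that bears on this PRD.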

Step Two: Classify the PRD to Set the Review Depth.

Not every document deserves the same scrutiny, and a tool that treats a button-colour tweak as seriously as a pricing change loses people's trust.

So the Evaluator first classifies the proposal and calibrates. A user-experience parity or discoverability change gets a lighter pass. An incremental workflow change or an internal tooling migration gets a moderate one. A net-new capability gets a full review.

Anything touching policy, pricing, or the marketplace gets a full review with specialised scrutiny. That calibration is a quiet but important product decision. It keeps the tool from crying wolf.
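The calibration described above can be pictured as a small lookup, sketched here in Python. The category keys, the `Depth` labels, and the fallback for unknown categories are assumptions for illustration; the article does not say how Uber encodes any of this.

```python
from enum import Enum


class Depth(Enum):
    LIGHT = "lighter pass"
    MODERATE = "moderate review"
    FULL = "full review"
    FULL_SPECIALISED = "full review with specialised scrutiny"


# Keys are hypothetical labels that paraphrase the categories in the text.
DEPTH_BY_CATEGORY = {
    "ux_parity": Depth.LIGHT,
    "discoverability": Depth.LIGHT,
    "incremental_workflow": Depth.MODERATE,
    "internal_tooling_migration": Depth.MODERATE,
    "net_new_capability": Depth.FULL,
}

SENSITIVE_AREAS = {"policy", "pricing", "marketplace"}


def review_depth(category: str, touches: set[str]) -> Depth:
    """Pick a review depth; sensitive areas override the category."""
    if touches & SENSITIVE_AREAS:
        return Depth.FULL_SPECIALISED
    # Assumption: an unrecognised category defaults to a full review.
    return DEPTH_BY_CATEGORY.get(category, Depth.FULL)
```

Under these assumptions, a discoverability tweak gets `Depth.LIGHT`, but the same tweak gets `Depth.FULL_SPECIALISED` the moment it touches pricing.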

Step Three: Assess Launch Readiness Across Dimensions.

The Evaluator then grades the PRD against a fixed set of axes. It checks:

  • Opportunity and hypothesis: Is the problem real, and is success defined clearly enough to evaluate?

  • Product scope: Is the proposal understandable, well-scoped, and ready for a decision?

  • User experience: Is the impact considered across segments, geographies, and edge cases?

  • Metric and data rigour: Does the PRD define success, name its guardrails, and offer a credible way to validate the bet?

Under the hood, this is a pattern the AI industry has been refining for a while: using one AI model to grade work against a rubric, often called LLM-as-a-judge.

When the rubric is well-built, these judges line up with human reviewers a large share of the time. The hard part was never the grading. It was writing a rubric tied to real decision criteria and real failure modes, rather than vague notions of "good writing."
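A rough sketch of the LLM-as-a-judge pattern, in Python, might look like the following. The rubric wording paraphrases the four dimensions above. The 1-5 scale, the JSON reply format, and `fake_model` (a stand-in for a real model call) are assumptions, not details Uber has published.

```python
import json
from typing import Callable

RUBRIC = {
    "opportunity_and_hypothesis": "Is the problem real, and is success defined clearly enough to evaluate?",
    "product_scope": "Is the proposal understandable, well-scoped, and ready for a decision?",
    "user_experience": "Is impact considered across segments, geographies, and edge cases?",
    "metric_and_data_rigour": "Does the PRD define success, name guardrails, and offer a credible way to validate the bet?",
}


def judge_prompt(prd: str) -> str:
    """Spell out the rubric so the judge grades against it, not against style."""
    criteria = "\n".join(f"- {name}: {question}" for name, question in RUBRIC.items())
    return (
        "Grade the PRD against each criterion. Reply with JSON mapping each "
        "criterion name to an object with 'score' (1-5) and 'reason'.\n"
        f"Criteria:\n{criteria}\n\nPRD:\n{prd}"
    )


def grade(prd: str, model: Callable[[str], str]) -> dict[str, dict]:
    """Call the judge and reject replies that skip a criterion."""
    raw = json.loads(model(judge_prompt(prd)))
    missing = set(RUBRIC) - set(raw)
    if missing:
        raise ValueError(f"judge skipped criteria: {sorted(missing)}")
    return {name: raw[name] for name in RUBRIC}


def fake_model(prompt: str) -> str:
    """Stand-in for a real LLM call; always returns the same grades."""
    return json.dumps({name: {"score": 3, "reason": "placeholder"} for name in RUBRIC})


grades = grade("Add a tipping prompt for riders.", fake_model)
```

Notice how little of this is the grading call. Almost all of the substance lives in `RUBRIC`, which matches the article's point that the rubric is the hard part.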

Step Four: Produce a Scorecard Built for Action.

This is where the team spent its design effort.

The output is a structured scorecard: a launch-readiness rating, a dimension-by-dimension assessment, and a single "start here" pointer to the most important fix.

For each gap, the Evaluator does three things.

  1. It names what is missing.

  2. It offers write-ready replacement text the PM can drop straight in.

  3. It cites evidence, such as a linked doc or a prior experiment, so the suggestion isn't just an opinion.

Then it splits the work into critical requirements vs. optional optimisations.
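One way to picture that output is as a small data structure, sketched here in Python. The field names, the sample gaps, and the rule that "start here" means the first critical gap are assumptions for illustration; the article does not describe Uber's actual schema or how it picks the top fix.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Gap:
    missing: str           # what is missing
    replacement_text: str  # write-ready text the PM can drop in
    evidence: str          # linked doc or prior experiment backing the suggestion
    critical: bool         # critical requirement vs. optional optimisation


@dataclass
class Scorecard:
    rating: str
    gaps: list[Gap] = field(default_factory=list)

    def critical(self) -> list[Gap]:
        return [g for g in self.gaps if g.critical]

    def optional(self) -> list[Gap]:
        return [g for g in self.gaps if not g.critical]

    def start_here(self) -> Optional[Gap]:
        """Assumption: the most important fix is the first critical gap."""
        ordered = self.critical() + self.optional()
        return ordered[0] if ordered else None


# Hypothetical example content.
card = Scorecard(
    rating="needs work",
    gaps=[
        Gap("Rollout plan lacks a holdout group", "Hold out 5% of users.", "Linked doc (hypothetical)", False),
        Gap("No guardrail metric", "Guardrail: support contact rate.", "Prior experiment (hypothetical)", True),
    ],
)
first_fix = card.start_here()
```

Here `first_fix` is the guardrail gap, even though it was listed second, because critical items outrank optional ones.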

The Decision that Mattered Most

Ask the team what made the Evaluator work, and the answer is not the score. It is actionability. PMs get almost nothing from feedback like "be more specific" or "think through downside risk."

So the Evaluator was built to convert critique into revision: define the baseline, name the target, add the guardrail, scope the first release more narrowly, make the dependency explicit.

That single choice changes the workflow from passive critique into active improvement. It's the difference between a reviewer who tells you the essay is weak and one who shows you the paragraph to rewrite.

Two more decisions shaped how honest the tool feels.

  1. The team set hard boundaries on what counts as a "critical" gap, so the Evaluator wouldn't politely call a PRD review-ready when the fundamentals were missing.

  2. They treated prioritisation as part of the product itself because a tool that flags everything as important is just noise wearing a suit.
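The first of those decisions, a hard boundary on "critical", can be sketched as a gate that no amount of polish elsewhere can override. This Python sketch assumes the critical items are the baseline, target, and guardrail mentioned earlier; the field names and rating labels are hypothetical.

```python
# Hypothetical set of fundamentals; a PRD missing any of them cannot pass.
CRITICAL_FIELDS = ("baseline", "target", "guardrail")


def missing_critical(prd_fields: dict[str, str]) -> list[str]:
    """Return the critical fields that are absent or blank."""
    return [name for name in CRITICAL_FIELDS if not prd_fields.get(name, "").strip()]


def readiness(prd_fields: dict[str, str]) -> str:
    """Hard gate: any missing fundamental blocks the review-ready rating."""
    return "not review-ready" if missing_critical(prd_fields) else "review-ready"


draft = {"baseline": "12% weekly retention", "target": "14%", "guardrail": "  "}
```

With this draft, `readiness(draft)` comes back as "not review-ready" and `missing_critical(draft)` names the blank guardrail, however well the rest of the document reads.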

There's a tell in their framing worth sitting with.

They explicitly did not want a writing tool that rewards polished prose, because a PRD can read beautifully and still miss the context, framing, and decision logic that determine whether it survives review. The Evaluator grades the thinking.

What It Actually Changed

Uber has been careful not to oversell results, and so should we: there are no published accuracy figures or hard productivity numbers yet.

What the team reports is early internal adoption by dozens of PMs across the company and a consistent pattern in how the work changed. PMs caught blind spots earlier.

They pressure-tested growth assumptions before a senior reviewer had to. They saw how a change might ripple into adjacent systems outside their own surface.

And the signal the team cared about most was that the review conversations themselves got sharper and faster, because people stopped spending the first 20 minutes rebuilding context.

Just as telling is where Uber drew the line. The Evaluator does not make approval decisions, and it does not replace domain experts. It sits upstream, strengthening the artefact before expert review.

In a Nutshell

This is not Uber's first tool of this shape. It is a sibling to uReview, the company's AI code reviewer. It's the same move, pointed at product documents instead of code.

In both cases, the AI doesn't make the call. It raises the quality of what reaches the human who does. That is the part likely to outlast this one tool.

The instinct with AI is to ask what decisions it can take over. Uber's bet is quieter and, for now, more useful: aim it at the input to a human decision, not the decision itself.

Assemble the context no single person can hold, surface the blind spots, and let the experts spend their judgment on judgment.

There's even a small wink in the blog itself. Uber notes that the post's cover image was made by Gemini and its scorecard examples by Claude. The tool that reviews how Uber builds products was written up with help from the same kind of models it runs on.
