AI is changing how we build products. Traditional loops like Build, Verify, Measure break down when what we build is probabilistic. In this article, I explore how AI disrupts standard product workflows, then I propose a solution.
Introduction
If we had to boil the product development process down to three steps, they could be: Build-Verify-Measure. We build something, we verify it works, we measure its impact on the business.
The point of this article is to explain how this product development loop breaks when building AI-powered features, and offer some solutions. We will start by going over the Build‑Verify‑Measure loop, then we’ll dive into the reasons why AI disrupts it. Finally, we’ll propose solutions and flag some risks.
Side note: below, non-AI features are sometimes called deterministic features, as opposed to probabilistic features, which describe AI-based ones. More on that later.
The traditional loop for deterministic features
I am keeping the Strategy and Research phases outside this article to keep it short. As mentioned above, product builders cycle through three execution-focused phases: Build, Verify, Measure.
- Build the product – Designers and engineers create and implement a solution and release it.
- Verify the output – QA confirms that the feature behaves as specified.
- Measure the outcome – The team analyses adoption and impact through analytics and user interviews.
For deterministic features, this loop works perfectly because there is no uncertainty between steps 1 and 2. Here is an example. We create a new way for users to re-color a chart. Once this feature is coded and tested, we can be confident that every user sees the same thing. This feature is deterministic.
AI features break the loop
AI outputs, in contrast, are stochastic. The model rolls a die for each answer it generates. Even if two people ask the exact same question and enter the exact same input, they may receive different answers.
This variability can be explained by:
- LLMs’ inner workings: LLMs probabilistically choose among many plausible tokens instead of locking into a single fixed answer. This creates a distribution of possible answers.
- Models evolve: Like most tech firms, LLM providers update their models. Yesterday’s model weights – a key determinant of the distribution of possible answers – may differ from today’s.
- User‑level context: Private documents, previous chat messages… it all nudges responses in unique directions.
- The agentic loop: With each autonomous agent call, the distribution of possible answers multiplies.
Because of this, the traditional Verify step breaks. We are no longer testing “product paths” in a binary fashion. In this new world of probabilistic features, product teams evaluate a distribution of possible outputs and decide which segment is acceptable.
New phase needed: Calibrate
Should we move to Build > Calibrate > Verify > Measure?
The goal of the calibrate phase
If we run an AI agent 10,000 times, we get 10,000 answers of varying quality, and we can plot a distribution. The horizontal axis measures the quality of the answer (more on that later); the vertical axis measures the number of answers at each quality score. The shape of this distribution will vary in two ways:
- Kurtosis: how concentrated are the scores?
- Skewness: where are the scores concentrated?
With these two statistical concepts in mind, what’s the goal in the context of building AI-powered features? It’s to increase the kurtosis and skew the distribution towards the desired quality.
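To make this concrete, here is a minimal sketch of how the shape of a scored calibration run could be summarized in Python. It assumes answers have already been graded on a 0-10 quality scale; the scores below are toy values.

```python
# Minimal sketch: summarize the shape of a score distribution from a calibration run.
# Assumes each answer has already been graded on a 0-10 quality scale (toy values below).
import numpy as np
from scipy.stats import skew, kurtosis

scores = np.array([7.5, 8.0, 6.5, 9.0, 4.0, 8.5, 7.0, 8.0, 3.5, 9.5])

print(f"mean quality:    {scores.mean():.2f}")
print(f"skewness:        {skew(scores):.2f}")      # negative here: most answers score high, with a tail of low outliers
print(f"excess kurtosis: {kurtosis(scores):.2f}")  # rough proxy for how concentrated the scores are
```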
The tasks of the calibrate phase
- Define the characteristics of a correct answer.
- Collect a sample of input-output pairs.
- Run the test, score outputs and build the distribution.
- Identify and test the fundamental determinants of the distribution.
1/ Define the characteristics of a correct answer
Most AI tools will have more than one use case. For each use case, you identify what good answers look like. Here is how Harvey AI measures quality. According to their keynote from Interrupt, they look at four axes (a code sketch of such a rubric follows the list):
- Substance: How much true information was present in the answer?
- Hallucination: How much false information was present in the answer?
- Structure: Does it contain tables, long form text, citations…?
- Style: Is the voice-and-tone respecting the brand guidelines?
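To make these axes a bit more concrete, here is a minimal sketch of how such a rubric could be represented in code. The field names mirror the four axes above, but the 0-1 scales and the weights are illustrative assumptions, not Harvey AI’s actual methodology.

```python
# Minimal sketch of a per-answer grading rubric mirroring the four axes above.
# The 0-1 scales and the weights are illustrative assumptions, not Harvey AI's scheme.
from dataclasses import dataclass

@dataclass
class AnswerGrade:
    substance: float      # share of the expected true information present (0-1)
    hallucination: float  # share of the answer that is false or unsupported (0-1)
    structure: float      # does it use the expected format: tables, citations... (0-1)
    style: float          # adherence to voice-and-tone guidelines (0-1)

    def overall(self) -> float:
        # Hallucinations subtract from the score; the weights are arbitrary placeholders.
        raw = (0.5 * self.substance - 0.5 * self.hallucination
               + 0.25 * self.structure + 0.25 * self.style)
        return min(1.0, max(0.0, raw))

grade = AnswerGrade(substance=0.8, hallucination=0.1, structure=1.0, style=0.9)
print(f"overall quality: {grade.overall():.2f}")
```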
2/ Collect a sample set
A sample set is a collection of input-output pairs. A key determinant of a good sample set is the variety of its inputs. An input can be as simple as a user question but in the agentic world, input also means the data retrieved from tool calls.
We must draft a series of inputs that cover best cases (well-formatted questions and access to well-formatted data) and edge cases (poorly formatted questions and access to poorly formatted data). Let’s take the example of a conversational “Can I surf now?” assistant. The sample set should span both precise and fuzzy user requests, coupled with weather data ranging from perfect to marginal surf conditions, as well as incomplete weather data.
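Here is a minimal sketch of what a slice of that sample set could look like; the field names and values are hypothetical, chosen only to show the spread from best case to edge case.

```python
# Minimal sketch of a sample set for the "Can I surf now?" assistant.
# Field names and values are hypothetical; the point is the spread of cases.
surf_samples = [
    {   # best case: precise question, complete and clean weather data
        "user_input": "Can I surf at Ocean Beach this afternoon?",
        "tool_data": {"wave_height_m": 1.4, "wind_kph": 10, "tide": "rising"},
        "case_type": "best",
    },
    {   # edge case: fuzzy question, marginal conditions
        "user_input": "surf ok??",
        "tool_data": {"wave_height_m": 0.4, "wind_kph": 35, "tide": "high"},
        "case_type": "edge",
    },
    {   # edge case: precise question, incomplete weather data from the tool call
        "user_input": "Is the swell good at Hossegor tomorrow morning?",
        "tool_data": {"wave_height_m": None, "wind_kph": 12, "tide": None},
        "case_type": "edge",
    },
]
```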
Good news, you may not have to create your entire sample set. There may be publicly available sets for your industry/use case. For example, Harvey AI used LegalBench for a while, an eval set dedicated to legal questions. Side note: they have since moved away from LegalBench, and built a more complex sample set.
3/ Run, score, plot
We send all inputs to the model, log the responses, and grade them. If your grading system has 5 categories, you will get 5 distributions.
Who does the scoring? Humans, deterministic code, and LLMs.
Humans are the best at scoring nuance, but human grading is slower and more expensive. Simple deterministic code can also handle easy checks (e.g., how many correct sources the answer contains). You can also use an LLM to grade LLM answers, often referred to as the “LLM-as-a-Judge” approach: you task a grader LLM with scoring each answer via a grading prompt. Craft this grading prompt carefully and keep it stable, because scores produced by different grading prompts cannot be reliably compared.
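For illustration, here is a minimal LLM-as-a-Judge sketch scoring a single axis. It assumes an OpenAI-compatible chat completions client; the model name and grading prompt are placeholders, not a recommendation.

```python
# Minimal LLM-as-a-Judge sketch, assuming an OpenAI-compatible chat completions client.
# The grading prompt is a placeholder; keep it stable so scores remain comparable over time.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

GRADING_PROMPT = """You are grading an AI assistant's answer.
Question: {question}
Answer: {answer}
Rate the answer's substance from 0 (no true information) to 10 (all expected
true information is present). Reply with a single integer and nothing else."""

def grade_answer(question: str, answer: str) -> int:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        temperature=0,        # keep the judge as deterministic as possible
        messages=[{
            "role": "user",
            "content": GRADING_PROMPT.format(question=question, answer=answer),
        }],
    )
    return int(response.choices[0].message.content.strip())
```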
4/ Identify the fundamental determinants of the distribution, then test
As Andrew Ng puts it in his Interrupt talk, having the right intuition about what to change in your AI stack to get higher-quality outputs is very hard. Coming from Andrew Ng, that is concerning.
Let’s recap our work so far: we defined what a good answer looks like, we gathered samples, we ran the tests, and we now know how good our answers are. Now what?
I don’t see a clear method, but I see groups of actions.
- Additive tasks: We could add new tools to our agent, new data, more content and context.
- Subtractive tasks: We can build deterministic filters to prevent some content from landing in answers (a minimal example follows this list).
- Transformative tasks: We can adjust the feature prompt, system prompt, the temperature, the available tools and how they work (what they take in and spit out)…
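As one example of a subtractive task, here is a minimal sketch of a deterministic post-filter that strips content we never want to reach the user; the blocked patterns are illustrative placeholders.

```python
# Minimal sketch of a "subtractive" deterministic filter: strip content that must
# never land in an answer before it reaches the user. Patterns are placeholders.
import re

BLOCKED_PATTERNS = [
    re.compile(r"(?i)as an ai language model[^.]*\."),    # boilerplate disclaimers
    re.compile(r"\b\d{4}[- ]\d{4}[- ]\d{4}[- ]\d{4}\b"),  # anything shaped like a card number
]

def filter_answer(answer: str) -> str:
    for pattern in BLOCKED_PATTERNS:
        answer = pattern.sub("", answer)
    return answer.strip()

print(filter_answer("As an AI language model, I cannot surf. The waves look great today."))
# -> "The waves look great today."
```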
I think this is where experience matters. Teams need to hire people who’ve done it before.
Collaboration at the calibrate phase
Ad‑hoc vs role-aligned ownership
Because the field is so new, many teams have not yet formalized how these AI tasks fit into their current process. The engineer who wires up the LLM can often end up writing prompts and picking success criteria. Ad-hoc ownership is fine at the prototype stage, but not for production. And without a good process in place, as the roadmap heats up, important-but-not-urgent tasks can go out the window.
Now that we’ve broken down the new tasks introduced by AI calibration, it becomes easier to see which non-engineering functions can contribute meaningfully. The list of functions below is not exhaustive. It reflects roles I’ve had the chance to work closely with, and where I’ve seen clear value added. But every function should evaluate how its strengths might map to this new reality.
Product Designer
AI introduces unpredictability, and product designers help tame it. Their role shifts from simply structuring interfaces to managing UI variance and setting expectations about value.
Product designers define how AI outputs should appear — whether as charts, tables, summaries, or UI widgets — and under what conditions each format is appropriate. More than just picking layouts, they set the guardrails: bounding variability, designing fallback states, and ensuring that even when outputs shift, the interface remains coherent and intentional.
As products move into agentic AI, designers contribute to a new kind of ideation: How should tools appear? What does invocation look like? How do we show that the AI is taking action on behalf of the user?
Content Designer
Content designers ensure AI-generated responses align with the product’s tone, style, and communication principles. Their expertise in voice and clarity makes them natural owners of language quality.
In the input design phase, they help craft and iterate both user-facing prompts and system prompts to guide AI behavior toward desired tone and structure.
On the output side, content designers help define what a “well-written” answer looks like — applying their lens on clarity, readability, and actionability to build evaluation criteria. And when using LLMs as graders, they write consistent and precise scoring prompts that reflect UX and tone expectations, ensuring automated evaluations align with human judgment.
Product Manager
With non-AI features, PMs can rely on QA tests and launch checklists. With AI, that clarity disappears. Instead of asking “Does it work?”, they now ask “How often does it work — and is that enough?”
The team won’t be able to test every evaluation axis for every use case. Prioritization reaches a new level: which paths matter most? Which eval sets should be expanded? Where is scoring effort most valuable? Even with a noisy, incomplete quality signal, PMs will still need to make the ship call.
Subject‑Matter Experts
Probabilistic systems make things up and sound confident doing so. That’s why subject-matter experts (SMEs) are critical. In domains like finance and law for example, correctness is nuanced and only experts can draw the line.
SMEs are integral to defining the grading system, and can help increase the quality of the sample set. Also, in early calibration runs, SMEs act as human judges.
Risks worth calling out in probabilistic product development
Collaboration risk: ownership vacuum
- Risk: As more functions collaborate on AI, you can end up in a situation where no one owns AI quality across the stack, especially outside engineering.
- Impact: Critical calibration steps (prompt crafting, scoring criteria, sample diversity) get skipped, and your error rate increases without anyone noticing.
- Anti-pattern (how the risk manifests in practice): PMs wait for QA tests that can’t exist. Engineers ship prompts they wrote in a rush. Nobody flags model drift until customers complain.
Two technological risks
Prompt coupling
- Risk: AI products evolve like regular ones. Features get added over time. As prompts grow to support more cases, logic from different features becomes entangled, and a change for one scenario negatively impacts another.
- Impact: Model behavior becomes unpredictable. Debugging gets harder over time. UX quality slowly degrades.
- Anti-pattern: A designer adjusts a prompt to clean up tone in summaries. Unbeknownst to the team, it breaks a separate extractive task. Three weeks later, customer support logs spike — and no one links it to the prompt change.
User and model drift
- Risk: User behavior starts to diverge from the training data set, or LLM providers update their model weights. The setup that worked last month is broken today (one way to detect this is sketched after this list).
- Impact: Gradual UX regression, often invisible. Features degrade silently. Calibration assumptions expire.
- Anti-pattern: A legal AI tool was tuned on last year’s law corpus. A model update drops key definitions, but the eval set stays static. The team doesn’t catch it — until a client gets a hallucinated court precedent.
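One way to catch this kind of silent drift is to re-run a fixed eval set on a schedule and compare the new score distribution against a stored baseline. Here is a minimal sketch using a two-sample Kolmogorov-Smirnov test; the scores and the alert threshold are illustrative.

```python
# Minimal drift-check sketch: re-run a fixed eval set on a schedule and compare
# the new score distribution against a stored baseline. Values are illustrative.
from scipy.stats import ks_2samp

baseline_scores = [8, 7, 9, 8, 6, 9, 7, 8, 8, 7]  # scores from last month's calibration run
current_scores  = [6, 5, 7, 6, 4, 7, 5, 6, 6, 5]  # scores from today's scheduled re-run

statistic, p_value = ks_2samp(baseline_scores, current_scores)
if p_value < 0.05:  # arbitrary alert threshold
    print(f"Possible drift detected (KS statistic={statistic:.2f}, p-value={p_value:.3f})")
```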
Design risk: UX black hole
- Risk: With current AI monitoring tools, it is easier to quantitatively measure the engineering side of the product than to measure real user value. In this context, designers and PMs may progressively know less and less about how the product is doing.
- Impact: You see token usage but not satisfaction. You chase cheap completions over good completions.
- Anti-pattern: PMs launch a feature with strong model metrics. Feedback from user interviews says it's confusing — but no one checked. Meanwhile, engineering celebrates a 40% latency reduction.
Conclusion
You can’t ship AI like you ship buttons. The Build–Verify–Measure loop breaks when outputs vary and evolve. Yet too many teams treat stochastic features as if they’re deterministic — until they ship something broken…
The Calibrate phase isn’t just a new step, it’s a new way of thinking about product quality, moving away from pass/fail tests and toward output distribution assessment. This means rethinking roles as well. Content and Product Designers bound variance, PMs prioritize uncertainty, and SMEs become the backbone of quality control.
If your team doesn’t own calibration, it doesn’t own quality. And if you don’t know how well your AI works, neither do your users.