“Three Judges Walk Into a Bar: How LLM-Evals Stop Your AI from Going Rogue”
In an age where generative AI can go from genius to gibberish overnight, product managers face a new challenge: taming unpredictable features. LLM-Evals offer a practical path to reliable AI — minus the drama.

In 2019, the cinema world was buzzing about “Captain Marvel.” Even before the movie hit theaters, the review site Rotten Tomatoes became a battleground between professional critics and anonymous users. A flood of negative reviews appeared, most written by people who hadn’t even seen the film. Marvel Studios was stunned: a blockbuster was under a coordinated review-bombing attack by an audience that had never watched it.
Faced with this crisis, Rotten Tomatoes realized its measurement system was broken and shipped a major product change: it drew a sharper line between critic scores and audience scores and introduced new metrics for evaluating films. Social media eventually moved on to the next scandal, but one critical lesson stayed: poor measurement, or worse, inaccurate measurement, can paint a completely distorted picture and do real damage.
Now, imagine launching your brilliant new product, only to have it deliver inaccurate responses, answer in the wrong language, or repeat generic replies like a broken record. The excitement of going live turns into deep anxiety about having shipped a useless feature. There’s good reason to worry: we’re in the middle of an AI tsunami, with language models flooding us with all kinds of content, some brilliant and others… not so much.
So, how can we ensure our AI model behaves consistently and effectively in production over time?
What are LLM Evaluations (LLM-Evals)?
In language-model-based products, user value depends directly on the quality of the model’s output. Unlike with traditional machine learning models, an LLM-driven feature often has no single “correct answer”: different outputs can be equally excellent. Subject matter experts (SMEs) typically assess whether responses are contextually useful, factually accurate, and appropriate in tone and style.
So we need a flexible evaluation mechanism capable of determining if a model’s output is sufficiently good — whether a generated summary is truly accurate, or an assistant’s answer is not misleading. However, evaluating the output isn’t straightforward. Its quality depends on various parameters, such as:
- Clarity and fluency
- Truthfulness and factual accuracy
- Appropriate tone toward the user (e.g., courteous and service-oriented)
- Potential toxicity or unwanted bias
LLM-Evals are a way to automate quality assessment of language-model outputs without needing countless subject-matter experts to manually review every response.
In this post, I’ll focus on a popular evaluation method called “LLM-as-a-Judge,” in which the generated text is sent to another language model for evaluation. This “judge model” enables scalable yet nuanced content assessment.
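To make the idea concrete before we zoom out, here is a minimal LLM-as-a-Judge sketch, assuming the OpenAI Python SDK; the model name, criteria, and prompt wording are illustrative choices, not a recommended setup.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """You are a strict quality judge for a customer-support chatbot.
Rate the RESPONSE to the USER QUESTION on factual accuracy and tone.
Answer PASS or FAIL, followed by a one-sentence reason.

USER QUESTION: {question}
RESPONSE: {response}"""

def judge(question: str, response: str) -> str:
    """Send a generated answer to a separate 'judge' model and return its verdict."""
    result = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative judge model; swap in whatever your stack uses
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(question=question, response=response)}],
        temperature=0,  # keep the judge as deterministic as possible
    )
    return result.choices[0].message.content

print(judge("What are your opening hours?", "We're open whenever you feel like dropping by!"))
```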
Why Are LLM-Evals Gaining Attention?
In previous years, competitive advantage lay in unique code. Today, the ability to generate code is widespread; “everything is programmable,” as HubSpot founder Dharmesh Shah put it. Even proprietary organizational data, once a strong defensive moat, is losing exclusivity as high-quality synthetic data becomes increasingly viable for training models.
So how do investors identify promising startups when everything seems possible? Accelerators like Y Combinator, home to companies like Airbnb and Stripe, now evaluate AI startups based on their AI evaluation infrastructure. Robust evaluation frameworks indicate genuine differentiation in an era where overnight competitors can easily emerge. Ultimately, the ability to measure and improve model output quality distinguishes successful products from mediocre ones.
Whose Responsibility Is It Anyway?
In the past, feature quality was automatically associated with developers and QA teams — after all, it was about code. Today, nearly 100% of the value of LLM-based features depends on textual output quality. Consequently, defining acceptance criteria and model quality standards — and determining what users consider “good” — is now the product manager’s responsibility.
Product leaders from companies like OpenAI and Anthropic highlight AI evaluations as a key aspect of modern product management roles. Accountability for quality is shifting from developers to product managers. In this new era, where competitors use similar models, real value emerges from solutions tailored to specific use-cases. Product managers, deeply familiar with personas, business processes, and real-world needs, can craft precise evaluation frameworks to ensure genuinely valuable outputs.
How Does It Work in Practice?
The core tool is the Golden Dataset, a benchmark used to evaluate model outputs. It includes scenarios simulating user inputs, the model’s outputs, and quality scores. For instance, chatbot scenarios might involve user queries, each paired with defined responses graded according to preset criteria. The Golden Dataset serves as our “ground truth,” guiding the judge model in distinguishing between high and low-quality outputs.
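A Golden Dataset doesn’t require special tooling to get started; a plain list of labeled examples is enough. Below is a minimal sketch of one possible shape, where the field names, criteria, and labels are illustrative assumptions rather than a standard schema.

```python
# A tiny golden dataset for a support chatbot: each entry pairs a realistic
# user input with a reference answer and a quality label agreed on by SMEs.
GOLDEN_DATASET = [
    {
        "input": "What are your opening hours?",
        "reference": "We are open Sunday to Thursday, 9:00-18:00.",
        "criteria": {"factual_accuracy": True, "tone": "courteous"},
        "label": "pass",
    },
    {
        "input": "Can I cancel my subscription by email?",
        "reference": "Yes, send a cancellation request to support and it takes effect within 48 hours.",
        "criteria": {"factual_accuracy": True, "tone": "courteous"},
        "label": "pass",
    },
]
```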
Once criteria are clear, the next step is creating an Evaluation Prompt — the rules guiding the judge model’s assessment. Depending on complexity, we might use detailed prompts or multiple prompts. Evaluation types include:
- Comparative Judge: Compares two outputs, ideal for A/B testing or evaluating different models.
- Generic Criteria Judge: Assesses objective parameters such as clarity, tone, verbosity and other generic attributes.
- Contextual Judge: Checks responses against “absolute truths” that are tailored to a specific context (e.g., precise opening hours).
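These three judge types map naturally onto three prompt templates. The sketch below shows one possible phrasing for each; the wording and scales are illustrative assumptions, not a fixed standard.

```python
# Comparative Judge: useful for A/B tests between two prompts or models.
COMPARATIVE_PROMPT = """Which response better answers the question? Reply "A" or "B".
QUESTION: {question}
RESPONSE A: {a}
RESPONSE B: {b}"""

# Generic Criteria Judge: scores objective attributes on a simple scale.
GENERIC_PROMPT = """Rate the response for clarity and courteous tone, each from 1 to 5.
Reply as JSON: {{"clarity": <int>, "tone": <int>}}
RESPONSE: {response}"""

# Contextual Judge: checks the response against an "absolute truth" for this context.
CONTEXTUAL_PROMPT = """The correct opening hours are: {ground_truth}
Does the response state them correctly? Reply PASS or FAIL.
RESPONSE: {response}"""

# Each template is filled in and sent to the judge model, just like the JUDGE_PROMPT
# in the earlier sketch, for example:
# CONTEXTUAL_PROMPT.format(ground_truth="Sunday-Thursday, 9:00-18:00",
#                          response="We are open 24/7, always!")
```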
Isn’t It Problematic for Models to Judge Themselves?
Evaluation tasks differ significantly from generation tasks. Generation involves complex creativity amidst uncertainties, whereas evaluation is a focused classification based on clear rules.
Still, letting a model judge itself might seem questionable, so the common practice is to use a different model as the judge. This approach isn’t perfect either: studies show that judge models carry biases of their own, such as favoring outputs from their own model family or preferring longer, more verbose answers. An alternative, less common method called “LLM-as-a-Jury” has multiple smaller models evaluate the same output and averages their verdicts, reducing both bias and cost significantly.
Interestingly, strong judge models agree with human evaluators about 80% of the time, roughly the same level of agreement human experts reach with one another.
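A jury can be as simple as asking several smaller judge models the same pass/fail question and averaging their votes. A minimal sketch, assuming a hypothetical query_model() helper and made-up model names:

```python
from statistics import mean

def query_model(model_name: str, prompt: str) -> str:
    """Hypothetical helper: call the named model with a prompt and return its raw reply."""
    raise NotImplementedError("wire this to the model APIs your team uses")

JURY = ["small-judge-a", "small-judge-b", "small-judge-c"]  # illustrative model names

def jury_pass_rate(prompt: str) -> float:
    """Ask each juror the same PASS/FAIL question and return the share of PASS votes."""
    votes = [
        1.0 if query_model(name, prompt).strip().upper().startswith("PASS") else 0.0
        for name in JURY
    ]
    return mean(votes)  # e.g., 0.67 means two of three jurors passed the response
```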
So, What Should You Do Tomorrow?
- Define Clear Requirements: Set explicit criteria for acceptable responses (accuracy, tone, clarity).
- Build an Initial Golden Dataset: Start small, with a few dozen to a few hundred realistic examples.
- Write Evaluation Prompts: Clearly instruct your judge model on how to assess each criterion.
- Run, Compare, Iterate: Continuously refine your main and evaluation prompts based on results.
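Putting the pieces together, an evaluation run is just a loop over the Golden Dataset: generate an output for each input, ask the judge for a verdict, and aggregate. A minimal sketch, reusing the illustrative judge() helper and GOLDEN_DATASET from earlier; generate_v1 and generate_v2 in the usage comment are hypothetical prompt versions:

```python
def evaluate(generate, judge, golden_dataset) -> float:
    """Run the product prompt over the golden dataset and return the judge's pass rate."""
    passed = 0
    for example in golden_dataset:
        output = generate(example["input"])        # your feature's main prompt/model
        verdict = judge(example["input"], output)  # LLM-as-a-Judge verdict, e.g. "PASS: ..."
        if verdict.strip().upper().startswith("PASS"):
            passed += 1
    return passed / len(golden_dataset)

# Compare two prompt versions before shipping a change:
# baseline  = evaluate(generate_v1, judge, GOLDEN_DATASET)
# candidate = evaluate(generate_v2, judge, GOLDEN_DATASET)
# Ship only if the candidate meets or beats the baseline by whatever margin you define.
```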
Our goal as product people isn’t just an impressive demo, but consistent quality over the long term, even when switching models. Ideally, use a separate judge for each criterion so that each issue, such as clarity, is evaluated precisely; defining clear, binary or simple-scale verdicts minimizes judgment errors and inconsistencies.
A tip: Always consider whether multiple rational human evaluators would reach similar conclusions based on your evaluation rules. If the answer is “yes,” then your rules are robust enough.
We’re in an “anything is possible” era, thanks to AI — but such flexibility also risks undesirable outcomes without proper evaluation. Build precise Golden Datasets, craft intelligent evaluation prompts, and regularly employ robust evaluation frameworks. This rigorous approach differentiates cute demo products from market-conquering innovations.
In the next post, we’ll elevate AI evaluations further by exploring “judges” for AI Agents, where the model must navigate complexity effectively. Until then, may the evals be ever in your favor!