Stop Grading by "Vibes"
A practical guide to LLM evaluation
If you're shipping an AI feature, you don't want to rely on manual "looks good" reviews. Quality needs to be measurable. But not every project needs every metric. We help you design a focused, production-friendly evaluation strategy that matches your specific app, budget, and business goals.
Metrics should work for you
Reading a few conversations and saying "this seems fine" is a normal starting point. But swinging to the other extreme and adopting generic, off-the-shelf metrics can be just as misleading: your solution might look perfect on paper while failing to meet your users' actual needs.
Our View:
Don't just copy a checklist. Evaluation is about finding the right balance between precision, cost, and speed. We focus on designing the metrics that tell you the most for the least implementation effort.
Tailored to your scenario
Pick the right battle
Practical over Perfect
The secret weapon: LLM-as-a-Judge
In traditional software, you can write deterministic tests. With LLM outputs, you usually need a rubric instead of an exact string match. The solution: have a second model act as a judge, grading your application's output against that rubric.
Interaction → Rubric → Verdict
Why judges are reliable: they do one simple thing
Your application model handles complex workflows. Your judge model answers one narrow grading question. This is why judges are consistent and cost-effective.
Application model
Complex · Multi-step · Frontier model (GPT-4, Claude 3.5)
Example task:
"Help me optimize our team's sprint planning process"
Many responsibilities → more ways to fail
Judge model
Simple · Single task · Smaller model (GPT-4o-mini, Haiku)
Example grading task:
"Is this recommendation supported by the retrieved documents?"
Narrow scope → higher consistency + lower cost
💡 Cost-effectiveness:
Because judges do one simple task, you can use smaller, faster, cheaper models (GPT-4o-mini, Claude Haiku). You don't need frontier-class reasoning for a pass/fail check. This makes running thousands of evaluations practical and affordable.
Use structured outputs
Calibrate with examples
Audit periodically
Sample judge metrics you can run
Judge scales are typically one of: boolean, 1-3, or 1-5. Keep them simple. Use libraries like Instructor or Zod to enforce the judge's structured output; don't paste JSON schemas into prompts.
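To make that concrete, a typed verdict for the 1-3 correctness metric below could look like this. It's a minimal sketch using Pydantic; the class and field names are our own choice, not a fixed convention:

```python
from typing import Literal

from pydantic import BaseModel, Field


class CorrectnessVerdict(BaseModel):
    """Typed verdict the judge must return; the schema lives in code, not in the prompt."""

    score: Literal[1, 2, 3] = Field(
        description="1 = incorrect, 2 = partially correct, 3 = correct"
    )
    reason: str = Field(description="One or two sentences justifying the score")
```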
Correctness (scale 1-3)
Compare the response to a reference answer. This is the gold standard for factual accuracy. Score: 1=incorrect, 2=partial, 3=correct.
You are a strict evaluator.
Compare the AI response to the reference answer.
Question: {question}
Reference Answer: {reference}
AI Response: {ai_response}
Score on a 1-3 scale:
1 = Incorrect (key facts conflict with reference)
2 = Partially correct (mostly right but missing/unclear details)
3 = Correct (all key facts match the reference)
Provide your score and reason.
Best practice: Keep judge prompts deterministic in spirit. Narrow scope. Explicit scale definitions. Use type-safe libraries (Instructor, Zod, Pydantic) to handle structured output; your prompt should focus on grading logic, not JSON formatting.
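Putting the rubric and the verdict model together, a judge call might look like the sketch below. It reuses the `CorrectnessVerdict` model from above and assumes the Instructor library's OpenAI integration with an `OPENAI_API_KEY` in the environment; swap in the client and judge model of your choice.

```python
import instructor
from openai import OpenAI

# Wrap the OpenAI client so responses are parsed straight into CorrectnessVerdict.
client = instructor.from_openai(OpenAI())

JUDGE_PROMPT = """You are a strict evaluator.
Compare the AI response to the reference answer.

Question: {question}
Reference Answer: {reference}
AI Response: {ai_response}

Score on a 1-3 scale:
1 = Incorrect (key facts conflict with reference)
2 = Partially correct (mostly right but missing/unclear details)
3 = Correct (all key facts match the reference)
"""


def judge_correctness(question: str, reference: str, ai_response: str) -> CorrectnessVerdict:
    """Grade one interaction; a small, cheap model is enough for this narrow task."""
    return client.chat.completions.create(
        model="gpt-4o-mini",
        response_model=CorrectnessVerdict,  # Instructor validates the output against the model
        temperature=0,  # keep the judge as repeatable as possible
        messages=[
            {
                "role": "user",
                "content": JUDGE_PROMPT.format(
                    question=question, reference=reference, ai_response=ai_response
                ),
            }
        ],
    )
```

Note that "provide your score and reason" no longer needs to live in the prompt; the response model carries that requirement.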
The "big three" metrics for RAG (the triad)
If you're building Retrieval-Augmented Generation, these three metrics cover most real-world failures. A faithfulness judge sketch follows the three metrics below.
Faithfulness (don't lie)
Question: Is the answer derived purely from the retrieved context?
High: "I can't find that info in the documents."
Low: Fabricates a phone number that isn't present.
Context relevance (reduce noise)
Question: Did you retrieve the right documents to answer the question?
High: "API Keys" → right doc retrieved.
Low: "API Keys" → office kitchen rules retrieved.
Answer relevance (be helpful)
Question: Did the answer address the user's intent?
High: "Reset password" → go to Settings → Security.
Low: Explains what passwords are, but not how to reset.
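Faithfulness fits the same judge pattern as correctness, just with the retrieved context as an extra input and a boolean verdict. A minimal sketch (the names are ours; run the formatted prompt through the same Instructor client shown earlier):

```python
from pydantic import BaseModel, Field


class FaithfulnessVerdict(BaseModel):
    """Boolean verdict: is every claim in the answer grounded in the retrieved context?"""

    supported: bool = Field(description="True only if every claim is backed by the context")
    unsupported_claims: list[str] = Field(
        default_factory=list, description="Claims that do not appear in the context"
    )


FAITHFULNESS_PROMPT = """You are a strict evaluator.
Check whether the answer relies only on the retrieved context, without inventing facts.

Retrieved context:
{context}

Answer:
{answer}
"""
```

Track the share of interactions where `supported` is true; the same boolean-judge shape works for context relevance and answer relevance with their own prompts.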
Functional metrics (hard requirements)
Not everything needs a judge model. Some requirements are pure code: pass/fail, as in the sketch after this list.
JSON validity
Conciseness / verbosity
Latency (TTFT + total)
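These checks need no LLM at all; plain functions are enough. A sketch in Python (the thresholds are illustrative, not recommendations, and time-to-first-token needs a streaming client that isn't shown here):

```python
import json
import time


def json_is_valid(output: str) -> bool:
    """Pass if the model's raw output parses as JSON."""
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False


def is_concise(output: str, max_words: int = 200) -> bool:
    """Pass if the response stays within an agreed word budget."""
    return len(output.split()) <= max_words


def timed_call(fn, *args, **kwargs):
    """Return the result plus total latency in seconds for one end-to-end call."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start
```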
Level up: RAG-specific metrics (RAGAS-style)
Frameworks like RAGAS popularize common RAG metrics. The names vary between libraries, but the underlying questions are universal; a usage sketch follows the list below.
Faithfulness
Answer relevance
Context precision
Context recall
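For reference, running these four metrics with the ragas library looks roughly like the sketch below. It assumes the 0.1-era API (an evaluation dataset with `question`, `answer`, `contexts`, and `ground_truth` columns, plus an `OPENAI_API_KEY` for the judge calls); column names and entry points have shifted between versions, so check your installed release's docs:

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    answer_relevancy,
    context_precision,
    context_recall,
    faithfulness,
)

# One row per evaluated interaction; "contexts" is the list of retrieved chunks.
data = Dataset.from_dict({
    "question": ["How do I rotate an API key?"],
    "answer": ["Go to Settings → Security → API Keys and click Rotate."],
    "contexts": [["API keys can be rotated from Settings → Security → API Keys."]],
    "ground_truth": ["Rotate keys under Settings → Security → API Keys."],
})

result = evaluate(
    data,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(result)  # per-metric scores averaged over the dataset
```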
Avoid over-optimizing: Don't chase 100% just because a metric exists. If 70% context precision is good enough for your volume, and the final answers still meet quality standards, pushing for more may just burn budget and engineering time for a saving of $0.001 per query.
Designing your evaluation roadmap
In an ideal world, you track every granular detail. In reality, especially when budget is limited, you should pick the small set of metrics that tell you the most about overall performance.
Phase 1: Automate E2E Scenarios
Focus on overall solution quality. These tests are more expensive (they run the whole flow) but give you the confidence to ship; a minimal sketch follows Phase 2 below.
Phase 2: Add Granular Metrics
As budget allows, build precise metrics to find problematic areas. This makes it easier to apply targeted fixes without breaking other parts of the app.
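A Phase 1 scenario test can be as small as a pytest case that runs the real pipeline and applies one judge to the result. In the sketch below, `run_assistant` is a hypothetical stand-in for your production entry point and `judge_correctness` is the judge helper sketched earlier:

```python
import pytest

# Hypothetical imports: replace with your real pipeline entry point and the
# judge helper defined in the correctness example above.
from my_app import run_assistant
from my_evals.judges import judge_correctness

SCENARIOS = [
    {
        "question": "How do I reset my password?",
        "reference": "Go to Settings → Security → Reset password.",
    },
]


@pytest.mark.parametrize("scenario", SCENARIOS)
def test_e2e_scenario(scenario):
    # Run the full production flow, not a mocked component.
    answer = run_assistant(scenario["question"])

    # Grade the end result with the narrow judge; 2 ("partially correct") is the ship bar here.
    verdict = judge_correctness(scenario["question"], scenario["reference"], answer)
    assert verdict.score >= 2, verdict.reason
```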
How our Evaluation Builder helps
We guide you through an interactive process to define metrics tailored to your specific needs, avoiding generic "checklist" bloat.
- Discovery: We ask tailored questions to surface your specific scenario goals.
- Decompose: Break vague "vibes" into concrete, testable metrics.
- Weigh Priority: Decide what matters most for your budget and user experience.
- Interactive Rubrics: Refine grading criteria into judge-ready instructions.
- Implementation Ready: Export a strategy you can actually run in production.