Stop Grading by "Vibes"
A practical guide to LLM evaluation
If you're shipping an AI feature, you don't want to rely on manual "looks good" reviews. Quality needs to be measurable. But not every project needs every metric. We help you design a focused, production-friendly evaluation strategy that matches your specific app, budget, and business goals.
Metrics should work for you
Reading a few conversations and saying "this seems fine" is a normal starting point. But swinging to the other extreme and adopting generic, off-the-shelf metrics can be just as misleading: your solution might look perfect on paper while failing to meet your users' actual needs.
Our View:
Don't just copy a checklist. Evaluation is about finding the right balance between precision, cost, and speed. We focus on designing the metrics that tell you the most for the least implementation effort.
Tailored to your scenario
Pick the right battle
Practical over Perfect
The secret weapon: LLM-as-a-Judge
In traditional software, you can write deterministic tests. With LLM outputs, you usually need a rubric instead of an exact string match. The solution: have a second model act as a judge, grading your application's output against that rubric.
Interaction → Rubric → Verdict
Why judges are reliable: they do one simple thing
Your application model handles complex workflows. Your judge model answers one narrow grading question. This is why judges are consistent and cost-effective.
Application model
Complex · Multi-step · Frontier model (GPT-4, Claude 3.5)
Example task:
"Help me optimize our team's sprint planning process"
Many responsibilities → more ways to fail
Judge model
Simple · Single task · Smaller model (GPT-4o-mini, Haiku)
Example grading task:
"Is this recommendation supported by the retrieved documents?"
Narrow scope → higher consistency + lower cost
💡 Cost-effectiveness:
Because judges do one simple task, you can use smaller, faster, cheaper models (GPT-4o-mini, Claude Haiku). You don't need frontier-class reasoning for a pass/fail check. This makes running thousands of evaluations practical and affordable.
Use structured outputs
Calibrate with examples
Audit periodically
Sample judge metrics you can run
Judge scales are typically one of: boolean, 1-3, or 1-5. Keep them simple. Use libraries like Instructor or Zod to enforce the judge's structured output; don't paste JSON schemas into prompts.
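To make that concrete, a typed verdict for the 1-3 correctness metric below could look like this. It's a minimal sketch using Pydantic; the class and field names are our own choice, not a fixed convention:

```python
from typing import Literal

from pydantic import BaseModel, Field


class CorrectnessVerdict(BaseModel):
    """Typed verdict the judge must return; the schema lives in code, not in the prompt."""

    score: Literal[1, 2, 3] = Field(
        description="1 = incorrect, 2 = partially correct, 3 = correct"
    )
    reason: str = Field(description="One or two sentences justifying the score")
```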
Correctness (scale 1-3)
Compare the response to a reference answer. This is the gold standard for factual accuracy. Score: 1=incorrect, 2=partial, 3=correct.
You are a strict evaluator.
Compare the AI response to the reference answer.
Question: {question}
Reference Answer: {reference}
AI Response: {ai_response}
Score on a 1-3 scale:
1 = Incorrect (key facts conflict with reference)
2 = Partially correct (mostly right but missing/unclear details)
3 = Correct (all key facts match the reference)
Provide your score and reason.
Best practice: Keep judge prompts deterministic in spirit. Narrow scope. Explicit scale definitions. Use type-safe libraries (Instructor, Zod, Pydantic) to handle structured output; your prompt should focus on grading logic, not JSON formatting.
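Putting the rubric and the verdict model together, a judge call might look like the sketch below. It reuses the `CorrectnessVerdict` model from above and assumes the Instructor library's OpenAI integration with an `OPENAI_API_KEY` in the environment; swap in the client and judge model of your choice.

```python
import instructor
from openai import OpenAI

# Wrap the OpenAI client so responses are parsed straight into CorrectnessVerdict.
client = instructor.from_openai(OpenAI())

JUDGE_PROMPT = """You are a strict evaluator.
Compare the AI response to the reference answer.

Question: {question}
Reference Answer: {reference}
AI Response: {ai_response}

Score on a 1-3 scale:
1 = Incorrect (key facts conflict with reference)
2 = Partially correct (mostly right but missing/unclear details)
3 = Correct (all key facts match the reference)
"""


def judge_correctness(question: str, reference: str, ai_response: str) -> CorrectnessVerdict:
    """Grade one interaction; a small, cheap model is enough for this narrow task."""
    return client.chat.completions.create(
        model="gpt-4o-mini",
        response_model=CorrectnessVerdict,  # Instructor validates the output against the model
        temperature=0,  # keep the judge as repeatable as possible
        messages=[
            {
                "role": "user",
                "content": JUDGE_PROMPT.format(
                    question=question, reference=reference, ai_response=ai_response
                ),
            }
        ],
    )
```

Note that "provide your score and reason" no longer needs to live in the prompt; the response model carries that requirement.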
The "big three" metrics for RAG (the triad)
If you're building Retrieval-Augmented Generation, these three metrics cover most real-world failures. A faithfulness judge sketch follows the three metrics below.
Faithfulness (don't lie)
Question: Is the answer derived purely from the retrieved context?
High: "I can't find that info in the documents."
Low: Fabricates a phone number that isn't present.
Context relevance (reduce noise)
Question: Did you retrieve the right documents to answer the question?
High: "API Keys" → right doc retrieved.
Low: "API Keys" → office kitchen rules retrieved.
Answer relevance (be helpful)
Question: Did the answer address the user's intent?
High: "Reset password" → go to Settings → Security.
Low: Explains what passwords are, but not how to reset.
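Faithfulness fits the same judge pattern as correctness, just with the retrieved context as an extra input and a boolean verdict. A minimal sketch (the names are ours; run the formatted prompt through the same Instructor client shown earlier):

```python
from pydantic import BaseModel, Field


class FaithfulnessVerdict(BaseModel):
    """Boolean verdict: is every claim in the answer grounded in the retrieved context?"""

    supported: bool = Field(description="True only if every claim is backed by the context")
    unsupported_claims: list[str] = Field(
        default_factory=list, description="Claims that do not appear in the context"
    )


FAITHFULNESS_PROMPT = """You are a strict evaluator.
Check whether the answer relies only on the retrieved context, without inventing facts.

Retrieved context:
{context}

Answer:
{answer}
"""
```

Track the share of interactions where `supported` is true; the same boolean-judge shape works for context relevance and answer relevance with their own prompts.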
Functional metrics (hard requirements)
Not everything needs a judge model. Some requirements are pure code: pass/fail, as in the sketch after this list.
JSON validity
Conciseness / verbosity
Latency (TTFT + total)
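These checks need no LLM at all; plain functions are enough. A sketch in Python (the thresholds are illustrative, not recommendations, and time-to-first-token needs a streaming client that isn't shown here):

```python
import json
import time


def json_is_valid(output: str) -> bool:
    """Pass if the model's raw output parses as JSON."""
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False


def is_concise(output: str, max_words: int = 200) -> bool:
    """Pass if the response stays within an agreed word budget."""
    return len(output.split()) <= max_words


def timed_call(fn, *args, **kwargs):
    """Return the result plus total latency in seconds for one end-to-end call."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start
```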
Level up: RAG-specific metrics (RAGAS-style)
Frameworks like RAGAS popularize common RAG metrics. The names vary between libraries, but the underlying questions are universal; a usage sketch follows the list below.
Faithfulness
Answer relevance
Context precision
Context recall
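For reference, running these four metrics with the ragas library looks roughly like the sketch below. It assumes the 0.1-era API (an evaluation dataset with `question`, `answer`, `contexts`, and `ground_truth` columns, plus an `OPENAI_API_KEY` for the judge calls); column names and entry points have shifted between versions, so check your installed release's docs:

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    answer_relevancy,
    context_precision,
    context_recall,
    faithfulness,
)

# One row per evaluated interaction; "contexts" is the list of retrieved chunks.
data = Dataset.from_dict({
    "question": ["How do I rotate an API key?"],
    "answer": ["Go to Settings → Security → API Keys and click Rotate."],
    "contexts": [["API keys can be rotated from Settings → Security → API Keys."]],
    "ground_truth": ["Rotate keys under Settings → Security → API Keys."],
})

result = evaluate(
    data,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(result)  # per-metric scores averaged over the dataset
```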
Avoid over-optimizing: Don't chase 100% just because a metric exists. If 70% context precision is good enough for your volume, and the final answers still meet quality standards, pushing for more may just burn budget and engineering time for a saving of $0.001 per query.
Designing your evaluation roadmap
In an ideal world, you track every granular detail. In reality, especially when budget is limited, you should pick the small set of metrics that tell you the most about overall performance.
Phase 1: Automate E2E Scenarios
Focus on overall solution quality. These tests are more expensive (they run the whole flow) but give you the confidence to ship; a minimal sketch follows Phase 2 below.
Phase 2: Add Granular Metrics
As budget allows, build precise metrics to find problematic areas. This makes it easier to apply targeted fixes without breaking other parts of the app.
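A Phase 1 scenario test can be as small as a pytest case that runs the real pipeline and applies one judge to the result. In the sketch below, `run_assistant` is a hypothetical stand-in for your production entry point and `judge_correctness` is the judge helper sketched earlier:

```python
import pytest

# Hypothetical imports: replace with your real pipeline entry point and the
# judge helper defined in the correctness example above.
from my_app import run_assistant
from my_evals.judges import judge_correctness

SCENARIOS = [
    {
        "question": "How do I reset my password?",
        "reference": "Go to Settings → Security → Reset password.",
    },
]


@pytest.mark.parametrize("scenario", SCENARIOS)
def test_e2e_scenario(scenario):
    # Run the full production flow, not a mocked component.
    answer = run_assistant(scenario["question"])

    # Grade the end result with the narrow judge; 2 ("partially correct") is the ship bar here.
    verdict = judge_correctness(scenario["question"], scenario["reference"], answer)
    assert verdict.score >= 2, verdict.reason
```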
How our Evaluation Builder helps
We guide you through an interactive process to define metrics tailored to your specific needs, avoiding generic "checklist" bloat.
- Discovery: We ask tailored questions to surface your specific scenario goals.
- Decompose: Break vague "vibes" into concrete, testable metrics.
- Weigh Priority: Decide what matters most for your budget and user experience.
- Interactive Rubrics: Refine grading criteria into judge-ready instructions.
- Implementation Ready: Export a strategy you can actually run in production.