How to Evaluate LLM Vendors Without Getting Burned
Every AI vendor promises 95% accuracy. Few can prove it. This guide teaches you how to define your requirements precisely enough that vendors can't hide behind vague proposals—and you get estimates you can actually hold them to.
First step: before evaluating vendors, make sure the idea itself is sound. Validate your AI project idea to surface risks and requirements upfront.
Why Clear Requirements Matter
When requirements are vague, vendors have to make assumptions to write a proposal. Those assumptions create misalignment that only surfaces months later. The more precisely you define what success looks like upfront, the more accurate your estimates and the smoother your project.
| Phase | Vague Requirements | Defined Metrics |
| --- | --- | --- |
| Proposals | Comparing slide decks and promises | Comparing responses to your specific criteria |
| Estimates | Vague timelines: '3-6 months' | Estimates tied to each deliverable |
| Accuracy claims | Everyone claims 95% | Accuracy defined: what counts, how it's tested |
| Development | No way to measure progress | Weekly scores against your test suite |
| Deployment | Demo looks great, production fails | Contract tied to passing acceptance criteria |
Define Your Metrics Before Talking to Vendors
The Evaluation Builder isn't about "reviewing" vendors—it's about simplifying the entire conversation.
When you show up with decomposed metrics (not just "we want good answers"), vendors can give you specific estimates: "Tone matching will take 2 weeks. Accurate citation extraction is harder—4 weeks minimum, and here's why."
Without this, you get black-box proposals where neither you nor the vendor really knows what "done" looks like until you're three months in and burning budget.
What you'll have after using the Evaluation Builder:
- Specific metrics: Accuracy, Tone, Format, Safety—each defined
- Weighted scoring that reflects your actual priorities
- Test cases vendors must pass before you pay
- Language to put in your contract's acceptance criteria
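To make the weighted-scoring idea concrete, here's a minimal sketch of how decomposed metrics might roll up into one number. The metric names and weights are illustrative assumptions, not Evaluation Builder output:

```python
# Minimal sketch: combine per-metric scores (each 0.0-1.0) into one
# weighted number. Metric names and weights are illustrative only.
WEIGHTS = {"accuracy": 0.5, "tone": 0.2, "format": 0.15, "safety": 0.15}

def weighted_score(scores: dict[str, float]) -> float:
    assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9, "weights must sum to 1"
    return sum(WEIGHTS[m] * scores[m] for m in WEIGHTS)

# One vendor response, graded per metric rather than 'looks good':
print(weighted_score({"accuracy": 0.92, "tone": 0.7, "format": 1.0, "safety": 1.0}))
# -> 0.90
```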
Proposals That Should Make You Pause
These aren't hypotheticals—these patterns show up repeatedly in vendor proposals. Each one adds cost, complexity, and risk without adding value.
Misusing Agents for Mandatory Steps
What They Propose:
"We'll build an AI agent with a search_knowledge_base tool. The agent will decide when to use it based on the user's question."
Why it's a problem: Agents are great for choosing between optional actions. But a support bot always needs knowledge base context; that's not a decision, it's a requirement. Making the AI 'decide' to do something mandatory adds latency and creates a failure mode where the model simply skips the retrieval it was supposed to run.
What to require instead: Use deterministic code for steps that must always happen (like RAG retrieval). Reserve agentic tool-calling for genuinely optional actions where the AI's judgment adds value.
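A minimal sketch of the distinction, with the retrieval and generation layers stubbed out (both are hypothetical placeholders for your vector store and LLM calls):

```python
# Hedged sketch: retrieve_docs and generate_answer are stubs standing
# in for a real vector store and LLM API.

def retrieve_docs(question: str) -> list[str]:
    return ["(retrieved knowledge-base passage)"]  # stub: vector search

def generate_answer(question: str, context: list[str]) -> str:
    return f"Grounded answer to {question!r}"  # stub: LLM call

def answer_support_question(question: str) -> str:
    # The mandatory step runs as deterministic code, every single time.
    # The model never gets to 'decide' whether to fetch context.
    docs = retrieve_docs(question)
    return generate_answer(question, context=docs)
```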
The 'Mega-Prompt' Approach
What They Propose:
"Our system uses a carefully crafted prompt that handles query understanding, retrieval augmentation, response generation, tone matching, and safety filtering in a single LLM call."
Why it's a problem: This is a maintenance nightmare. When accuracy drops, you can't tell if it's retrieval, generation, or formatting. Fixing one thing breaks three others. You're debugging a black box.
What to require instead: Decomposed pipeline with isolated steps. Each step testable independently. When retrieval fails, you know it's retrieval. When tone is off, you fix the tone step. No domino effects.
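As a sketch of what a decomposed pipeline might look like in practice, with every stage a plain function you can unit-test in isolation (stage names and stub bodies are assumptions, not a prescribed architecture):

```python
# Hedged sketch of a decomposed pipeline; all bodies are stubs.

def understand_query(raw: str) -> str:
    return raw.strip()                      # stub: query rewriting

def retrieve(query: str) -> list[str]:
    return ["(passage)"]                    # stub: vector search

def generate(query: str, passages: list[str]) -> str:
    return "(draft answer)"                 # stub: LLM call

def apply_tone(draft: str) -> str:
    return draft                            # stub: tone matching

def safety_filter(text: str) -> str:
    return text                             # stub: policy checks

def pipeline(raw: str) -> str:
    # If accuracy drops, score each intermediate value separately
    # and see exactly which stage regressed.
    query = understand_query(raw)
    passages = retrieve(query)
    draft = generate(query, passages)
    return safety_filter(apply_tone(draft))
```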
The 'LLM Security Guard'
What They Propose:
"The system prompt instructs the model to only access documents the user is authorized to see based on their role description in the conversation."
Why it's a problem: LLMs cannot reliably enforce security boundaries. This approach will pass demos and fail audits. When it fails, you'll need a complete architecture rewrite—not a prompt tweak.
What to require instead: Programmatic access control at the retrieval layer. Filter documents by user permissions before they ever reach the LLM. The model should never see data the user shouldn't access.
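A minimal sketch of permission filtering at the retrieval layer; the `allowed_groups` field is an assumed document schema, not a standard:

```python
# Hedged sketch: authorization enforced in code, before any text
# reaches the model. 'allowed_groups' is an illustrative schema.

def authorized_docs(docs: list[dict], user_groups: set[str]) -> list[dict]:
    """Keep only documents the user's groups are allowed to read."""
    return [d for d in docs if d["allowed_groups"] & user_groups]

docs = [
    {"id": 1, "text": "public FAQ", "allowed_groups": {"everyone"}},
    {"id": 2, "text": "exec compensation memo", "allowed_groups": {"hr"}},
]

# The LLM only ever sees the filtered list; no prompt can leak doc 2.
context = authorized_docs(docs, user_groups={"everyone"})
assert len(context) == 1
```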
Questions That Separate Experience from Sales
These are hard to fake. Vendors with real production experience answer them instantly. Those without will give you vague, deflecting responses.
"Walk me through your chunking strategy."
Good: Explains chunk sizes, overlap, why those choices for your data type
Red flag: 'We use best practices' or 'Our system handles that automatically'
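For reference, a good answer describes something like the sketch below. The 800-character window and 200-character overlap are illustrative numbers; the right values depend on your data:

```python
# Hedged sketch of fixed-size chunking with overlap. 800/200 are
# illustrative, not a recommendation for your documents.

def chunk(text: str, size: int = 800, overlap: int = 200) -> list[str]:
    """Overlapping windows, so ideas spanning a boundary survive in full."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]
```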
"How do you handle document versioning?"
Good: Specific process for when your knowledge base updates
Red flag: 'We can discuss that in phase 2'
"Show me your observability stack."
Good: Screenshots of trace logs, token tracking, feedback capture
Red flag: 'We monitor everything' with no specifics
"What's your regression testing approach?"
Good: Automated test suite that runs on every prompt/model change
Red flag: 'We manually review outputs before deployment'
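As a sketch, the kind of automated check a strong vendor runs in CI on every prompt or model change. The golden-set format and the pipeline stub are assumptions:

```python
# Hedged sketch of a golden-set regression test, pytest-style. The
# pipeline below is a stub; in CI it would call your real system.

def run_pipeline(question: str) -> str:
    return "Click the reset link we email you."  # stub

GOLDEN = [
    {"question": "How do I reset my password?", "must_contain": "reset link"},
]

def test_golden_set():
    for case in GOLDEN:
        answer = run_pipeline(case["question"]).lower()
        assert case["must_contain"] in answer, case["question"]
```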
"How do you define and measure accuracy?"
Good: Decomposed metrics: factual correctness, citation accuracy, format compliance
Red flag: 'We use GPT-4 to grade outputs' with no rubric
"What happens when a prompt fix breaks something else?"
Good: 'Our test suite catches it. Here's an example from last month.'
Red flag: 'That rarely happens' or 'We're very careful'
The Evaluation Process
1. Define Your Metrics
Use the Evaluation Builder to decompose 'high quality' into testable criteria. Don't talk to vendors until you know what you're measuring.
2. Prepare Your Golden Set
Create test cases and split them 80/20: share 20% with vendors, keep 80% private. This prevents 'teaching to the test' and shows how vendors handle real edge cases.
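A minimal sketch of a reproducible split; the fixed seed is an assumption so the shared/private partition stays stable across runs:

```python
# Hedged sketch: deterministic 80/20 split of a golden set.
import random

def split_golden_set(cases: list[dict], shared_frac: float = 0.2, seed: int = 42):
    rng = random.Random(seed)
    shuffled = cases[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * shared_frac)
    return shuffled[:cut], shuffled[cut:]  # (shared_with_vendors, private)
```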
3. Request Specific Proposals
Send vendors your metrics and requirements. Ask for estimates tied to each metric, not just 'the whole project'. Compare responses apples-to-apples.
4. Contract with Acceptance Criteria
Put your metrics in the contract. Payment milestones tied to passing your private test suite. No more subjective 'it looks good' approvals.
Stop Comparing Slide Decks
Define what success looks like before you talk to vendors. You'll get better proposals, more accurate estimates, and contracts you can actually enforce.