How to Evaluate LLM Vendors Without Getting Burned
Every AI vendor promises 95% accuracy. Few can prove it. This guide teaches you how to define your requirements precisely enough that vendors can't hide behind vague proposals—and you get estimates you can actually hold them to.
First step: before evaluating vendors, make sure the idea itself is sound. Validate your AI project idea to surface risks and requirements upfront.
Why Clear Requirements Matter
When requirements are vague, vendors have to make assumptions to write a proposal. Those assumptions create misalignment that only surfaces months later. The more precisely you define what success looks like upfront, the more accurate your estimates and the smoother your project.
| Phase | Vague Requirements | Defined Metrics |
| --- | --- | --- |
| Proposals | Comparing slide decks and promises | Comparing responses to your specific criteria |
| Estimates | Vague timelines: '3-6 months' | Estimates tied to each deliverable |
| Accuracy claims | Everyone claims 95% | Accuracy defined: what counts, how it's tested |
| Development | No way to measure progress | Weekly scores against your test suite |
| Deployment | Demo looks great, production fails | Contract tied to passing acceptance criteria |
Define Your Metrics Before Talking to Vendors
The Evaluation Builder isn't about "reviewing" vendors—it's about simplifying the entire conversation.
When you show up with decomposed metrics (not just "we want good answers"), vendors can give you specific estimates: "Tone matching will take 2 weeks. Accurate citation extraction is harder—4 weeks minimum, and here's why."
Without this, you get black-box proposals where neither you nor the vendor really knows what "done" looks like until you're three months in and burning budget.
What you'll have after using the Evaluation Builder:
- Specific metrics: Accuracy, Tone, Format, Safety—each defined
- Weighted scoring that reflects your actual priorities
- Test cases vendors must pass before you pay
- Language to put in your contract's acceptance criteria
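To make the weighted-scoring idea concrete, here's a minimal sketch of how decomposed metrics might roll up into one number. The metric names and weights are illustrative assumptions, not Evaluation Builder output:

```python
# Minimal sketch: combine per-metric scores (each 0.0-1.0) into one
# weighted number. Metric names and weights are illustrative only.
WEIGHTS = {"accuracy": 0.5, "tone": 0.2, "format": 0.15, "safety": 0.15}

def weighted_score(scores: dict[str, float]) -> float:
    assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9, "weights must sum to 1"
    return sum(WEIGHTS[m] * scores[m] for m in WEIGHTS)

# One vendor response, graded per metric rather than 'looks good':
print(weighted_score({"accuracy": 0.92, "tone": 0.7, "format": 1.0, "safety": 1.0}))
# -> 0.90
```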
Proposals That Should Make You Pause
These aren't hypotheticals—these patterns show up repeatedly in vendor proposals. Each one adds cost, complexity, and risk without adding value.
Misusing Agents for Mandatory Steps
What They Propose:
"We'll build an AI agent with a search_knowledge_base tool. The agent will decide when to use it based on the user's question."
Why it's a problem: Agents are great for choosing between optional actions. But a support bot always needs knowledge base context; that's not a decision, it's a requirement. Making the AI 'decide' to do something mandatory adds latency and creates a failure mode where the model simply skips the retrieval it was supposed to run.
What to require instead: Use deterministic code for steps that must always happen (like RAG retrieval). Reserve agentic tool-calling for genuinely optional actions where the AI's judgment adds value.
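A minimal sketch of the distinction, with the retrieval and generation layers stubbed out (both are hypothetical placeholders for your vector store and LLM calls):

```python
# Hedged sketch: retrieve_docs and generate_answer are stubs standing
# in for a real vector store and LLM API.

def retrieve_docs(question: str) -> list[str]:
    return ["(retrieved knowledge-base passage)"]  # stub: vector search

def generate_answer(question: str, context: list[str]) -> str:
    return f"Grounded answer to {question!r}"  # stub: LLM call

def answer_support_question(question: str) -> str:
    # The mandatory step runs as deterministic code, every single time.
    # The model never gets to 'decide' whether to fetch context.
    docs = retrieve_docs(question)
    return generate_answer(question, context=docs)
```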
The 'Mega-Prompt' Approach
What They Propose:
"Our system uses a carefully crafted prompt that handles query understanding, retrieval augmentation, response generation, tone matching, and safety filtering in a single LLM call."
Why it's a problem: This is a maintenance nightmare. When accuracy drops, you can't tell if it's retrieval, generation, or formatting. Fixing one thing breaks three others. You're debugging a black box.
What to require instead: Decomposed pipeline with isolated steps. Each step testable independently. When retrieval fails, you know it's retrieval. When tone is off, you fix the tone step. No domino effects.
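As a sketch of what a decomposed pipeline might look like in practice, with every stage a plain function you can unit-test in isolation (stage names and stub bodies are assumptions, not a prescribed architecture):

```python
# Hedged sketch of a decomposed pipeline; all bodies are stubs.

def understand_query(raw: str) -> str:
    return raw.strip()                      # stub: query rewriting

def retrieve(query: str) -> list[str]:
    return ["(passage)"]                    # stub: vector search

def generate(query: str, passages: list[str]) -> str:
    return "(draft answer)"                 # stub: LLM call

def apply_tone(draft: str) -> str:
    return draft                            # stub: tone matching

def safety_filter(text: str) -> str:
    return text                             # stub: policy checks

def pipeline(raw: str) -> str:
    # If accuracy drops, score each intermediate value separately
    # and see exactly which stage regressed.
    query = understand_query(raw)
    passages = retrieve(query)
    draft = generate(query, passages)
    return safety_filter(apply_tone(draft))
```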
The 'LLM Security Guard'
What They Propose:
"The system prompt instructs the model to only access documents the user is authorized to see based on their role description in the conversation."
Why it's a problem: LLMs cannot reliably enforce security boundaries. This approach will pass demos and fail audits. When it fails, you'll need a complete architecture rewrite—not a prompt tweak.
What to require instead: Programmatic access control at the retrieval layer. Filter documents by user permissions before they ever reach the LLM. The model should never see data the user shouldn't access.
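A minimal sketch of permission filtering at the retrieval layer; the `allowed_groups` field is an assumed document schema, not a standard:

```python
# Hedged sketch: authorization enforced in code, before any text
# reaches the model. 'allowed_groups' is an illustrative schema.

def authorized_docs(docs: list[dict], user_groups: set[str]) -> list[dict]:
    """Keep only documents the user's groups are allowed to read."""
    return [d for d in docs if d["allowed_groups"] & user_groups]

docs = [
    {"id": 1, "text": "public FAQ", "allowed_groups": {"everyone"}},
    {"id": 2, "text": "exec compensation memo", "allowed_groups": {"hr"}},
]

# The LLM only ever sees the filtered list; no prompt can leak doc 2.
context = authorized_docs(docs, user_groups={"everyone"})
assert len(context) == 1
```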
Questions That Separate Experience from Sales
These are hard to fake. Vendors with real production experience answer them instantly. Those without will give you vague, deflecting responses.
"Walk me through your chunking strategy."
Good: Explains chunk sizes, overlap, why those choices for your data type
Red flag: 'We use best practices' or 'Our system handles that automatically'
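For reference, a good answer describes something like the sketch below. The 800-character window and 200-character overlap are illustrative numbers; the right values depend on your data:

```python
# Hedged sketch of fixed-size chunking with overlap. 800/200 are
# illustrative, not a recommendation for your documents.

def chunk(text: str, size: int = 800, overlap: int = 200) -> list[str]:
    """Overlapping windows, so ideas spanning a boundary survive in full."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]
```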
"How do you handle document versioning?"
Good: Specific process for when your knowledge base updates
Red flag: 'We can discuss that in phase 2'
"Show me your observability stack."
Good: Screenshots of trace logs, token tracking, feedback capture
Red flag: 'We monitor everything' with no specifics
"What's your regression testing approach?"
Good: Automated test suite that runs on every prompt/model change
Red flag: 'We manually review outputs before deployment'
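As a sketch, the kind of automated check a strong vendor runs in CI on every prompt or model change. The golden-set format and the pipeline stub are assumptions:

```python
# Hedged sketch of a golden-set regression test, pytest-style. The
# pipeline below is a stub; in CI it would call your real system.

def run_pipeline(question: str) -> str:
    return "Click the reset link we email you."  # stub

GOLDEN = [
    {"question": "How do I reset my password?", "must_contain": "reset link"},
]

def test_golden_set():
    for case in GOLDEN:
        answer = run_pipeline(case["question"]).lower()
        assert case["must_contain"] in answer, case["question"]
```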
"How do you define and measure accuracy?"
Good: Decomposed metrics: factual correctness, citation accuracy, format compliance
Red flag: 'We use GPT-4 to grade outputs' with no rubric
"What happens when a prompt fix breaks something else?"
Good: 'Our test suite catches it. Here's an example from last month.'
Red flag: 'That rarely happens' or 'We're very careful'
The Evaluation Process
1. Define Your Metrics
Use the Evaluation Builder to decompose 'high quality' into testable criteria. Don't talk to vendors until you know what you're measuring.
2. Prepare Your Golden Set
Create test cases and split them 80/20: share 20% with vendors, keep 80% private. This prevents 'teaching to the test' and shows how vendors handle real edge cases.
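A minimal sketch of a reproducible split; the fixed seed is an assumption so the shared/private partition stays stable across runs:

```python
# Hedged sketch: deterministic 80/20 split of a golden set.
import random

def split_golden_set(cases: list[dict], shared_frac: float = 0.2, seed: int = 42):
    rng = random.Random(seed)
    shuffled = cases[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * shared_frac)
    return shuffled[:cut], shuffled[cut:]  # (shared_with_vendors, private)
```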
3. Request Specific Proposals
Send vendors your metrics and requirements. Ask for estimates tied to each metric, not just 'the whole project'. Compare responses apples-to-apples.
4. Contract with Acceptance Criteria
Put your metrics in the contract. Payment milestones tied to passing your private test suite. No more subjective 'it looks good' approvals.
Stop Comparing Slide Decks
Define what success looks like before you talk to vendors. You'll get better proposals, more accurate estimates, and contracts you can actually enforce.