LLM Vendor Selection & Contract Requirements
Nuanced evaluation for real-world systems
Moving beyond generic benchmarks, this checklist focuses on robust evaluation methodology, strategic decomposition, and practical contract KPIs.
Vendor Proposal Evaluation (Red Flags & Warnings)
Manual Evaluation Suggested
Red Flag: If a vendor suggests 'manual spot-checking' as the primary validation method.
Why this matters
Manual evaluation never ends. Without an automated test suite, one 'fix' in a prompt will break five other things.
Lack of Problem Decomposition
Red Flag: A single 'mega-prompt' trying to handle reasoning, retrieval, and formatting in one step.
Why this matters
High risk of 'prompt oscillation': fixing one failure mode causes three new ones. Small, manageable steps allow for granular testing and the use of faster, cheaper models.
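For illustration only, a minimal sketch of what this decomposition can look like: retrieval, reasoning, and formatting live in separate functions so each can be tested and model-swapped on its own. The function names and the `call_llm` helper are hypothetical placeholders, not any specific vendor's stack.

```python
# Hypothetical sketch: one mega-prompt replaced by three small steps, each testable
# (and swappable to a cheaper model) on its own. `call_llm` is a stand-in client.

def call_llm(prompt: str, model: str = "small-fast-model") -> str:
    raise NotImplementedError("wire in your actual LLM client here")

def retrieve(question: str, corpus: dict[str, str]) -> list[str]:
    # Deterministic retrieval step: unit-testable with plain asserts, no LLM involved.
    words = question.lower().split()
    return [text for text in corpus.values() if any(w in text.lower() for w in words)]

def answer(question: str, passages: list[str]) -> str:
    # Narrow reasoning prompt: one responsibility, easy to evaluate in isolation.
    return call_llm("Answer strictly from this context:\n"
                    + "\n\n".join(passages) + f"\n\nQuestion: {question}")

def format_answer(draft: str) -> str:
    # Formatting step: can run on a faster, cheaper model than the reasoning step.
    return call_llm(f"Rewrite as a concise bulleted answer:\n{draft}", model="small-fast-model")

def pipeline(question: str, corpus: dict[str, str]) -> str:
    return format_answer(answer(question, retrieve(question, corpus)))
```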
Security Managed by LLM Logic
Red Flag: Relying on the LLM's system prompt or 'judgment' to decide if it should access or expose sensitive data from a connected source.
Why this matters
Security must be enforced programmatically (e.g., via metadata filtering or access control lists at the retrieval layer). LLMs cannot reliably enforce security boundaries. Solutions that delegate access control to the model often require a complete architecture rewrite to fix.
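A minimal sketch of 'enforced programmatically': documents carry access metadata, and the retrieval layer filters by the caller's entitlements before any text can reach the prompt. The data model and field names below are illustrative assumptions.

```python
# Illustrative sketch: access control applied in code at the retrieval layer,
# before any document text can reach the LLM prompt. The schema is an assumption.
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    allowed_groups: frozenset[str]  # ACL metadata attached at ingestion time

def retrieve_for_user(query_hits: list[Chunk], user_groups: set[str]) -> list[Chunk]:
    # Hard filter: the model never sees chunks the user is not entitled to,
    # regardless of what the system prompt says.
    return [c for c in query_hits if c.allowed_groups & user_groups]

hits = [
    Chunk("Public pricing FAQ", frozenset({"everyone"})),
    Chunk("M&A due-diligence notes", frozenset({"legal", "exec"})),
]
print([c.text for c in retrieve_for_user(hits, {"everyone", "support"})])
# -> ['Public pricing FAQ']
```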
No Monitoring or Observability Plan
Red Flag: Missing plans for trace logging, token cost tracking, or user feedback capture.
Why this matters
This often indicates a lack of real-world production experience. If you can't see what the LLM did at step 2 of 5, you can't debug it when it fails in the wild.
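As a sketch of the minimum observability worth asking for: every step emits a structured trace record (step name, tokens, latency, previews) tied to one trace ID, so a failure at step 2 of 5 is visible after the fact. Field names and the print-to-stdout sink are assumptions, not a specific tracing product.

```python
# Illustrative structured trace logging for a multi-step LLM pipeline.
import json, time, uuid

def log_step(trace_id: str, step: str, prompt: str, output: str,
             tokens_in: int, tokens_out: int, started: float) -> None:
    record = {
        "trace_id": trace_id,              # ties all steps of one request together
        "step": step,                      # e.g. "retrieve", "rerank", "generate"
        "latency_ms": round((time.monotonic() - started) * 1000, 1),
        "tokens_in": tokens_in,            # feeds token cost tracking
        "tokens_out": tokens_out,
        "prompt_preview": prompt[:200],    # enough to debug, small enough to store
        "output_preview": output[:200],
    }
    print(json.dumps(record))              # in production: ship to your log/trace store

trace_id = str(uuid.uuid4())
t0 = time.monotonic()
log_step(trace_id, "generate", "Answer strictly from this context: ...",
         "Refunds are accepted within 14 days...", 1320, 85, t0)
```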
Vague or General Evaluation Criteria
Warning: Using generic terms like 'high quality' or 'helpful' without defining atomic, measurable metrics.
Why this matters
If the evaluation criteria are too general, 'success' becomes subjective and you get stuck in an endless loop of 'refinements' that never reach production readiness.
Unnecessary Agentic Elements
Warning: Using autonomous agents to decide tool usage when the logic is deterministic.
Why this matters
Every extra LLM call adds latency and cost. If a tool MUST be called (e.g., searching your knowledge base), use code, not an agentic decision.
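A small sketch of the distinction: if the knowledge base must always be searched, call it unconditionally in code rather than spending an LLM round-trip on a 'should I search?' decision. `search_kb` and `call_llm` are hypothetical stubs.

```python
# Illustrative comparison: a mandatory tool call written as plain code vs. an extra
# agentic "should I call the tool?" LLM hop. Both helpers are hypothetical stubs.
def search_kb(query: str) -> list[str]:
    raise NotImplementedError("your knowledge-base search")

def call_llm(prompt: str) -> str:
    raise NotImplementedError("your LLM client")

def answer_agentic(question: str) -> str:
    # Anti-pattern for this case: pays an extra LLM call (latency + cost) to decide
    # something that is already known to be mandatory, and the model might say "no".
    decision = call_llm(f"Should I search the knowledge base for: {question!r}? Answer yes or no.")
    passages = search_kb(question) if decision.strip().lower().startswith("yes") else []
    return call_llm(f"Context: {passages}\nQuestion: {question}")

def answer_deterministic(question: str) -> str:
    # The tool call is ordinary code: no extra hop, no chance of it being skipped.
    passages = search_kb(question)
    return call_llm(f"Context: {passages}\nQuestion: {question}")
```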
Vague Update/Maintenance Strategy
Nice to have: No plan for how to handle schema changes or document versioning in the knowledge base.
Why this matters
A static RAG demo is easy. A system that stays accurate as your underlying data changes is a core production requirement.
Evaluation Requirements & Methodology
Golden Set Strategy (80/20 Split)
Required: Prepare a large set of question/answer pairs. Keep 80% private during development and share only 20% with the vendor to avoid 'overfitting' the solution to the test.
Why this matters
If a vendor optimizes against the full test set, you won't know if the system actually generalizes to real users until it moves to production.
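A sketch of the 80/20 split, assuming the golden set is stored as JSONL with question/answer records (file name and record shape are assumptions): shuffle with a fixed seed, keep the 80% holdout internally, and hand the vendor only the remaining 20%.

```python
# Illustrative 80/20 split of a golden set stored as JSONL (assumed format).
# The 80% holdout never leaves your side of the table.
import json, random

def split_golden_set(path: str, seed: int = 7) -> tuple[list[dict], list[dict]]:
    with open(path, encoding="utf-8") as f:
        items = [json.loads(line) for line in f]
    random.Random(seed).shuffle(items)        # fixed seed -> reproducible split
    cut = int(len(items) * 0.8)
    private_holdout, shared_with_vendor = items[:cut], items[cut:]
    return private_holdout, shared_with_vendor

# Usage (assumed file name): only `shared` goes to the vendor.
# private_holdout, shared = split_golden_set("golden_set.jsonl")
```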
Decomposed Accuracy Metrics
Recommended: Instead of a single 'accuracy' score, use weighted multi-level metrics (e.g., 'Did it cite the source?', 'Is the tone correct?', 'Is the formatting valid?').
Why this matters
A composite of simple, binary checks provides the most robust evaluation of overall quality.
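A sketch of that composite: each criterion is a binary check on a response, and the overall score is a weighted mean of the checks. The criteria, weights, and the crude check logic below are illustrative assumptions.

```python
# Illustrative weighted composite of binary checks (criteria and weights are assumptions).
def score_response(response: str, expected_source: str) -> float:
    checks = {
        "cites_source": (5, expected_source in response),
        "valid_format": (3, response.strip().startswith("-")),  # e.g. must be a bulleted answer
        "correct_tone": (2, "!" not in response),               # crude stand-in for a real tone check
    }
    total = sum(weight for weight, _ in checks.values())
    return sum(weight for weight, passed in checks.values() if passed) / total

print(score_response("- See the refund policy in [doc-42]!", "doc-42"))  # 0.8: the tone check fails
```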
PoC Accuracy Thresholds
Nice to have: Don't demand 95% accuracy for a PoC. Define what '70%' actually means, e.g., 30% of questions going unanswered vs. minor formatting issues.
Why this matters
Premature optimization for 95% accuracy can lead to brittle prompts and skyrocketing costs before the core value is even proven.
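One way to make a '70%' figure concrete (a sketch; the category labels are assumptions): report the failure breakdown alongside the headline number, so 'failed to answer' and 'minor formatting issue' are never lumped together.

```python
# Illustrative failure breakdown so a headline accuracy number is interpretable.
from collections import Counter

results = ["pass", "pass", "no_answer", "pass", "formatting", "pass", "wrong_answer",
           "pass", "pass", "pass"]  # per-question outcomes from a PoC run (assumed labels)

breakdown = Counter(results)
accuracy = breakdown["pass"] / len(results)
print(f"accuracy={accuracy:.0%}", dict(breakdown))
# accuracy=70% {'pass': 7, 'no_answer': 1, 'formatting': 1, 'wrong_answer': 1}
```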
Automated Regression Testing
Required: Vendor must provide a framework to automatically re-verify the Golden Set after any change to the model, architecture, or retrieval pipeline.
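A sketch of what 'automatically re-verify the Golden Set' can look like in practice, here as a parametrized pytest module; `run_pipeline` and the cases are assumed placeholders, not a vendor's actual framework.

```python
# Illustrative regression test: re-runs golden-set cases against the current pipeline
# after every change. `run_pipeline` and the cases below are assumed placeholders.
import pytest

GOLDEN_CASES = [
    {"question": "What is the refund window for EU orders?", "expected_source": "policy-eu-returns"},
    {"question": "How do I rotate an API key?", "expected_source": "kb-api-keys"},
]

def run_pipeline(question: str) -> str:
    raise NotImplementedError("call the deployed pipeline / vendor endpoint here")

@pytest.mark.parametrize("case", GOLDEN_CASES, ids=lambda c: c["expected_source"])
def test_golden_case(case):
    answer = run_pipeline(case["question"])
    # Atomic, binary assertions, mirroring the decomposed scoring checks.
    assert case["expected_source"] in answer
    assert len(answer) < 1500
```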
Contract: Performance & Operations
Guardrails Implemented
Required: Strict requirements for PII masking, safety filtering, and tone consistency checks.
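A minimal sketch of a programmatic PII guardrail: mask obvious identifiers before text reaches the model (and again on the way out). The regexes are deliberately simple assumptions; production systems typically use a dedicated PII detection library.

```python
# Illustrative PII masking guardrail (simplified regexes, not production-grade detection).
import re

PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}

def mask_pii(text: str) -> str:
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

print(mask_pii("Contact jane.doe@example.com or +1 (555) 010-2345 about the refund."))
# Contact [EMAIL] or [PHONE] about the refund.
```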
Average & Median Latency
Recommended: Targets for typical generation times. Important for production, but secondary to accuracy during the PoC phase.
Availability (Uptime)
Nice to have: Target 99.5%+, but recognize that AI infrastructure is volatile; not strictly required for early phases.
P95 / P99 Latency
Nice to have: Useful for tracking outliers and tail-end performance issues.
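A quick sketch of how these latency figures relate, using Python's statistics module on per-request samples; the sample values are invented for illustration.

```python
# Illustrative latency summary: mean/median for the typical case, p95/p99 for tail behavior.
import statistics

latencies_ms = [820, 910, 1005, 870, 940, 3100, 890, 980, 860, 5200,
                900, 950, 1010, 930, 880, 860, 990, 940, 910, 1200]

mean = statistics.fmean(latencies_ms)
median = statistics.median(latencies_ms)
cuts = statistics.quantiles(latencies_ms, n=100)   # 99 percentile cut points
p95, p99 = cuts[94], cuts[98]
print(f"mean={mean:.0f}ms median={median:.0f}ms p95={p95:.0f}ms p99={p99:.0f}ms")
# A healthy median can hide a painful tail, hence tracking p95/p99 separately.
```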
Proposal Scoring Rubric
Architecture Completeness
Required: Detailed chunking, embedding, retrieval (rerankers), and context injection strategy.
Cost Model Accuracy
Recommended: Breakdown of token costs, infrastructure, and vector DB hosting at scale.
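A back-of-the-envelope sketch of the kind of breakdown a proposal should contain; every number below (traffic, token counts, unit prices, hosting costs) is a made-up assumption to show the arithmetic, not a quote.

```python
# Illustrative monthly cost model. All figures are placeholder assumptions.
requests_per_day   = 5_000
tokens_in_per_req  = 3_000      # prompt + retrieved context
tokens_out_per_req = 400
price_in_per_1k    = 0.003      # $/1K input tokens (assumed)
price_out_per_1k   = 0.015      # $/1K output tokens (assumed)
vector_db_monthly  = 600        # hosting (assumed)
infra_monthly      = 900        # gateways, logging, eval runs (assumed)

llm_monthly = requests_per_day * 30 * (
    tokens_in_per_req / 1000 * price_in_per_1k
    + tokens_out_per_req / 1000 * price_out_per_1k
)
total = llm_monthly + vector_db_monthly + infra_monthly
print(f"LLM tokens: ${llm_monthly:,.0f}/mo  |  total: ${total:,.0f}/mo")
# LLM tokens: $2,250/mo  |  total: $3,750/mo
```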
Reference Checks
Nice to have: Speak to a customer who has been in production (not just PoC) for >3 months.