LLM Vendor Selection & Contract Requirements
Nuanced evaluation for real-world systems
Moving beyond generic benchmarks, this checklist focuses on robust evaluation methodology, strategic decomposition, and practical contract KPIs.
Vendor Proposal Evaluation (Red Flags & Warnings)
Manual Evaluation Suggested
Red Flag: If a vendor suggests 'manual spot-checking' as the primary validation method.
Why this matters
Manual evaluation never ends. Without an automated test suite, one 'fix' in a prompt will break five other things.
Lack of Problem Decomposition
Red Flag: A single 'mega-prompt' trying to handle reasoning, retrieval, and formatting in one step.
Why this matters
High risk of 'prompt oscillation': fixing one failure mode causes three new ones. Small, manageable steps allow for granular testing and the use of faster, cheaper models.
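For illustration only, a minimal sketch of what this decomposition can look like: retrieval, reasoning, and formatting live in separate functions so each can be tested and model-swapped on its own. The function names and the `call_llm` helper are hypothetical placeholders, not any specific vendor's stack.

```python
# Hypothetical sketch: one mega-prompt replaced by three small steps, each testable
# (and swappable to a cheaper model) on its own. `call_llm` is a stand-in client.

def call_llm(prompt: str, model: str = "small-fast-model") -> str:
    raise NotImplementedError("wire in your actual LLM client here")

def retrieve(question: str, corpus: dict[str, str]) -> list[str]:
    # Deterministic retrieval step: unit-testable with plain asserts, no LLM involved.
    words = question.lower().split()
    return [text for text in corpus.values() if any(w in text.lower() for w in words)]

def answer(question: str, passages: list[str]) -> str:
    # Narrow reasoning prompt: one responsibility, easy to evaluate in isolation.
    return call_llm("Answer strictly from this context:\n"
                    + "\n\n".join(passages) + f"\n\nQuestion: {question}")

def format_answer(draft: str) -> str:
    # Formatting step: can run on a faster, cheaper model than the reasoning step.
    return call_llm(f"Rewrite as a concise bulleted answer:\n{draft}", model="small-fast-model")

def pipeline(question: str, corpus: dict[str, str]) -> str:
    return format_answer(answer(question, retrieve(question, corpus)))
```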
Security Managed by LLM Logic
Red Flag: Relying on the LLM's system prompt or 'judgment' to decide if it should access or expose sensitive data from a connected source.
Why this matters
Security must be enforced programmatically (e.g., via metadata filtering or access control lists at the retrieval layer). LLMs cannot reliably enforce security boundaries. Solutions that delegate access control to the model often require a complete architecture rewrite to fix.
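A minimal sketch of 'enforced programmatically': documents carry access metadata, and the retrieval layer filters by the caller's entitlements before any text can reach the prompt. The data model and field names below are illustrative assumptions.

```python
# Illustrative sketch: access control applied in code at the retrieval layer,
# before any document text can reach the LLM prompt. The schema is an assumption.
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    allowed_groups: frozenset[str]  # ACL metadata attached at ingestion time

def retrieve_for_user(query_hits: list[Chunk], user_groups: set[str]) -> list[Chunk]:
    # Hard filter: the model never sees chunks the user is not entitled to,
    # regardless of what the system prompt says.
    return [c for c in query_hits if c.allowed_groups & user_groups]

hits = [
    Chunk("Public pricing FAQ", frozenset({"everyone"})),
    Chunk("M&A due-diligence notes", frozenset({"legal", "exec"})),
]
print([c.text for c in retrieve_for_user(hits, {"everyone", "support"})])
# -> ['Public pricing FAQ']
```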
No Monitoring or Observability Plan
Red Flag: Missing plans for trace logging, token cost tracking, or user feedback capture.
Why this matters
This often indicates a lack of real-world production experience. If you can't see what the LLM did at step 2 of 5, you can't debug it when it fails in the wild.
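As a sketch of the minimum observability worth asking for: every step emits a structured trace record (step name, tokens, latency, previews) tied to one trace ID, so a failure at step 2 of 5 is visible after the fact. Field names and the print-to-stdout sink are assumptions, not a specific tracing product.

```python
# Illustrative structured trace logging for a multi-step LLM pipeline.
import json, time, uuid

def log_step(trace_id: str, step: str, prompt: str, output: str,
             tokens_in: int, tokens_out: int, started: float) -> None:
    record = {
        "trace_id": trace_id,              # ties all steps of one request together
        "step": step,                      # e.g. "retrieve", "rerank", "generate"
        "latency_ms": round((time.monotonic() - started) * 1000, 1),
        "tokens_in": tokens_in,            # feeds token cost tracking
        "tokens_out": tokens_out,
        "prompt_preview": prompt[:200],    # enough to debug, small enough to store
        "output_preview": output[:200],
    }
    print(json.dumps(record))              # in production: ship to your log/trace store

trace_id = str(uuid.uuid4())
t0 = time.monotonic()
log_step(trace_id, "generate", "Answer strictly from this context: ...",
         "Refunds are accepted within 14 days...", 1320, 85, t0)
```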
Vague or General Evaluation Criteria
Warning: Using generic terms like 'high quality' or 'helpful' without defining atomic, measurable metrics.
Why this matters
If the evaluation criteria are too general, 'success' becomes subjective and you get stuck in an endless loop of 'refinements' that never reach production readiness.
Unnecessary Agentic Elements
Warning: Using autonomous agents to decide tool usage when the logic is deterministic.
Why this matters
Every extra LLM call adds latency and cost. If a tool MUST be called (e.g., searching your knowledge base), use code, not an agentic decision.
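A small sketch of the distinction: if the knowledge base must always be searched, call it unconditionally in code rather than spending an LLM round-trip on a 'should I search?' decision. `search_kb` and `call_llm` are hypothetical stubs.

```python
# Illustrative comparison: a mandatory tool call written as plain code vs. an extra
# agentic "should I call the tool?" LLM hop. Both helpers are hypothetical stubs.
def search_kb(query: str) -> list[str]:
    raise NotImplementedError("your knowledge-base search")

def call_llm(prompt: str) -> str:
    raise NotImplementedError("your LLM client")

def answer_agentic(question: str) -> str:
    # Anti-pattern for this case: pays an extra LLM call (latency + cost) to decide
    # something that is already known to be mandatory, and the model might say "no".
    decision = call_llm(f"Should I search the knowledge base for: {question!r}? Answer yes or no.")
    passages = search_kb(question) if decision.strip().lower().startswith("yes") else []
    return call_llm(f"Context: {passages}\nQuestion: {question}")

def answer_deterministic(question: str) -> str:
    # The tool call is ordinary code: no extra hop, no chance of it being skipped.
    passages = search_kb(question)
    return call_llm(f"Context: {passages}\nQuestion: {question}")
```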
Vague Update/Maintenance Strategy
Nice to have: No plan for how to handle schema changes or document versioning in the knowledge base.
Why this matters
A static RAG demo is easy. A system that stays accurate as your underlying data changes is a core production requirement.
Evaluation Requirements & Methodology
Golden Set Strategy (80/20 Split)
Required: Prepare a large set of question/answer pairs. Keep 80% private during development and share only 20% with the vendor to avoid 'overfitting' the solution to the test.
Why this matters
If a vendor optimizes against the full test set, you won't know if the system actually generalizes to real users until it moves to production.
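A sketch of the 80/20 split, assuming the golden set is stored as JSONL with question/answer records (file name and record shape are assumptions): shuffle with a fixed seed, keep the 80% holdout internally, and hand the vendor only the remaining 20%.

```python
# Illustrative 80/20 split of a golden set stored as JSONL (assumed format).
# The 80% holdout never leaves your side of the table.
import json, random

def split_golden_set(path: str, seed: int = 7) -> tuple[list[dict], list[dict]]:
    with open(path, encoding="utf-8") as f:
        items = [json.loads(line) for line in f]
    random.Random(seed).shuffle(items)        # fixed seed -> reproducible split
    cut = int(len(items) * 0.8)
    private_holdout, shared_with_vendor = items[:cut], items[cut:]
    return private_holdout, shared_with_vendor

# Usage (assumed file name): only `shared` goes to the vendor.
# private_holdout, shared = split_golden_set("golden_set.jsonl")
```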
Decomposed Accuracy Metrics
Recommended: Instead of a single 'accuracy' score, use weighted multi-level metrics (e.g., 'Did it cite the source?', 'Is the tone correct?', 'Is the formatting valid?').
Why this matters
A composite of simple, binary checks provides the most robust evaluation of overall quality.
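A sketch of that composite: each criterion is a binary check on a response, and the overall score is a weighted mean of the checks. The criteria, weights, and the crude check logic below are illustrative assumptions.

```python
# Illustrative weighted composite of binary checks (criteria and weights are assumptions).
def score_response(response: str, expected_source: str) -> float:
    checks = {
        "cites_source": (5, expected_source in response),
        "valid_format": (3, response.strip().startswith("-")),  # e.g. must be a bulleted answer
        "correct_tone": (2, "!" not in response),               # crude stand-in for a real tone check
    }
    total = sum(weight for weight, _ in checks.values())
    return sum(weight for weight, passed in checks.values() if passed) / total

print(score_response("- See the refund policy in [doc-42]!", "doc-42"))  # 0.8: the tone check fails
```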
PoC Accuracy Thresholds
Nice to have: Don't demand 95% accuracy for a PoC. Define what '70%' actually means, e.g., 30% of questions going unanswered vs. minor formatting issues.
Why this matters
Premature optimization for 95% accuracy can lead to brittle prompts and skyrocketing costs before the core value is even proven.
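One way to make a '70%' figure concrete (a sketch; the category labels are assumptions): report the failure breakdown alongside the headline number, so 'failed to answer' and 'minor formatting issue' are never lumped together.

```python
# Illustrative failure breakdown so a headline accuracy number is interpretable.
from collections import Counter

results = ["pass", "pass", "no_answer", "pass", "formatting", "pass", "wrong_answer",
           "pass", "pass", "pass"]  # per-question outcomes from a PoC run (assumed labels)

breakdown = Counter(results)
accuracy = breakdown["pass"] / len(results)
print(f"accuracy={accuracy:.0%}", dict(breakdown))
# accuracy=70% {'pass': 7, 'no_answer': 1, 'formatting': 1, 'wrong_answer': 1}
```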
Automated Regression Testing
Required: Vendor must provide a framework to automatically re-verify the Golden Set after any change to the model, architecture, or retrieval pipeline.
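A sketch of what 'automatically re-verify the Golden Set' can look like in practice, here as a parametrized pytest module; `run_pipeline` and the cases are assumed placeholders, not a vendor's actual framework.

```python
# Illustrative regression test: re-runs golden-set cases against the current pipeline
# after every change. `run_pipeline` and the cases below are assumed placeholders.
import pytest

GOLDEN_CASES = [
    {"question": "What is the refund window for EU orders?", "expected_source": "policy-eu-returns"},
    {"question": "How do I rotate an API key?", "expected_source": "kb-api-keys"},
]

def run_pipeline(question: str) -> str:
    raise NotImplementedError("call the deployed pipeline / vendor endpoint here")

@pytest.mark.parametrize("case", GOLDEN_CASES, ids=lambda c: c["expected_source"])
def test_golden_case(case):
    answer = run_pipeline(case["question"])
    # Atomic, binary assertions, mirroring the decomposed scoring checks.
    assert case["expected_source"] in answer
    assert len(answer) < 1500
```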
Contract: Performance & Operations
Guardrails Implemented
Required: Strict requirements for PII masking, safety filtering, and tone consistency checks.
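A minimal sketch of a programmatic PII guardrail: mask obvious identifiers before text reaches the model (and again on the way out). The regexes are deliberately simple assumptions; production systems typically use a dedicated PII detection library.

```python
# Illustrative PII masking guardrail (simplified regexes, not production-grade detection).
import re

PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}

def mask_pii(text: str) -> str:
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

print(mask_pii("Contact jane.doe@example.com or +1 (555) 010-2345 about the refund."))
# Contact [EMAIL] or [PHONE] about the refund.
```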
Average & Median Latency
Recommended: Targets for typical generation times. Important for production, but secondary to accuracy during the PoC phase.
Availability (Uptime)
Nice to have: Target 99.5%+, but recognize that AI infrastructure is volatile; not strictly required for early phases.
P95 / P99 Latency
Nice to have: Useful for tracking outliers and tail-end performance issues.
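A quick sketch of how these latency figures relate, using Python's statistics module on per-request samples; the sample values are invented for illustration.

```python
# Illustrative latency summary: mean/median for the typical case, p95/p99 for tail behavior.
import statistics

latencies_ms = [820, 910, 1005, 870, 940, 3100, 890, 980, 860, 5200,
                900, 950, 1010, 930, 880, 860, 990, 940, 910, 1200]

mean = statistics.fmean(latencies_ms)
median = statistics.median(latencies_ms)
cuts = statistics.quantiles(latencies_ms, n=100)   # 99 percentile cut points
p95, p99 = cuts[94], cuts[98]
print(f"mean={mean:.0f}ms median={median:.0f}ms p95={p95:.0f}ms p99={p99:.0f}ms")
# A healthy median can hide a painful tail, hence tracking p95/p99 separately.
```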
Proposal Scoring Rubric
Architecture Completeness
Required: Detailed chunking, embedding, retrieval (rerankers), and context injection strategy.
Cost Model Accuracy
Recommended: Breakdown of token costs, infrastructure, and vector DB hosting at scale.
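A back-of-the-envelope sketch of the kind of breakdown a proposal should contain; every number below (traffic, token counts, unit prices, hosting costs) is a made-up assumption to show the arithmetic, not a quote.

```python
# Illustrative monthly cost model. All figures are placeholder assumptions.
requests_per_day   = 5_000
tokens_in_per_req  = 3_000      # prompt + retrieved context
tokens_out_per_req = 400
price_in_per_1k    = 0.003      # $/1K input tokens (assumed)
price_out_per_1k   = 0.015      # $/1K output tokens (assumed)
vector_db_monthly  = 600        # hosting (assumed)
infra_monthly      = 900        # gateways, logging, eval runs (assumed)

llm_monthly = requests_per_day * 30 * (
    tokens_in_per_req / 1000 * price_in_per_1k
    + tokens_out_per_req / 1000 * price_out_per_1k
)
total = llm_monthly + vector_db_monthly + infra_monthly
print(f"LLM tokens: ${llm_monthly:,.0f}/mo  |  total: ${total:,.0f}/mo")
# LLM tokens: $2,250/mo  |  total: $3,750/mo
```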
Reference Checks
Nice to have: Speak to a customer who has been in production (not just PoC) for >3 months.