"It looks good" is not an evaluation. If you ship LLM features, you need a repeatable way to measure quality so you can tell whether a prompt tweak, a model swap, or a temperature change actually helped. This article lays out practical evaluation methods you can run today against the Model Database API.
Start with a labeled dataset
Collect 50–200 real inputs that represent your traffic, including the awkward edge cases. For each, record what a good answer looks like: an exact expected value, an acceptable range, or a rubric. This fixed set is your regression suite; every change gets scored against the same examples so results are comparable.
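One way to store such a set is one JSON object per line, with each case stating which kind of expectation applies. The sketch below is only an illustration: the field names (`id`, `input`, `expect`, `kind`) and the sample cases are assumptions, not a required schema.

```python
import json

# Hypothetical case schema: each case says how it should be judged.
CASES_JSONL = """\
{"id": "c1", "input": "Is 'refund please' a complaint?", "expect": {"kind": "exact", "value": "yes"}}
{"id": "c2", "input": "Estimate delivery days for zone 3.", "expect": {"kind": "range", "min": 2, "max": 5}}
{"id": "c3", "input": "Summarize the outage report.", "expect": {"kind": "rubric", "text": "Mentions cause, impact, and fix."}}
"""

def load_cases(text):
    """Parse JSONL text into a list of case dicts, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]

def check_case(case, output):
    """Score the deterministic kinds; return None for rubric cases that need a judge."""
    expect = case["expect"]
    if expect["kind"] == "exact":
        return output.strip().lower() == expect["value"]
    if expect["kind"] == "range":
        try:
            return expect["min"] <= float(output) <= expect["max"]
        except ValueError:
            return False  # output was not a number at all
    return None  # rubric: defer to a human or a judge model

dataset = load_cases(CASES_JSONL)
```

Keeping the expectation type inside each case lets one suite mix exact-answer, numeric, and rubric-scored examples without separate files.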
Choose the right scoring method
- Exact / structural match: for classification, extraction, or anything with a known answer. Cheap and deterministic.
- Heuristic checks: does the output parse as JSON, contain a required field, stay under a length cap, avoid a banned phrase?
- Semantic similarity: embed the output and reference, compare with cosine similarity for free-form text.
- LLM-as-judge: ask a strong model to score against a rubric when the task is subjective.
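The first and third methods in the list can be sketched in a few lines. This version assumes the embeddings have already been computed by whatever embedding model you use and arrive as plain lists of floats; the 0.8 threshold is an arbitrary starting point, not a recommended value.

```python
import math

def exact_match(output, expected):
    """Case- and whitespace-insensitive comparison for known-answer tasks."""
    return " ".join(output.split()).lower() == " ".join(expected.split()).lower()

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors; 0.0 if either is all zeros."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def semantic_pass(output_vec, reference_vec, threshold=0.8):
    """Assumed threshold; tune it against examples you have labeled by hand."""
    return cosine_similarity(output_vec, reference_vec) >= threshold
```

Similarity scores only rank closeness of meaning in embedding space, so it is worth checking a few borderline pairs by hand before trusting any cutoff.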
Deterministic checks first
Most failures are catchable without a model. Run cheap assertions before anything fancy.
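import json

def grade(output, case):
    # Parse once so the field check runs on the decoded object, not the raw string.
    try:
        data, is_json = json.loads(output), True
    except json.JSONDecodeError:
        data, is_json = None, False
    checks = {
        "is_json": is_json,
        "has_fields": isinstance(data, dict) and all(k in data for k in case["required"]),
        "within_len": len(output) <= case["max_len"],
    }
    return checks, all(checks.values())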
LLM-as-judge, done carefully
For quality dimensions like helpfulness or faithfulness, a judge model scales better than human review. Give it a narrow rubric and force a structured verdict.
import json
from openai import OpenAI

client = OpenAI(base_url="https://modeldatabase.com/v1", api_key="mdb_live_...")

JUDGE = """Score the answer 1-5 for faithfulness to the context.
5 = every claim supported; 1 = mostly unsupported.
Return JSON: {"score": int, "reason": string}."""

def judge(context, answer):
    r = client.chat.completions.create(
        model="anthropic/claude-opus-4-8",
        temperature=0,  # keep the judge as repeatable as possible
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": JUDGE},
            {"role": "user", "content": f"Context:\n{context}\n\nAnswer:\n{answer}"},
        ],
    )
    # Decode the verdict so callers get a dict with "score" and "reason".
    return json.loads(r.choices[0].message.content)
Use a different (ideally stronger) model as the judge than the one under test, keep temperature at 0, and validate the judge itself against a handful of human-labeled cases. Judges have biases: they tend to favor longer answers and their own style, so don't treat the score as ground truth.
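Validating the judge can be as simple as comparing its scores with human scores on the same cases. The sketch below assumes both are lists of 1-5 integers in the same order; the function names and the one-point tolerance are illustrative choices.

```python
def judge_agreement(judge_scores, human_scores, tolerance=1):
    """Return (exact_rate, within_tolerance_rate) for paired 1-5 scores."""
    if len(judge_scores) != len(human_scores) or not judge_scores:
        raise ValueError("need two non-empty lists of equal length")
    pairs = list(zip(judge_scores, human_scores))
    exact = sum(j == h for j, h in pairs) / len(pairs)
    close = sum(abs(j - h) <= tolerance for j, h in pairs) / len(pairs)
    return exact, close

def mean_bias(judge_scores, human_scores):
    """Average of (judge - human); positive means the judge scores higher than humans."""
    if len(judge_scores) != len(human_scores) or not judge_scores:
        raise ValueError("need two non-empty lists of equal length")
    return sum(j - h for j, h in zip(judge_scores, human_scores)) / len(judge_scores)
```

A consistently positive or negative bias suggests the rubric or the judge prompt needs tightening before the scores are used to compare models.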
Track pass rate, not vibes
Run the whole suite, aggregate, and store the numbers. The metric that matters is pass rate on your dataset across versions, not a single impressive demo.
# grade_case(model, case) is assumed to return a dict with a boolean "passed" key.
results = [grade_case(model, c) for c in dataset]
pass_rate = sum(r["passed"] for r in results) / len(results)
print(f"{model}: {pass_rate:.1%} passed")
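For completeness, here is one possible shape for the `grade_case` helper and a small suite runner. Everything in it is a sketch under assumptions: `call_model` is a placeholder that returns canned text so the example runs offline, the case fields (`id`, `input`, `expected`) are illustrative, and grading is plain exact match.

```python
def call_model(model, prompt):
    """Placeholder: swap in a real API call. Returns canned text so the sketch runs offline."""
    canned = {"What is 2 + 2?": "4", "Capital of France?": "Lyon"}
    return canned.get(prompt, "")

def grade_case(model, case):
    """Run one case and record whether the output matches the expected value exactly."""
    output = call_model(model, case["input"])
    return {"id": case["id"], "output": output, "passed": output.strip() == case["expected"]}

def run_suite(model, dataset):
    """Score every case and return (pass_rate, failed_ids)."""
    if not dataset:
        raise ValueError("dataset is empty")
    results = [grade_case(model, c) for c in dataset]
    pass_rate = sum(r["passed"] for r in results) / len(results)
    failed = [r["id"] for r in results if not r["passed"]]
    return pass_rate, failed

dataset = [
    {"id": "math-1", "input": "What is 2 + 2?", "expected": "4"},
    {"id": "geo-1", "input": "Capital of France?", "expected": "Paris"},
]
pass_rate, failed = run_suite("example/model-name", dataset)
print(f"{pass_rate:.1%} passed, failures: {failed}")
```

Returning the failed case IDs alongside the rate makes it easier to see which examples regressed between versions, not just that the number moved.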
Because Model Database exposes many models behind one API, you can score the same dataset across openai/gpt-4o, anthropic/claude-sonnet-4-6, and google/gemini-2.0-flash by changing one string, then pick on evidence rather than reputation.
Watch cost and latency too
Quality is one axis. Log token usage and response time per case so you can weigh a small accuracy gain against a large cost or latency increase. The best model for a task is frequently the cheapest one that clears your quality bar.
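A rough sketch of that trade-off: time each call, then pick the cheapest candidate that clears the bar. The candidate numbers below are made up for illustration, and `cost_per_1k_cases` is a hypothetical field you would compute from logged token usage and your provider's prices. With the OpenAI Python client, token counts are typically exposed on the response's `usage` attribute, but confirm the exact fields in the docs.

```python
import time

def timed_call(fn, *args, **kwargs):
    """Call fn and return (result, elapsed_seconds) using a monotonic clock."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

def cheapest_passing(candidates, quality_bar):
    """Pick the lowest-cost candidate whose pass rate clears the bar, or None."""
    passing = [c for c in candidates if c["pass_rate"] >= quality_bar]
    return min(passing, key=lambda c: c["cost_per_1k_cases"], default=None)

# Illustrative numbers only, not measurements.
candidates = [
    {"model": "model-a", "pass_rate": 0.94, "cost_per_1k_cases": 12.0},
    {"model": "model-b", "pass_rate": 0.91, "cost_per_1k_cases": 1.5},
    {"model": "model-c", "pass_rate": 0.82, "cost_per_1k_cases": 0.4},
]
choice = cheapest_passing(candidates, quality_bar=0.90)
```

Here `model-b` is selected: it is slightly less accurate than `model-a` but clears the bar at a fraction of the cost, while `model-c` is cheaper still but falls below it.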
Honest limitations
No offline eval perfectly predicts production. Outputs are non-deterministic, so run multiple samples per case for stability. Keep the dataset fresh by feeding in real failures from production, and never let your evaluation set leak into prompts, or you'll measure memorization instead of capability.
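Sampling each case several times can be sketched as follows. `run_once` stands for any zero-argument function that calls the model and returns pass or fail; the defaults of five samples and an 80% threshold are assumptions to adjust for your budget.

```python
import random

def sample_pass_rate(run_once, n=5):
    """Run a pass/fail check n times and return the fraction that passed."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return sum(bool(run_once()) for _ in range(n)) / n

def is_stable_pass(run_once, n=5, min_rate=0.8):
    """Treat a case as passing only if it passes in at least min_rate of samples."""
    return sample_pass_rate(run_once, n) >= min_rate

# Stand-in for a real model call plus grading: passes roughly 70% of the time.
rng = random.Random(0)

def flaky_case():
    return rng.random() < 0.7

rate = sample_pass_rate(flaky_case, n=10)
```

Reporting the per-case rate, rather than a single pass or fail, also shows which examples sit on the edge and are likely to flip between runs.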
Set up evaluation across models with one API key from your dashboard, and see the model list and usage fields in the docs.