
How to Evaluate LLM Outputs

Devon Pratt · Mar 26, 2026 · 4 min read

"It looks good" is not an evaluation. If you ship LLM features, you need a repeatable way to measure quality so you can tell whether a prompt tweak, a model swap, or a temperature change actually helped. This article lays out practical evaluation methods you can run today against the Model Database API.

Start with a labeled dataset

Collect 50–200 real inputs that represent your traffic, including the awkward edge cases. For each, record what a good answer looks like: an exact expected value, an acceptable range, or a rubric. This fixed set is your regression suite; every change gets scored against the same examples so results are comparable.
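One way to make those three kinds of expectation concrete is a small case schema. The sketch below is an assumed layout, not something the Model Database API requires: the `EvalCase` fields and the `load_cases` and `check_expectation` helpers are hypothetical names chosen for illustration.

```python
import json
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class EvalCase:
    """One labeled example; typically only one kind of expectation is set."""
    case_id: str
    input: str
    expected: Optional[str] = None                        # exact expected value
    accepted_range: Optional[Tuple[float, float]] = None  # inclusive numeric range
    rubric: Optional[str] = None                          # free-text criteria for a judge


def load_cases(jsonl_text: str) -> list:
    """Parse one JSON object per line into EvalCase records."""
    cases = []
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines between records
        raw = json.loads(line)
        if raw.get("accepted_range") is not None:
            raw["accepted_range"] = tuple(raw["accepted_range"])
        cases.append(EvalCase(**raw))
    return cases


def check_expectation(case: EvalCase, output: str) -> Optional[bool]:
    """Return True/False for exact and range cases, None when a rubric needs a judge."""
    if case.expected is not None:
        return output.strip() == case.expected
    if case.accepted_range is not None:
        low, high = case.accepted_range
        try:
            return low <= float(output) <= high
        except ValueError:
            return False  # non-numeric output cannot fall inside the range
    return None
```

Storing the cases as JSON Lines keeps the suite easy to diff and append to. Rubric-only cases return `None` here so they can be routed to a judge instead of being silently counted as passes.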

Choose the right scoring method

Deterministic checks first

Many failures, such as malformed JSON, missing fields, or overlong output, can be caught without a model. Run cheap assertions before anything fancy.

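import json

def grade(output, case):
    try:
        data = json.loads(output)  # parse once and reuse for the field check
    except ValueError:
        data = None  # note: a bare JSON null is also treated as invalid here
    checks = {
        "is_json": data is not None,
        # Check keys on the parsed object, not substrings of the raw text.
        "has_fields": isinstance(data, dict) and all(k in data for k in case["required"]),
        "within_len": len(output) <= case["max_len"],
    }
    return checks, all(checks.values())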

LLM-as-judge, done carefully

For quality dimensions like helpfulness or faithfulness, a judge model scales better than human review. Give it a narrow rubric and force a structured verdict.

from openai import OpenAI
client = OpenAI(base_url="https://modeldatabase.com/v1", api_key="mdb_live_...")

JUDGE = """Score the answer 1-5 for faithfulness to the context.
5 = every claim supported; 1 = mostly unsupported.
Return JSON: {"score": int, "reason": string}."""

def judge(context, answer):
    r = client.chat.completions.create(
        model="anthropic/claude-opus-4-8",
        temperature=0,  # keep verdicts as repeatable as possible
        response_format={"type": "json_object"},  # request a JSON verdict
        messages=[
            {"role": "system", "content": JUDGE},
            {"role": "user", "content": f"Context:\n{context}\n\nAnswer:\n{answer}"},
        ],
    )
    # Returns the verdict as a JSON string; parse it before aggregating scores.
    return r.choices[0].message.content

Use a different (ideally stronger) model as the judge than the one under test, keep temperature at 0, and validate the judge itself against a handful of human-labeled cases. Judges have biases: they tend to favor longer answers and their own style, so don't treat the score as ground truth.
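Validating the judge can be as simple as comparing its scores with human scores on the same cases. The following is a minimal sketch under the assumption that both sets of scores are integers on the 1-5 scale; `judge_agreement` and its `tolerance` parameter are hypothetical names, and the three statistics are one reasonable choice rather than a standard.

```python
def judge_agreement(human_scores, judge_scores, tolerance=1):
    """Compare judge scores with human scores on the same cases.

    Returns the exact-match rate, the rate within `tolerance` points, and the
    mean signed bias (positive means the judge scores higher than humans).
    """
    if len(human_scores) != len(judge_scores) or not human_scores:
        raise ValueError("need two non-empty score lists of equal length")
    pairs = list(zip(human_scores, judge_scores))
    n = len(pairs)
    return {
        "exact": sum(h == j for h, j in pairs) / n,
        "within_tolerance": sum(abs(h - j) <= tolerance for h, j in pairs) / n,
        "bias": sum(j - h for h, j in pairs) / n,
    }
```

A low exact-match rate with a high within-tolerance rate suggests the judge is usable for ranking but not for fine-grained thresholds. A consistently positive bias is a hint that the rubric is too lenient or that the judge is rewarding length.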

Track pass rate, not vibes

Run the whole suite, aggregate, and store the numbers. The metric that matters is pass rate on your dataset across versions, not a single impressive demo.

results = [grade_case(model, c) for c in dataset]
pass_rate = sum(r["passed"] for r in results) / len(results)
print(f"{model}: {pass_rate:.1%} passed")

Because Model Database exposes many models behind one API, you can score the same dataset across openai/gpt-4o, anthropic/claude-sonnet-4-6, and google/gemini-2.0-flash by changing one string, then pick on evidence rather than reputation.
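A cross-model comparison might be sketched as below. The `generate` and `passes` callables are hypothetical placeholders: `generate` would wrap your own API call and `passes` your own grading logic, neither of which is defined by this article.

```python
def pass_rate_by_model(models, dataset, generate, passes):
    """Score every model on the same cases and return {model: pass_rate}.

    generate(model, case) -> output text   (hypothetical; wrap your API call here)
    passes(output, case)  -> bool          (hypothetical; e.g. deterministic checks)
    """
    rates = {}
    for model in models:
        # Every model sees exactly the same cases, so the rates are comparable.
        passed = sum(bool(passes(generate(model, case), case)) for case in dataset)
        rates[model] = passed / len(dataset) if dataset else 0.0
    return rates
```

Passing the model IDs mentioned above as `models` yields one number per model from identical inputs, which is what makes the comparison fair.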

Watch cost and latency too

Quality is one axis. Log token usage and response time per case so you can weigh a small accuracy gain against a large cost or latency increase. The best model for a task is frequently the cheapest one that clears your quality bar.
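To illustrate the trade-off, here is a rough sketch that times a call and then picks the cheapest model that clears a quality bar. `timed_call` and `cheapest_passing` are hypothetical helpers, and the summary keys (`pass_rate`, `cost_per_case`) are assumed names; how you obtain token counts and prices depends on the usage fields described in the docs.

```python
import time


def timed_call(fn, *args, **kwargs):
    """Run fn and return (result, elapsed_seconds) using a monotonic clock."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def cheapest_passing(summaries, quality_bar):
    """Pick the lowest-cost model whose pass rate meets the bar.

    summaries: list of dicts with "model", "pass_rate", and "cost_per_case"
    (assumed keys). Returns None when no model clears the bar.
    """
    eligible = [s for s in summaries if s["pass_rate"] >= quality_bar]
    if not eligible:
        return None
    return min(eligible, key=lambda s: s["cost_per_case"])
```

Wrapping each API call in `timed_call` gives a per-case latency to store alongside the pass/fail result, and `cheapest_passing` turns the aggregated numbers into a decision rule you can rerun after every change.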

Honest limitations

No offline eval perfectly predicts production. Outputs are non-deterministic, so run multiple samples per case for stability. Keep the dataset fresh by feeding in real failures from production, and never let your evaluation set leak into prompts, or you'll measure memorization instead of capability.
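Sampling each case several times can be sketched as follows. `sample_pass_rate` and `is_stable_pass` are hypothetical helpers, `generate` and `passes` stand in for your own call and grading functions, and the defaults of 5 samples and a 0.8 threshold are arbitrary assumptions to tune for your task.

```python
def sample_pass_rate(generate, passes, case, n=5):
    """Generate n outputs for one case and return the fraction that pass.

    generate(case) -> output text      (hypothetical; wrap your API call here)
    passes(output, case) -> bool       (hypothetical; your grading function)
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    outcomes = [bool(passes(generate(case), case)) for _ in range(n)]
    return sum(outcomes) / n


def is_stable_pass(rate, threshold=0.8):
    """Treat a case as passing only if enough of its samples passed."""
    return rate >= threshold
```

Reporting the per-case fraction, rather than a single pass or fail, also surfaces flaky cases: a case that passes 3 times out of 5 is a different problem from one that always fails.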

Set up evaluation across models with one API key from your dashboard, and see the model list and usage fields in the docs.
