Testing and CI for LLM-Powered Features

LLM features break the assumptions behind normal testing: outputs are non-deterministic, "correct" is fuzzy, and the dependency lives behind a network call you don't control. That doesn't mean you can't test them. It means you test differently. This article lays out a practical testing and CI strategy for features built on the Model Database API.

Separate deterministic code from the model

Most of your codebase (prompt assembly, parsing, validation, retries) is plain deterministic logic. Test it with ordinary unit tests and a mocked client: no network, no flakiness, fast feedback.

def build_messages(user_text, context):
    return [
        {"role": "system", "content": "Answer from context only."},
        {"role": "user", "content": f"{context}\n\n{user_text}"},
    ]

def test_build_messages():
    msgs = build_messages("hi", "ctx")
    assert msgs[0]["role"] == "system"
    assert "ctx" in msgs[1]["content"]

Mock the API for these so they run in milliseconds and never spend credits.

from unittest.mock import MagicMock

def test_parse_handles_fenced_json():
    fake = MagicMock()
    fake.choices[0].message.content = '```json\n{"ok": true}\n```'
    assert parse_response(fake) == {"ok": True}
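
The test above calls parse_response, which the article never defines. One possible shape for it is sketched below, assuming the response object follows the choices[0].message.content layout used in the mock. The helper body and the fence-stripping regex are illustrative, not a prescribed implementation.

```python
import json
import re

# Three backticks, built programmatically so this snippet stays easy to embed.
FENCE = "`" * 3

# Matches an optional Markdown code fence (with or without a "json" tag)
# wrapped around the payload.
_FENCED = re.compile(rf"^{FENCE}(?:json)?\s*(.*?)\s*{FENCE}$", re.DOTALL)


def parse_response(response):
    """Return the JSON payload of a chat completion, tolerating a code fence."""
    text = response.choices[0].message.content.strip()
    match = _FENCED.match(text)
    if match:
        text = match.group(1)
    # Raises json.JSONDecodeError on malformed output, so callers can retry or fail.
    return json.loads(text)
```

Because the function only touches attributes on the object it receives, it can be exercised with a MagicMock or any small stand-in object, with no client installed.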

Test prompts against a fixed dataset

For the model itself, build a small labeled set of representative inputs with expected outcomes or rubrics, and score against it. These are evaluation tests, not pass/fail asserts on exact strings. Assert on properties: the label is valid, the JSON parses, the required field exists, the score clears a threshold.
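
The property-style assertions described above might look like the following sketch. It assumes the model was asked to reply with a JSON object containing a "label" field. The check_output name, the field name, and the allowed-label set (the two labels used in this article's sample cases) are all illustrative assumptions.

```python
import json

# Assumed label set for illustration; replace with your own taxonomy.
ALLOWED_LABELS = {"billing", "technical"}


def check_output(raw):
    """Return a list of property violations for one raw output (empty list = pass)."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return ["output is not valid JSON"]
    if not isinstance(data, dict):
        return ["output is not a JSON object"]

    problems = []
    if "label" not in data:
        problems.append("missing required field 'label'")
    elif not isinstance(data["label"], str) or data["label"] not in ALLOWED_LABELS:
        problems.append(f"unknown label {data['label']!r}")
    return problems
```

Returning a list of reasons rather than a bare boolean makes failed evaluation cases easier to triage when a run drops below its threshold.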

from openai import OpenAI
client = OpenAI(base_url="https://modeldatabase.com/v1", api_key="mdb_live_...")

CASES = [
    {"text": "my card was charged twice", "label": "billing"},
    {"text": "the app won't load",        "label": "technical"},
]

def test_classifier_pass_rate():
    passed = sum(classify(c["text"]).strip() == c["label"] for c in CASES)
    assert passed / len(CASES) >= 0.9   # threshold, not perfection

Use a threshold rather than demanding 100%, since outputs vary. If results are unstable from run to run, call the model several times per case and score the majority answer.
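
Running a case several times can be as simple as a majority vote. This is a minimal sketch: classify_stable is a hypothetical wrapper, and it takes the classify function as an argument so it can be tried with a stub.

```python
from collections import Counter


def classify_stable(classify, text, runs=3):
    """Call classify(text) `runs` times and return the most common label.

    Use an odd number of runs for two-label tasks to avoid ties; on a tie,
    Counter.most_common returns the label that was seen first.
    """
    votes = Counter(classify(text).strip() for _ in range(runs))
    label, _count = votes.most_common(1)[0]
    return label
```

In the pass-rate test, classify(c["text"]) would become classify_stable(classify, c["text"]). Note that this multiplies the number of live calls, and therefore the cost, by runs.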

Keep model calls out of the fast suite

Split your tests into tiers. Unit tests with mocks run on every commit. The live-model evaluation suite, which costs credits and takes longer, runs on a schedule or on demand, not on every push. This keeps PR feedback fast while still catching regressions.

import pytest

# Mark tests that call the real API so CI can select or skip them.
@pytest.mark.live
def test_live_classifier():
    ...

# CI on PRs:   pytest -m "not live"
# CI nightly:  pytest -m live
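
pytest warns about markers it does not recognize, and rejects them under --strict-markers, so it helps to register the live marker. One way, sketched here, is a pytest_configure hook in conftest.py. The description text is an arbitrary choice.

```python
# conftest.py
def pytest_configure(config):
    """Register the custom `live` marker used to tag tests that hit the real API."""
    config.addinivalue_line(
        "markers",
        "live: calls the real model API (slow, costs credits)",
    )
```

The same registration can instead go under the markers option in pytest.ini or pyproject.toml if you prefer configuration over code.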

Pin versions and watch for drift

Models change. Pin the exact model ID you test against (for example openai/gpt-4o) so results are comparable over time, and re-run your evaluation suite whenever you intentionally switch models. Because Model Database exposes many models behind one API, your nightly job can score the same dataset across several model IDs and alert you when a candidate outperforms your current choice or when your current choice regresses.
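
A nightly comparison across model IDs could be sketched as follows. It assumes an OpenAI-compatible client like the one configured earlier. The function names (make_classifier, pass_rate, compare_models) and the system prompt are hypothetical, and temperature=0 reduces but does not eliminate run-to-run variation.

```python
def make_classifier(client, model):
    """Build a one-argument classify(text) bound to a single pinned model ID."""
    def classify(text):
        resp = client.chat.completions.create(
            model=model,
            temperature=0,
            messages=[
                # Illustrative prompt; use whatever your feature actually sends.
                {"role": "system", "content": "Reply with one word: billing or technical."},
                {"role": "user", "content": text},
            ],
        )
        return resp.choices[0].message.content or ""
    return classify


def pass_rate(classify, cases):
    """Fraction of cases whose stripped output equals the expected label."""
    passed = sum(classify(c["text"]).strip() == c["label"] for c in cases)
    return passed / len(cases)


def compare_models(client, model_ids, cases):
    """Score the same cases against each model ID; returns {model_id: pass_rate}."""
    return {m: pass_rate(make_classifier(client, m), cases) for m in model_ids}
```

Calling compare_models(client, ["openai/gpt-4o", ...], CASES) with your candidate IDs yields one score per model, which the nightly job can compare against the score of the model you currently ship.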

Make CI fail on real regressions

Gate on pass rate: fail the live job if accuracy drops below a baseline.
Track cost and latency per run; flag large jumps.
Store results so you can see trends, not just the latest number.
Add new failures from production back into the dataset so the same bug can't return.
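
The first three points above can be combined into one small gate script. This is a sketch under assumptions: the run dictionary's keys ("pass_rate", "cost", "latency_s"), the JSON-lines history file, and the 1.5x jump threshold are all illustrative choices, not part of any API.

```python
import json
from pathlib import Path


def check_run(run, history_path, min_pass_rate=0.9, max_jump=1.5):
    """Compare one evaluation run with the previous one, record it, and
    return a list of failure reasons (empty list = gate passes)."""
    path = Path(history_path)
    lines = path.read_text().splitlines() if path.exists() else []
    previous = json.loads(lines[-1]) if lines else None

    problems = []
    if run["pass_rate"] < min_pass_rate:
        problems.append(
            f"pass rate {run['pass_rate']:.2f} is below baseline {min_pass_rate:.2f}"
        )
    if previous:
        for key in ("cost", "latency_s"):
            # Flag a metric that grew by more than max_jump times since the last run.
            if previous[key] > 0 and run[key] > previous[key] * max_jump:
                problems.append(f"{key} rose from {previous[key]} to {run[key]}")

    # Append after comparing, so the history keeps every run, including failures.
    with path.open("a") as f:
        f.write(json.dumps(run) + "\n")
    return problems
```

In the live CI job, exit with a non-zero status whenever the returned list is non-empty. The history file then doubles as the trend record.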

Honest limitations

Live tests cost money and can be flaky for reasons outside your code (network errors, rate limits, model updates), so isolate them and retry transient failures rather than failing the build on the first error. Offline scores don't perfectly predict production, so pair CI with real-world monitoring. And keep your test set out of your prompts, or you'll measure leakage instead of capability.
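
Retrying transient failures can be done with a small wrapper using exponential backoff. In this sketch, TransientError is a placeholder for whatever exceptions your client raises on timeouts or rate limits, and with_retries is a hypothetical helper name.

```python
import time


class TransientError(Exception):
    """Placeholder for timeout / rate-limit errors raised by your client."""


def with_retries(call, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Run call(); on TransientError wait base_delay * 2**n seconds and retry.

    The last failure is re-raised so a persistent outage still surfaces.
    """
    for attempt in range(attempts):
        try:
            return call()
        except TransientError:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
```

Only errors you have classified as transient are retried. A wrong answer from the model is not an exception, and should still count against the pass rate.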

Set up evaluation across models with one key from your dashboard, and find model IDs and usage fields in the docs.
