Cost & Scaling

Scaling to Millions of Requests

EFElena FischerJan 29, 20264 min read

Getting an LLM feature to work is one thing. Getting it to work at millions of requests a day, reliably and affordably, is a different engineering problem. At that scale, small inefficiencies multiply into large bills and small failure rates turn into constant fires.

This article walks through the architecture and habits that let you scale on Model Database without losing control of cost or reliability.

At scale, per-request waste is the enemy

A penny of overhead per call is nothing in a demo and a real budget at a million calls. Before scaling out, scale down the unit cost: trim prompts, cap max_tokens, route easy traffic to cheaper models like openai/gpt-4o-mini or google/gemini-2.0-flash, and cache repeated work. Every optimization you make to a single request is multiplied by your entire volume.

Separate real-time from background work

Not all requests need to be instant. Split your traffic into two lanes:

This separation lets you size each lane independently and prevents a batch job from starving live users.

Build on a queue and a worker pool

A durable queue in front of a pool of workers is the backbone of scale. It gives you buffering for spikes, controlled concurrency, natural retry points, and a place to enforce priority.

import asyncio, openai
client = openai.AsyncOpenAI(base_url="https://modeldatabase.com/v1",
                           api_key="mdb_live_...")
sem = asyncio.Semaphore(50)

async def worker(queue):
    while True:
        job = await queue.get()
        async with sem:
            try:
                r = await client.chat.completions.create(**job.payload)
                await job.complete(r)
            except Exception:
                await job.requeue_with_backoff()
        queue.task_done()

Scale throughput by adjusting the semaphore and the number of workers, not by rewriting logic.

Cache aggressively

At scale, duplicate and near-duplicate requests are guaranteed. A shared cache, keyed on the normalized prompt and model, removes a real fraction of calls entirely. Even a modest hit rate on millions of requests is a large saving, and cache reads are far faster than model calls, so latency improves too.

Instrument every call with the headers

You cannot manage spend at scale from a monthly invoice. Log X-MDB-Charged-USD on every response, tagged by feature and model, and aggregate it continuously.

resp = client.chat.completions.with_raw_response.create(...)
emit_metric("llm.cost", float(resp.headers["X-MDB-Charged-USD"]),
            tags={"feature": feature, "model": model})

Watch X-MDB-Balance-USD as a fleet-wide fuel gauge. Since a depleted balance returns HTTP 402, automate top-ups or alerts well before you reach zero so prepaid credit never becomes an outage.

Plan for failure as routine

At a million requests, a 0.1% failure rate is a thousand failures a day. That is normal, so make recovery automatic: retries with exponential backoff and jitter, idempotent jobs keyed by input so retries never double-charge, dead-letter handling for poison inputs, and graceful degradation when a model or balance is unavailable.

Let the cost cap protect the fleet

One malformed prompt replicated across thousands of workers can do real damage. The per-request cost cap on Model Database stops any single call from running away, which at scale is a critical safety rail rather than a nicety. Combine it with input validation so anomalies are blocked early and cheaply.

Scale in steps and measure

Do not jump from a thousand to a million requests overnight. Increase load in stages, and at each step check three numbers: cost per request from the headers, error rate, and latency. If cost per request holds steady as volume grows, your architecture is sound. If it creeps up, find the inefficiency before you scale further.

Ready to grow? Top up credit and monitor your fleet on your dashboard, and compare model rates on the pricing page.

← All articles Get your API key →