When you need to process thousands of items through an LLM, the difference between a job that finishes in minutes and one that crawls for hours is how you handle batching and concurrency. Done well, you maximize throughput without tripping rate limits or wasting money on retries.
This article covers practical patterns for pushing volume through Model Database efficiently.
Batching versus concurrency
These two terms get conflated, so let us separate them:
- Batching means grouping multiple work items together. Sometimes that means several items in one prompt; sometimes it means collecting a queue of requests to dispatch as a unit.
- Concurrency means sending multiple requests in flight at the same time rather than waiting for each to return before starting the next.
Concurrency is where most of your throughput gains come from, because LLM calls are dominated by waiting on the network and the model, not on your CPU.
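To make the latency argument concrete, here is a minimal sketch that compares the two approaches with a simulated call. `fake_llm_call` is a hypothetical stand-in that only sleeps; it is not a real API call, and the 50 ms latency is an arbitrary assumption.

```python
import asyncio
import time

async def fake_llm_call(item, latency=0.05):
    # Hypothetical stand-in for a network-bound model call: it waits, it does no CPU work.
    await asyncio.sleep(latency)
    return item.upper()

async def run_sequential(items):
    # One request at a time: total time is roughly len(items) * latency.
    return [await fake_llm_call(i) for i in items]

async def run_concurrent(items):
    # All requests in flight together: total time is roughly one latency.
    return await asyncio.gather(*(fake_llm_call(i) for i in items))

def timed(coro_fn, items):
    # Run one of the coroutines above and return (results, elapsed seconds).
    start = time.perf_counter()
    results = asyncio.run(coro_fn(items))
    return results, time.perf_counter() - start

if __name__ == "__main__":
    items = [f"item {n}" for n in range(10)]
    _, seq_seconds = timed(run_sequential, items)
    _, con_seconds = timed(run_concurrent, items)
    print(f"sequential: {seq_seconds:.2f}s, concurrent: {con_seconds:.2f}s")
```

Both versions return the same results in the same order; only the wall-clock time differs, because the waiting overlaps.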
Run requests concurrently
Firing requests one at a time leaves your throughput bound by latency. With async concurrency you keep many calls in flight at once.
import asyncio
import openai

client = openai.AsyncOpenAI(base_url="https://modeldatabase.com/v1",
                            api_key="mdb_live_...")
sem = asyncio.Semaphore(10)  # cap in-flight requests

async def worker(item):
    async with sem:
        r = await client.chat.completions.create(
            model="openai/gpt-4o-mini",
            max_tokens=200,
            messages=[{"role": "user", "content": item}])
    return r.choices[0].message.content

async def run(items):
    return await asyncio.gather(*(worker(i) for i in items))
The semaphore is the important part. It caps how many requests run at once so you stay within rate limits instead of flooding the API and triggering errors.
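If you want to convince yourself the cap actually holds, a small sketch like the following tracks the peak number of simulated requests in flight. The `run_capped` helper and its counters are illustrative assumptions, and the sleep stands in for a real request.

```python
import asyncio

async def run_capped(items, limit):
    # Create the semaphore inside the running event loop.
    sem = asyncio.Semaphore(limit)
    in_flight = 0
    peak = 0

    async def worker(item):
        nonlocal in_flight, peak
        async with sem:
            in_flight += 1
            peak = max(peak, in_flight)  # record the highest concurrency seen
            await asyncio.sleep(0.01)    # simulated request latency
            in_flight -= 1
            return item

    results = await asyncio.gather(*(worker(i) for i in items))
    return results, peak

if __name__ == "__main__":
    results, peak = asyncio.run(run_capped(list(range(50)), limit=5))
    print(f"processed {len(results)} items, peak in flight: {peak}")
```

All 50 workers are created up front, but only `limit` of them are ever past the `async with sem` line at once; the rest wait their turn.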
Pack multiple items into one prompt, carefully
For very short items you can sometimes process several in a single call by asking for a structured list back. This amortizes the system prompt across many items.
{
  "model": "google/gemini-2.0-flash",
  "messages": [{"role": "user", "content":
    "Classify each line as spam or ham. Return JSON array.\n1. ...\n2. ...\n3. ..."}]
}
Use this judiciously. Oversized batches risk truncated output, harder error recovery (one bad item can spoil the whole batch), and bumping into the per-request cost cap. For most workloads, modest concurrency of single-item calls is simpler and more robust.
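If you do pack items together, it helps to treat the reply as untrusted until it has the right shape. The sketch below is one way to do that, assuming the model is asked for a JSON array of labels in input order. The helper names and the exact prompt wording are illustrative, not part of any API.

```python
import json

def build_batch_prompt(items):
    # Number the items so each label can be matched back to its input by position.
    lines = [f"{n}. {text}" for n, text in enumerate(items, start=1)]
    header = (
        "Classify each line as spam or ham. "
        f"Return a JSON array of exactly {len(items)} strings.\n"
    )
    return header + "\n".join(lines)

def parse_batch_reply(reply, items, allowed=("spam", "ham")):
    # Return a list of (item, label) pairs, or None if the reply cannot be trusted.
    try:
        labels = json.loads(reply)
    except json.JSONDecodeError:
        return None  # truncated or non-JSON output
    if not isinstance(labels, list) or len(labels) != len(items):
        return None  # a label was dropped or added, so positions no longer line up
    if any(label not in allowed for label in labels):
        return None  # unexpected label value
    return list(zip(items, labels))
```

When `parse_batch_reply` returns `None`, a reasonable fallback is to resend those items as single-item calls rather than retrying the whole batch.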
Tune the concurrency level with the headers
There is an ideal number of in-flight requests for your account and workload. Find it empirically: raise the semaphore limit until throughput stops improving or errors appear, then back off. The charge headers let you confirm you are not paying more per item as you scale up.
X-MDB-Charged-USD: 0.0003
X-MDB-Balance-USD: 92.41
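The tuning loop can be sketched as two small helpers. This assumes you have already collected the response headers for each request (how you get them depends on your client; the OpenAI Python SDK, for example, documents a `with_raw_response` accessor) and that you have run short trials at a few semaphore limits. The function names, the trial tuple layout, and the 5% gain threshold are all assumptions for illustration.

```python
def cost_per_item(header_sets):
    # header_sets: one dict of response headers per completed single-item request.
    # Header names are case-insensitive, so normalize before looking up the charge.
    charges = []
    for headers in header_sets:
        lowered = {k.lower(): v for k, v in headers.items()}
        charges.append(float(lowered["x-mdb-charged-usd"]))
    return sum(charges) / len(charges)

def pick_concurrency(trials, min_gain=0.05):
    # trials: (limit, items_per_second, error_count) tuples from short test runs.
    # Returns the highest limit that was error-free and still improved throughput.
    best_limit, best_rate = None, 0.0
    for limit, rate, errors in sorted(trials):
        if errors > 0:
            break  # errors appeared: stop raising the limit
        if best_limit is not None and rate < best_rate * (1 + min_gain):
            break  # throughput has plateaued
        best_limit, best_rate = limit, rate
    return best_limit
```

Comparing `cost_per_item` across trials is a cheap check that a higher limit is not quietly raising the average charge, for example through extra retries.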
Make retries cheap and safe
At volume, occasional failures are normal. Wrap each call in retry logic with exponential backoff and jitter so a transient error does not become a thundering herd.
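import asyncio
import random

async def with_retry(coro_fn, tries=4):
    # coro_fn is a zero-argument callable that returns a fresh coroutine per attempt
    for n in range(tries):
        try:
            return await coro_fn()
        except Exception:
            if n == tries - 1:
                raise  # out of attempts: surface the last error
            # exponential backoff (1s, 2s, 4s, ...) plus up to 1s of jitter
            await asyncio.sleep((2 ** n) + random.random())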
Make your work items idempotent and key your results by input, so a retry never double-charges you for output you already have.
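One way to key results by input is to hash everything that affects the output and check a cache before calling. The following is a minimal sketch: the in-memory `dict` cache, the `call_fn` signature, and the choice of fields in the key are assumptions, and a real job would likely persist the cache to disk or a database.

```python
import asyncio
import hashlib
import json

def cache_key(model, prompt, max_tokens):
    # Hash every input that changes the output, in a stable order.
    payload = json.dumps(
        {"model": model, "prompt": prompt, "max_tokens": max_tokens},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

async def cached_call(cache, call_fn, model, prompt, max_tokens=200):
    # call_fn is a hypothetical async function (model, prompt, max_tokens) -> str.
    key = cache_key(model, prompt, max_tokens)
    if key in cache:
        return cache[key]  # already paid for: reuse the stored output
    result = await call_fn(model, prompt, max_tokens)
    cache[key] = result    # store only after the call succeeded
    return result
```

With this in place, rerunning a partially failed job only pays for the items that never produced a result. Note that two identical inputs dispatched at the same moment can both miss the cache, so deduplicate the input list first if that matters.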
Mind the cost of parallel failure
Concurrency multiplies mistakes. If a bug sends a malformed prompt, ten parallel workers send it ten times. Validate inputs before dispatch, keep max_tokens tight, and rely on the per-request cost cap as a backstop so no single call in the batch can run away.
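Input validation before dispatch can be as simple as the sketch below. The specific checks and the `max_chars` threshold are assumptions; adapt them to whatever "malformed" means for your workload.

```python
def validate_items(items, max_chars=4000):
    # Split items into ones safe to dispatch and (index, reason) pairs for the rest.
    valid, rejected = [], []
    for index, item in enumerate(items):
        if not isinstance(item, str):
            rejected.append((index, "not a string"))
        elif not item.strip():
            rejected.append((index, "empty"))
        elif len(item) > max_chars:
            rejected.append((index, "too long"))
        else:
            valid.append(item)
    return valid, rejected
```

Running this once over the whole input list costs nothing and keeps a bad record from being sent by every worker that picks it up on retry.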
A simple recipe
For most batch jobs: use single-item calls, a semaphore around 5 to 20 in flight, retries with backoff, a result cache keyed on input, and a tight max_tokens. That combination gets you high throughput, predictable cost, and clean recovery without much code.
Before you spin up a batch job, compare model rates on the pricing page. Then watch throughput and spend together on your dashboard as it runs.