Long-context models let you feed an entire book, codebase, or document set into a single prompt. That capability unlocks powerful workflows, but it also introduces new failure modes around cost, latency, and how reliably a model actually uses everything you give it. Working effectively with long context is as much about discipline as it is about window size.
This guide covers when long context helps, its pitfalls, and how to use it well on Model Database.
What long context buys you
A large context window means you can include more source material directly in the prompt instead of building complex retrieval pipelines. Common uses include analyzing a long contract, answering questions over a full set of meeting transcripts, reviewing a large code module, or maintaining a long conversation history. When all the relevant information fits in the window, the model can reason over it holistically.
The pitfalls of stuffing the window
Bigger is not automatically better. Be aware of three issues:
- Cost scales with input. You pay for every token you send. A huge prompt repeated across many requests gets expensive fast.
- Latency grows. More input generally means slower responses, which hurts interactive use.
- Relevance dilution. Burying the key fact in a mountain of irrelevant text can make the model more likely to overlook it. Curated, relevant context often beats raw volume.
The lesson: use long context because the task needs it, not because the window is available.
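To make the cost point concrete, here is a rough back-of-the-envelope sketch. It assumes about four characters per token, which is only a common heuristic for English text, and it takes the input price as a parameter because actual prices vary by model. The function names and the price in the usage comment are illustrative, not part of any API.

```python
import math


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate. Real tokenizers vary by model and language."""
    return math.ceil(len(text) / chars_per_token)


def estimate_input_cost(text: str, requests: int, usd_per_million_input_tokens: float) -> float:
    """Estimated input spend for sending the same text `requests` times."""
    tokens = estimate_tokens(text)
    return tokens * requests * usd_per_million_input_tokens / 1_000_000


# Usage (the price is a placeholder; look up the real rate for your model):
# document = open("contract.txt", encoding="utf-8").read()
# print(estimate_input_cost(document, requests=500, usd_per_million_input_tokens=3.0))
```

Running the numbers before a batch job makes it obvious when a large prompt repeated hundreds of times should be trimmed or cached instead.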
Choosing a long-context model
Different models offer different maximum context lengths, so check each model's limit before committing. Strong general models like anthropic/claude-sonnet-4-6, anthropic/claude-opus-4-8, and google/gemini-2.0-flash are common choices for long-input work, but always confirm the current context limit for the specific model. You can list available models programmatically:
curl https://modeldatabase.com/v1/models \
  -H "Authorization: Bearer mdb_live_..."
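The same listing can be done from Python. This is a minimal sketch that assumes the /v1/models endpoint follows the OpenAI-compatible shape, so that the SDK's `client.models.list()` works against it. The helper name `list_model_ids` and the prefix filter are illustrative.

```python
def list_model_ids(client, provider_prefix: str = "") -> list:
    """Return sorted model ids, optionally filtered by a prefix such as "anthropic/".

    `client` is expected to be an OpenAI-SDK-style client whose
    `models.list()` yields objects with an `id` attribute.
    """
    ids = (model.id for model in client.models.list())
    return sorted(i for i in ids if i.startswith(provider_prefix))


# Usage, assuming the endpoint is OpenAI-compatible:
# from openai import OpenAI
# client = OpenAI(base_url="https://modeldatabase.com/v1", api_key="mdb_live_...")
# print(list_model_ids(client, "anthropic/"))
```

The listing only tells you which models exist. Context limits still need to be confirmed per model.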
A long-context request
Sending a large document is the same chat completion call, just with a big user message:
from openai import OpenAI

# Point the OpenAI SDK at the Model Database endpoint.
client = OpenAI(base_url="https://modeldatabase.com/v1", api_key="mdb_live_...")

# Read the whole document; the context manager closes the file afterwards.
with open("contract.txt", encoding="utf-8") as f:
    document = f.read()

resp = client.chat.completions.create(
    model="anthropic/claude-sonnet-4-6",
    messages=[
        {"role": "system", "content": "Answer only from the provided document. If unknown, say so."},
        {"role": "user", "content": f"Document:\n{document}\n\nQuestion: What is the termination notice period?"},
    ],
)
print(resp.choices[0].message.content)
Grounding the model with a system instruction to answer only from the document helps reduce hallucination on long inputs.
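If you ask several questions about the same document, it can help to build the grounded messages in one place so the instruction and layout stay consistent. The sketch below reuses the prompt layout from the request above. The helper name `build_grounded_messages` is hypothetical.

```python
GROUNDING_INSTRUCTION = "Answer only from the provided document. If unknown, say so."


def build_grounded_messages(document: str, question: str) -> list:
    """Build chat messages that put the document first and the question last."""
    return [
        {"role": "system", "content": GROUNDING_INSTRUCTION},
        {"role": "user", "content": f"Document:\n{document}\n\nQuestion: {question}"},
    ]


# Usage with an existing client:
# messages = build_grounded_messages(document, "What is the termination notice period?")
# resp = client.chat.completions.create(model="anthropic/claude-sonnet-4-6", messages=messages)
```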
Long context vs retrieval
Long context and retrieval-augmented generation (RAG) solve overlapping problems. As a rule of thumb:
- If the relevant material is bounded and you'll reference all of it, long context is simpler: just include it.
- If you are querying a large, growing corpus where only a few passages matter per question, retrieval is cheaper and often more accurate, because you send less and keep the signal high.
Many production systems combine both: retrieve the most relevant chunks, then pass a generous amount of that curated context to a capable model.
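The combined approach can be sketched in a few lines. This example splits a document into paragraph-based chunks and ranks them by simple keyword overlap with the question. Keyword overlap is a stand-in chosen to keep the example self-contained; a production system would more likely use embeddings or a search index. All function names here are illustrative.

```python
import re


def split_into_chunks(text: str, max_chars: int = 1000) -> list:
    """Split on blank lines, packing paragraphs into chunks of up to max_chars."""
    chunks, current = [], ""
    for para in re.split(r"\n\s*\n", text):
        para = para.strip()
        if not para:
            continue
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks


def words(text: str) -> set:
    """Lowercased set of alphanumeric words in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def top_chunks(chunks: list, question: str, k: int = 3) -> list:
    """Return the k chunks sharing the most words with the question."""
    q = words(question)
    ranked = sorted(chunks, key=lambda c: len(q & words(c)), reverse=True)
    return ranked[:k]


# Usage: send only the curated context to the model.
# context = "\n\n".join(top_chunks(split_into_chunks(document), question, k=5))
```

Raising `k` trades cost for recall, which is where a large context window still helps even in a retrieval setup.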
Keep cost under control
Long-context calls are where costs can surprise you, so measure deliberately. Every billable response returns the X-MDB-Charged-USD and X-MDB-Balance-USD headers, letting you see exactly what a large prompt costs. A few habits help:
- Trim boilerplate and irrelevant sections before sending.
- Cache or reuse results when the same document is queried repeatedly.
- Use a cheaper model for the large bulk-reading pass and a stronger one only for the final synthesis.
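Two of these habits are easy to sketch: reading the cost headers and caching repeated questions. The header names come from the text above. The assumption that their values are plain decimal strings, the helper names, and the `ask` callable (any function that sends the request and returns the answer text) are illustrative.

```python
import hashlib
from decimal import Decimal, InvalidOperation
from typing import Callable, Mapping, Optional


def read_cost_headers(headers: Mapping[str, str]) -> dict:
    """Extract charge and balance from response headers, or None if absent or unparsable."""

    def to_decimal(name: str) -> Optional[Decimal]:
        raw = headers.get(name)
        if raw is None:
            return None
        try:
            return Decimal(raw)  # assumes a plain decimal string such as "0.0123"
        except InvalidOperation:
            return None

    return {
        "charged_usd": to_decimal("X-MDB-Charged-USD"),
        "balance_usd": to_decimal("X-MDB-Balance-USD"),
    }


_answer_cache: dict = {}


def cached_answer(model: str, document: str, question: str,
                  ask: Callable[[str, str, str], str]) -> str:
    """Call `ask` only the first time a (model, document, question) triple is seen."""
    key = hashlib.sha256("\x00".join((model, document, question)).encode("utf-8")).hexdigest()
    if key not in _answer_cache:
        _answer_cache[key] = ask(model, document, question)
    return _answer_cache[key]
```

With the OpenAI Python SDK, response headers are available through `client.chat.completions.with_raw_response.create(...)`, whose result exposes `.headers` and a `.parse()` method that returns the usual completion object. The in-memory dictionary is only for illustration; a shared service would want a persistent cache.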
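For user-facing long-context work, enable streaming with "stream": true so users see tokens as soon as they are generated rather than waiting for the full response. Streaming does not shorten the time the model spends reading a long prompt, but it removes the wait for the complete answer.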
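In the Python SDK the equivalent setting is `stream=True`. The generator below is a minimal sketch that assumes an OpenAI-SDK-style client and streaming chunks in the usual chat-completions shape. The function name `stream_answer` is illustrative.

```python
def stream_answer(client, model: str, messages: list):
    """Yield text pieces from a streamed chat completion as they arrive."""
    stream = client.chat.completions.create(model=model, messages=messages, stream=True)
    for chunk in stream:
        if not chunk.choices:
            continue  # some chunks carry no choices, for example usage-only chunks
        piece = chunk.choices[0].delta.content
        if piece:
            yield piece


# Usage with a real client:
# for piece in stream_answer(client, "anthropic/claude-sonnet-4-6", messages):
#     print(piece, end="", flush=True)
```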
Working with big documents or codebases? Get a key and add credit at your dashboard, check each model's context limits, and read the docs for streaming and the cost headers that keep long-context spend predictable.