Invoices, contracts, resumes, and forms hide structured data inside messy, inconsistent layouts. Traditional parsers break the moment a vendor changes a template. LLMs are far more robust because they read for meaning, not position. This article builds a reliable document extraction service on Model Database that turns raw text into validated JSON.
The trick to making extraction production-grade is not the prompt alone; it is pairing the model with a strict schema and validation so you never trust unverified output.
The extraction loop
A dependable pipeline has three steps: get text out of the document, ask the model to fill a known schema, then validate. If validation fails, you can retry with the errors fed back in.
- Text extraction: use a PDF or OCR library to get plain text. The model works on text, not pixels.
- Structured generation: prompt with the exact fields you need and require JSON output.
- Validation: enforce types and required fields before anything downstream sees the data.
Defining the schema first
Start from the data contract, not the prompt. A schema in code keeps the model honest and gives you a place to validate.
from typing import Optional

from pydantic import BaseModel, field_validator


class LineItem(BaseModel):
    description: str
    quantity: int
    unit_price: float


class Invoice(BaseModel):
    invoice_number: str
    vendor: str
    total: float
    currency: str
    items: list[LineItem]
    due_date: Optional[str] = None  # optional: the prompt asks for null when absent

    @field_validator("currency")
    @classmethod
    def iso_currency(cls, v: str) -> str:
        # Raise ValueError instead of using assert: asserts are skipped under python -O.
        if len(v) != 3:
            raise ValueError("currency must be a 3-letter ISO 4217 code")
        return v.upper()
The extraction call
Pass the schema description to the model and require JSON. For most documents a small, fast model is plenty; reserve a larger one for dense legal text.
import json

from openai import OpenAI

client = OpenAI(
    base_url="https://modeldatabase.com/v1",
    api_key="mdb_live_...",
)


def extract(text):
    resp = client.chat.completions.create(
        model="openai/gpt-4o-mini",
        messages=[
            # The system prompt names every key so the output lines up with the schema.
            {"role": "system", "content":
                "Extract invoice fields. Return JSON with keys: "
                "invoice_number, vendor, total, currency, items "
                "(description, quantity, unit_price), due_date. "
                "Use null if a field is absent. Do not guess."},
            {"role": "user", "content": text},
        ],
        response_format={"type": "json_object"},  # JSON mode: reply is a single JSON object
        temperature=0,  # keep output as repeatable as possible
    )
    # Returns a plain dict; schema validation happens in a later step.
    return json.loads(resp.choices[0].message.content)
The instruction "do not guess" matters. You want missing fields reported as null, not hallucinated, so validation can catch genuinely incomplete documents.
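To make incomplete documents visible before schema validation, you can list which required keys came back null or empty. This is a minimal sketch: `REQUIRED_FIELDS` and `missing_fields` are hypothetical names introduced here, and the field list simply mirrors the non-optional fields of the `Invoice` schema.

```python
# Hypothetical helper for illustration; not part of any library.
REQUIRED_FIELDS = ("invoice_number", "vendor", "total", "currency", "items")


def missing_fields(raw: dict, required=REQUIRED_FIELDS) -> list[str]:
    """Return the required keys that are absent, null, or empty in the model output."""
    missing = []
    for key in required:
        value = raw.get(key)
        # None, "" and [] count as missing; 0 is a legitimate numeric value.
        if value is None or value == "" or value == []:
            missing.append(key)
    return missing


sample = {
    "invoice_number": "INV-7",
    "vendor": None,
    "total": 0,
    "currency": "usd",
    "items": [],
}
print(missing_fields(sample))  # ['vendor', 'items']
```

Logging this list alongside the document gives reviewers a direct pointer to what the source text lacked.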
Validate, then retry on failure
Run the model output through your schema. If it fails, send the validation errors back to the model for one correction attempt. This self-healing step can recover outputs that a single pass would reject, such as a wrong type or a malformed currency code, without human involvement.
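import json

from pydantic import ValidationError


def extract_validated(text, attempts=2):
    last_error = None
    for _ in range(attempts):
        # On a retry, append the previous errors so the model can correct them.
        prompt = text if last_error is None else (
            f"{text}\n\nFix these errors: {last_error}")
        try:
            return Invoice.model_validate(extract(prompt))
        except (ValidationError, json.JSONDecodeError) as e:
            # Malformed JSON (for example a truncated reply) is retried like a schema error.
            last_error = str(e)
    raise ValueError(f"extraction failed: {last_error}")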
Numbers, dates, and totals
Extraction is most error-prone on arithmetic and dates. Two cheap safeguards go a long way:
- Recompute totals yourself from line items and compare to the extracted total. A mismatch flags the document for human review.
- Normalize dates with a parsing library after extraction, rather than trusting the model to output a canonical format.
def total_matches(inv: Invoice, tol=0.01):
    computed = sum(i.quantity * i.unit_price for i in inv.items)
    return abs(computed - inv.total) <= tol
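The second safeguard, date normalization, could look like the sketch below. It uses the standard library's `datetime.strptime` as a stand-in for a dedicated parsing library. The `DATE_FORMATS` tuple and the `normalize_date` name are assumptions for illustration, and slash dates are assumed to be day-first, which you would need to confirm per vendor.

```python
from datetime import datetime
from typing import Optional

# Assumed formats; extend this with whatever your documents actually contain.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%B %d, %Y", "%d %b %Y")


def normalize_date(value: Optional[str]) -> Optional[str]:
    """Return an ISO 8601 date string, or None if the input is absent or unparseable."""
    if not value:
        return None
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).date().isoformat()
        except ValueError:
            continue  # try the next known format
    return None


print(normalize_date("March 5, 2025"))  # 2025-03-05
print(normalize_date("05/03/2025"))     # 2025-03-05 (day-first assumption)
print(normalize_date("next Tuesday"))   # None
```

Returning `None` for an unparseable date keeps the failure explicit, so the document can be flagged for review instead of carrying a wrong due date.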
Scaling and cost
Documents arrive in bursts, so process them through a queue with concurrent workers. Keep each request to a single document to bound context size and cost. Log token usage per document; if your volume is high, test whether openai/gpt-4o-mini meets your accuracy bar before reaching for a larger model. Because Model Database bills prepaid per token, you can run a sample batch on two models and pick the cheapest one that passes validation often enough.
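One way to turn a sample batch into a model choice is to record pass counts and token totals per model, then pick the cheapest model that clears your pass-rate bar. The sketch below is illustrative only: `BatchResult`, `cheapest_passing`, the model names, the prices, and all numbers are placeholders, not real rates or measurements. It also assumes you collected token counts per request yourself, for example from the `usage` field that the OpenAI SDK exposes on chat completion responses.

```python
from dataclasses import dataclass


@dataclass
class BatchResult:
    model: str
    documents: int             # documents in the sample batch
    passed: int                # documents that passed validation and the total check
    total_tokens: int          # prompt + completion tokens across the batch
    price_per_million: float   # placeholder price per million tokens

    @property
    def pass_rate(self) -> float:
        return self.passed / self.documents if self.documents else 0.0

    @property
    def cost_per_document(self) -> float:
        if not self.documents:
            return 0.0
        return self.total_tokens / 1_000_000 * self.price_per_million / self.documents


def cheapest_passing(results, min_pass_rate=0.95):
    """Return the lowest-cost result whose pass rate meets the bar, or None."""
    ok = [r for r in results if r.pass_rate >= min_pass_rate]
    return min(ok, key=lambda r: r.cost_per_document) if ok else None


# Made-up numbers purely to show the mechanics.
results = [
    BatchResult("model-small", documents=100, passed=96, total_tokens=200_000, price_per_million=1.0),
    BatchResult("model-large", documents=100, passed=99, total_tokens=200_000, price_per_million=10.0),
]
choice = cheapest_passing(results, min_pass_rate=0.95)
print(choice.model)  # model-small
```

Raising `min_pass_rate` to 0.98 in this toy data would select `model-large` instead, which makes the accuracy-versus-cost trade explicit.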
Route anything that fails validation or the total check to a human queue. Over time, that queue tells you exactly where to improve your prompt or schema.
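The worker pool and review routing can be sketched with the standard library alone. Everything named here is an assumption for illustration: `process_batch`, `failure_summary`, and the `process_one` callable, which stands in for a function that extracts, validates, and checks one document and raises on any failure. Threads are used because the work is dominated by waiting on API calls.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def process_batch(documents, process_one, max_workers=4):
    """Run process_one over {doc_id: text} concurrently and split the results.

    Returns (accepted, review): accepted holds (doc_id, result) pairs, review
    holds (doc_id, error_message) pairs destined for the human queue.
    """
    accepted, review = [], []

    def safe(item):
        doc_id, text = item
        try:
            return doc_id, process_one(text), None
        except Exception as exc:  # any failure routes the document to review
            return doc_id, None, f"{type(exc).__name__}: {exc}"

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in input order, one document per task.
        for doc_id, result, error in pool.map(safe, documents.items()):
            if error is None:
                accepted.append((doc_id, result))
            else:
                review.append((doc_id, error))
    return accepted, review


def failure_summary(review):
    """Count review-queue entries by exception type to show what fails most."""
    return Counter(error.split(":", 1)[0] for _, error in review)


# Stand-in for the real extract-and-validate step.
def fake_process(text):
    if "total" not in text:
        raise ValueError("missing total")
    return {"text": text}


docs = {"a": "total 10", "b": "no amount", "c": "total 5"}
accepted, review = process_batch(docs, fake_process)
print([doc_id for doc_id, _ in accepted])  # ['a', 'c']
print(review)                              # [('b', 'ValueError: missing total')]
print(failure_summary(review))             # Counter({'ValueError': 1})
```

Grouping the review queue by error type is a cheap first signal of whether the prompt, the schema, or the source documents need attention.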
Get a key and credit at your dashboard, and find JSON-mode details in the docs.