Extracting Structured Data From Documents

Invoices, contracts, resumes, and forms hide structured data inside messy, inconsistent layouts. Traditional parsers break the moment a vendor changes a template. LLMs are far more robust because they read for meaning, not position. This article builds a reliable document extraction service on Model Database that turns raw text into validated JSON.

The trick to making extraction production-grade is not the prompt alone; it is pairing the model with a strict schema and validation so you never trust unverified output.

The extraction loop

A dependable pipeline has three steps: get text out of the document, ask the model to fill a known schema, then validate. If validation fails, you can retry with the errors fed back in.

Text extraction: use a PDF or OCR library to get plain text. The model works on text, not pixels.
Structured generation: prompt with the exact fields you need and require JSON output.
Validation: enforce types and required fields before anything downstream sees the data.
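
As a rough sketch, the three steps compose like this. The stage functions (`to_text`, `fill_schema`, `validate`) are hypothetical placeholders passed in as arguments, not part of any library; they only show how the steps hand off to each other.

```python
from typing import Callable

def run_pipeline(
    document: bytes,
    to_text: Callable[[bytes], str],
    fill_schema: Callable[[str], dict],
    validate: Callable[[dict], list[str]],
    attempts: int = 2,
) -> dict:
    """Run text extraction, structured generation, and validation in order.

    All three stage functions are hypothetical stand-ins supplied by the caller.
    `validate` returns a list of error messages; an empty list means valid.
    """
    text = to_text(document)                      # step 1: text extraction
    prompt = text
    errors: list[str] = []
    for _ in range(attempts):
        data = fill_schema(prompt)                # step 2: structured generation
        errors = validate(data)                   # step 3: validation
        if not errors:
            return data
        # Feed the errors back in for the next attempt.
        prompt = f"{text}\n\nFix these errors: {'; '.join(errors)}"
    raise ValueError(f"extraction failed: {errors}")
```

Passing the stages in as callables keeps each one replaceable, for example swapping a PDF reader for OCR without touching the loop.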

Defining the schema first

Start from the data contract, not the prompt. A schema in code keeps the model honest and gives you a place to validate.

from pydantic import BaseModel, field_validator
from typing import Optional

class LineItem(BaseModel):
    description: str
    quantity: int
    unit_price: float

class Invoice(BaseModel):
    invoice_number: str
    vendor: str
    total: float
    currency: str
    items: list[LineItem]
    due_date: Optional[str] = None

    @field_validator("currency")
    @classmethod
    def iso_currency(cls, v):
        # Raise instead of assert: assertions are stripped under `python -O`.
        if len(v) != 3:
            raise ValueError("currency must be ISO 4217")
        return v.upper()
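
To see the contract at work, here is a small usage sketch. It repeats a copy of the models so it runs on its own, and the sample payloads are invented for illustration.

```python
from typing import Optional
from pydantic import BaseModel, ValidationError, field_validator

class LineItem(BaseModel):
    description: str
    quantity: int
    unit_price: float

class Invoice(BaseModel):
    invoice_number: str
    vendor: str
    total: float
    currency: str
    items: list[LineItem]
    due_date: Optional[str] = None

    @field_validator("currency")
    @classmethod
    def iso_currency(cls, v: str) -> str:
        if len(v) != 3:
            raise ValueError("currency must be ISO 4217")
        return v.upper()

# Invented sample payload, shaped like the JSON the model is asked to return.
good = {
    "invoice_number": "INV-001",
    "vendor": "Acme Ltd",
    "total": 30.0,
    "currency": "usd",
    "items": [{"description": "Widget", "quantity": 3, "unit_price": 10.0}],
}
invoice = Invoice.model_validate(good)
print(invoice.currency)   # normalized to upper case by the validator

# A payload with a bad currency and a missing vendor is rejected.
bad = {**good, "currency": "dollars"}
del bad["vendor"]
try:
    Invoice.model_validate(bad)
except ValidationError as e:
    print(e.error_count(), "validation errors")
```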

The extraction call

Pass the schema description to the model and require JSON. For most documents a small, fast model is plenty; reserve a larger one for dense legal text.

from openai import OpenAI
import json

client = OpenAI(
    base_url="https://modeldatabase.com/v1",
    api_key="mdb_live_...",
)

def extract(text):
    resp = client.chat.completions.create(
        model="openai/gpt-4o-mini",
        messages=[
            {"role": "system", "content":
             "Extract invoice fields. Return JSON with keys: "
             "invoice_number, vendor, total, currency, items "
             "(description, quantity, unit_price), due_date. "
             "Use null if a field is absent. Do not guess."},
            {"role": "user", "content": text},
        ],
        response_format={"type": "json_object"},
        temperature=0,
    )
    return json.loads(resp.choices[0].message.content)

The instruction "do not guess" matters. You want missing fields reported as null, not hallucinated, so validation can catch genuinely incomplete documents.
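
One way to act on those nulls is to check for them before any retry, since asking the model again will not recover data the document never contained. The sketch below is only an illustration: `REQUIRED_FIELDS` and `missing_required` are hypothetical names, and the required set is assumed from the non-optional fields of the schema above.

```python
# Assumed to mirror the non-optional fields of the Invoice schema.
REQUIRED_FIELDS = ("invoice_number", "vendor", "total", "currency", "items")

def missing_required(raw: dict) -> list[str]:
    """Return required keys that are absent or were reported as null."""
    return [k for k in REQUIRED_FIELDS if raw.get(k) is None]

raw = {"invoice_number": "INV-7", "vendor": None, "total": 12.5,
       "currency": "EUR", "items": []}
print(missing_required(raw))   # ['vendor']
```

A non-empty result suggests an incomplete document rather than a formatting slip, so it may be better sent to review than retried.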

Validate, then retry on failure

Run the model output through your schema. If it fails, send the validation errors back to the model for one correction attempt. This self-correcting step tends to recover failures caused by formatting or type slips, though it cannot recover data the document never contained.

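import json
from pydantic import ValidationError

def extract_validated(text, attempts=2):
    last_error = None
    for _ in range(attempts):
        # On a retry, append the previous errors so the model can correct them.
        prompt = (text if last_error is None
                  else f"{text}\n\nFix these errors: {last_error}")
        try:
            return Invoice.model_validate(extract(prompt))
        except (ValidationError, json.JSONDecodeError) as e:
            # Covers schema violations and output that is not valid JSON.
            last_error = str(e)
    raise ValueError(f"extraction failed: {last_error}")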
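
The raw `str(e)` works, but it can be verbose. As an optional refinement, the sketch below condenses the output of Pydantic's `ValidationError.errors()` into one line per problem. `format_errors` is a hypothetical helper, and the small `Order` model exists only to make the example runnable.

```python
from pydantic import BaseModel, ValidationError

class LineItem(BaseModel):
    description: str
    quantity: int
    unit_price: float

class Order(BaseModel):          # illustrative stand-in for Invoice
    items: list[LineItem]

def format_errors(e: ValidationError) -> str:
    """Render each error as 'path.to.field: message' for the retry prompt."""
    lines = []
    for err in e.errors():
        path = ".".join(str(part) for part in err["loc"])
        lines.append(f"{path}: {err['msg']}")
    return "\n".join(lines)

try:
    Order.model_validate(
        {"items": [{"description": "Widget", "quantity": "two", "unit_price": 1.5}]}
    )
except ValidationError as e:
    feedback = format_errors(e)
    print(feedback)
```

Shorter feedback keeps the retry prompt focused on the fields that need fixing and uses fewer tokens.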

Numbers, dates, and totals

Extraction is most error-prone on arithmetic and dates. Two cheap safeguards go a long way:

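Recompute totals yourself from line items and compare to the extracted total. A mismatch flags the document for human review. Note that the schema above has no tax, shipping, or discount fields, so invoices that include them will mismatch until you add those fields to the schema and to the sum.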
Normalize dates with a parsing library after extraction, rather than trusting the model to output a canonical format.

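def total_matches(inv: Invoice, tol=0.01):
    """Return True if the line items sum to the extracted total, within tol."""
    computed = sum(i.quantity * i.unit_price for i in inv.items)
    # Compare with a tolerance because float sums can be off by tiny amounts.
    return abs(computed - inv.total) <= tol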
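
The date safeguard can be sketched in the same spirit. This version uses only the standard library and tries a short list of formats in order. The format list is an assumption to adapt to your documents, and ambiguous inputs such as 03/04/2024 are resolved by whichever format is listed first.

```python
from datetime import datetime
from typing import Optional

# Assumed formats, tried in order. Day-first is listed before month-first here;
# swap them if your documents are mostly US-style.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%m/%d/%Y", "%d %B %Y", "%B %d, %Y")

def normalize_date(value: Optional[str]) -> Optional[str]:
    """Return the date as ISO 8601 (YYYY-MM-DD), or None if it cannot be parsed."""
    if not value:
        return None
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None

print(normalize_date("March 5, 2024"))   # 2024-03-05
print(normalize_date("31/01/2024"))      # 2024-01-31
```

Returning None for unparseable values lets the caller flag the document instead of silently keeping a bad date.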

Scaling and cost

Documents arrive in bursts, so process them through a queue with concurrent workers. Keep each request to a single document to bound context size and cost. Log token usage per document; if your volume is high, test whether openai/gpt-4o-mini meets your accuracy bar before reaching for a larger model. Because Model Database bills prepaid per token, you can run a sample batch on two models and pick the cheapest one that passes validation often enough.
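
That comparison can be reduced to a small calculation. The sketch below assumes you have already run the sample batch and recorded, per model, how many documents passed validation and how many tokens were used. The `BatchResult` fields, the prices, the counts, and the second model name are all placeholders, not real rates, results, or identifiers.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class BatchResult:
    model: str
    passed: int                 # documents that passed validation
    total: int                  # documents in the sample batch
    tokens: int                 # total tokens billed for the batch
    price_per_1k: float         # placeholder price, not a real rate

    @property
    def pass_rate(self) -> float:
        return self.passed / self.total

    @property
    def cost_per_doc(self) -> float:
        return self.tokens / 1000 * self.price_per_1k / self.total

def cheapest_passing(results: list[BatchResult],
                     min_pass_rate: float) -> Optional[BatchResult]:
    """Pick the lowest cost-per-document model that meets the accuracy bar."""
    ok = [r for r in results if r.pass_rate >= min_pass_rate]
    return min(ok, key=lambda r: r.cost_per_doc) if ok else None

# Placeholder numbers for illustration only.
results = [
    BatchResult("openai/gpt-4o-mini", passed=94, total=100,
                tokens=150_000, price_per_1k=0.001),
    BatchResult("example/larger-model", passed=99, total=100,
                tokens=150_000, price_per_1k=0.01),
]
choice = cheapest_passing(results, min_pass_rate=0.9)
print(choice.model if choice else "no model met the bar")
```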

Route anything that fails validation or the total check to a human queue. Over time, that queue tells you exactly where to improve your prompt or schema.
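
A minimal sketch of the worker pool and the review routing, using the standard library's thread pool. `process_batch` is a hypothetical name, and `extract_fn` stands in for whatever callable wraps extraction, validation, and the total check; it is assumed to raise on any failure.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable

def process_batch(
    docs: dict[str, str],
    extract_fn: Callable[[str], dict],
    max_workers: int = 4,
) -> tuple[dict[str, dict], dict[str, str]]:
    """Process documents concurrently, one request per document.

    Returns (accepted, needs_review): accepted maps document id to extracted
    data, needs_review maps document id to the failure reason.
    """
    accepted: dict[str, dict] = {}
    needs_review: dict[str, str] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(extract_fn, text): doc_id
                   for doc_id, text in docs.items()}
        for future in as_completed(futures):
            doc_id = futures[future]
            try:
                accepted[doc_id] = future.result()
            except Exception as exc:       # any failure goes to a human
                needs_review[doc_id] = str(exc)
    return accepted, needs_review
```

Threads suit this workload because each worker spends most of its time waiting on the network. Recording the failure reason alongside the document id is what makes the review queue useful for spotting prompt or schema problems.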

Get a key and credit at your dashboard, and find JSON-mode details in the docs.
