Use Cases

An Internal Knowledge Assistant for Your Team

Jonas Meyer · Feb 10, 2026 · 4 min read

Every growing company accumulates knowledge in wikis, docs, Slack threads, and people's heads. New hires can't find it; veterans get interrupted to re-explain it. An internal knowledge assistant fixes this by answering questions from your own documents. This article builds one on Model Database using retrieval-augmented generation (RAG).

RAG is the right pattern because it grounds answers in your actual content and cites sources, instead of relying on whatever the model memorized.

How RAG works

The assistant has two phases. Offline, you index your documents into a vector store. Online, you retrieve the chunks most relevant to a question and ask the model to answer using only those chunks.

Ingesting your documents

Chunk documents into passages of a few hundred words with slight overlap so context isn't cut mid-thought. Store each chunk's text and source alongside its embedding. Use any embedding model and vector database you like; the generation step is where Model Database comes in.

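def chunk(text, size=900, overlap=150):
    # size and overlap are counted in characters, not words
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    out, i = [], 0
    while i < len(text):
        out.append(text[i:i + size])
        i += size - overlap  # step forward, keeping `overlap` chars of shared context
    return out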
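To make "store each chunk's text and source alongside its embedding" concrete, here is a minimal in-memory sketch. The `toy_embed` function is only a stand-in (a hashed bag-of-words vector) so the example runs without any external service; in practice you would swap in a real embedding model and a vector database. The names `toy_embed`, `InMemoryIndex`, and the sample documents are illustrative assumptions, not part of Model Database.

```python
import hashlib
import math
import re

DIM = 1024  # toy vector size; a real embedding model defines its own


def toy_embed(text):
    """Stand-in embedding: hashed bag-of-words, L2-normalized."""
    vec = [0.0] * DIM
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        idx = int(hashlib.md5(word.encode()).hexdigest(), 16) % DIM
        vec[idx] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


class InMemoryIndex:
    """Keeps text, source, and embedding together for each chunk."""

    def __init__(self, embed=toy_embed):
        self.embed = embed
        self.rows = []

    def add(self, text, source):
        self.rows.append(
            {"text": text, "source": source, "embedding": self.embed(text)}
        )

    def retrieve(self, question, k=6):
        # Vectors are normalized, so the dot product is cosine similarity.
        q = self.embed(question)
        ranked = sorted(
            self.rows,
            key=lambda r: sum(a * b for a, b in zip(q, r["embedding"])),
            reverse=True,
        )
        return [{"text": r["text"], "source": r["source"]} for r in ranked[:k]]


# Illustrative documents, invented for this example.
index = InMemoryIndex()
index.add("Employees accrue vacation days every month.", "hr/vacation.md")
index.add("Deploys go out every Tuesday after code freeze.", "eng/deploys.md")
top = index.retrieve("How many vacation days do I get?", k=1)
```

The passages returned by `retrieve` carry the `text` and `source` keys that the prompt-building code later in this article expects, so the same shape can be kept when moving to a real vector store.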

Retrieval and the grounded answer

Once you've retrieved the top chunks for a question, the prompt does the heavy lifting. The system message must force the model to stay inside the provided context and to admit when it doesn't know.

from openai import OpenAI

client = OpenAI(
    base_url="https://modeldatabase.com/v1",
    api_key="mdb_live_...",
)

SYS = """You answer employee questions from the provided context only.
Cite sources as [1], [2] matching the numbered passages.
If the context lacks the answer, say so and suggest who to ask.
Never invent policy, numbers, or links."""

def answer(question, passages):
    context = "\n\n".join(
        f"[{i+1}] ({p['source']}) {p['text']}"
        for i, p in enumerate(passages)
    )
    resp = client.chat.completions.create(
        model="anthropic/claude-sonnet-4-6",
        messages=[
            {"role": "system", "content": SYS},
            {"role": "user", "content":
             f"Context:\n{context}\n\nQuestion: {question}"},
        ],
        temperature=0.1,
    )
    return resp.choices[0].message.content

Low temperature and a strict prompt keep answers faithful. The citation requirement lets employees verify, which is what makes the assistant trustworthy.
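Because citations are only useful if they point at real passages, it can help to check them before showing an answer. The helper below is a minimal sketch, assuming answers use the `[1]`, `[2]` markers requested in the system prompt; `cited_sources` is a hypothetical name introduced here.

```python
import re


def cited_sources(answer_text, passages):
    """Map [n] markers in an answer to passage sources.

    Returns (cited, invalid): cited is a list of (n, source) pairs,
    invalid lists markers that point at no passage.
    """
    cited, invalid = [], []
    markers = sorted({int(m) for m in re.findall(r"\[(\d+)\]", answer_text)})
    for n in markers:
        if 1 <= n <= len(passages):
            cited.append((n, passages[n - 1]["source"]))
        else:
            invalid.append(n)
    return cited, invalid
```

An answer with no valid citations, or with any invalid marker, is a reasonable candidate for a fallback such as "I couldn't find this in the docs."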

Wiring up a chat endpoint

Wrap retrieval and generation behind a simple endpoint, and stream the response so the UI feels instant.

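def build_prompt(question, passages):
    # Same numbered-context format that answer() uses
    context = "\n\n".join(
        f"[{i+1}] ({p['source']}) {p['text']}"
        for i, p in enumerate(passages)
    )
    return f"Context:\n{context}\n\nQuestion: {question}"

def chat(question, retrieve):
    passages = retrieve(question, k=6)
    stream = client.chat.completions.create(
        model="anthropic/claude-sonnet-4-6",
        messages=[
            {"role": "system", "content": SYS},
            {"role": "user", "content": build_prompt(question, passages)},
        ],
        temperature=0.1,
        stream=True,
    )
    for event in stream:  # named so it doesn't shadow chunk() above
        if not event.choices:
            continue
        delta = event.choices[0].delta.content
        if delta:
            yield delta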
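One way to expose the `chat` generator over HTTP is Server-Sent Events. The sketch below only covers the framing step and is independent of any web framework; the JSON payload shape and the `[DONE]` sentinel are conventions chosen for this example, not something Model Database requires.

```python
import json


def to_sse(deltas):
    """Wrap text deltas as Server-Sent Events frames."""
    for delta in deltas:
        # JSON-encode so newlines inside a delta can't break the framing.
        payload = json.dumps({"delta": delta})
        yield f"data: {payload}\n\n"
    yield "data: [DONE]\n\n"
```

Most web frameworks can stream an iterator like `to_sse(chat(question, retrieve))` as a response with the `text/event-stream` content type.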

Permissions and freshness

Choosing a model

Knowledge answers benefit from solid reasoning and instruction-following, so anthropic/claude-sonnet-4-6 is a sensible default. For very high volume or simpler FAQs, test openai/gpt-4o-mini and compare answer quality on a fixed set of real questions. Because both run through the same Model Database endpoint, switching models is a one-line change, and prepaid pay-as-you-go billing means you only pay for the questions people actually ask.
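Comparing models on a fixed question set can be as simple as the harness below. It is a sketch under stated assumptions: `answer_fn(model, question)` is a hypothetical wrapper around your own retrieval and generation, and `contains_expected` is a deliberately crude grader that you would likely replace with human review or a stricter check.

```python
def contains_expected(item, answer_text):
    """Crude grader: passes if the expected phrase appears in the answer."""
    return item["expect"].lower() in answer_text.lower()


def compare_models(models, questions, answer_fn, grade_fn=contains_expected):
    """Return {model: fraction of questions graded as acceptable}."""
    if not questions:
        raise ValueError("need at least one question")
    scores = {}
    for model in models:
        passed = sum(
            1 for item in questions
            if grade_fn(item, answer_fn(model, item["question"]))
        )
        scores[model] = passed / len(questions)
    return scores
```

Keeping the question set fixed between runs is what makes the scores comparable; changing the questions and the model at the same time tells you little.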

Rolling it out

Start with one well-maintained documentation set, such as your engineering handbook or HR policies, before expanding. A narrow, accurate assistant earns trust; a broad, hallucinating one loses it immediately. Measure deflection (questions answered without a human) and correction rate to know it's working.
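Both metrics can be computed from a simple interaction log. The sketch below assumes each logged question records two booleans, `escalated` (a human had to step in) and `corrected` (someone flagged the answer as wrong); this log shape, and reading "correction rate" as the share of flagged answers, are assumptions made for the example.

```python
def rollout_metrics(events):
    """Compute deflection and correction rates from logged questions."""
    total = len(events)
    if total == 0:
        return {"deflection_rate": 0.0, "correction_rate": 0.0}
    deflected = sum(1 for e in events if not e["escalated"])
    corrected = sum(1 for e in events if e["corrected"])
    return {
        "deflection_rate": deflected / total,
        "correction_rate": corrected / total,
    }
```

Tracking the two together matters: a rising deflection rate alongside a rising correction rate suggests the assistant is answering more questions but getting more of them wrong.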

Get your API key and credit at your dashboard, and see streaming and model details in the docs.
