Every growing company accumulates knowledge in wikis, docs, Slack threads, and people's heads. New hires can't find it; veterans get interrupted to re-explain it. An internal knowledge assistant fixes this by answering questions from your own documents. This article builds one on Model Database using retrieval-augmented generation (RAG).
RAG is the right pattern because it grounds answers in your actual content and cites sources, instead of relying on whatever the model memorized.
How RAG works
The assistant works in two phases. Offline, you index your documents into a vector store. Online, you retrieve the chunks most relevant to a question and ask the model to answer using only those chunks. That breaks down into three steps:
- Ingest: split documents into chunks and embed them.
- Retrieve: embed the question, find the nearest chunks.
- Generate: answer strictly from retrieved context, with citations.
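To make the ingest and retrieve steps concrete, here is a minimal in-memory sketch. The `toy_embed` function is a stand-in (hashed bag-of-words), not a real embedding model, and the names `ingest` and `search` are illustrative; in practice you would swap in your own embedding model and vector database.

```python
import hashlib
import math
import re
from collections import Counter

DIM = 256  # size of the toy embedding vector (arbitrary choice)

def toy_embed(text):
    """Stand-in embedding: hashed bag-of-words, L2-normalized.
    Replace with a real embedding model in practice."""
    vec = [0.0] * DIM
    words = re.findall(r"[a-z0-9]+", text.lower())
    for word, count in Counter(words).items():
        bucket = int(hashlib.md5(word.encode()).hexdigest(), 16) % DIM
        vec[bucket] += count
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def ingest(docs):
    """Offline phase: keep each passage's text and source next to its vector."""
    return [
        {"source": source, "text": text, "vector": toy_embed(text)}
        for source, text in docs
    ]

def search(index, question, k=2):
    """Online phase: embed the question and return the k nearest passages."""
    q = toy_embed(question)

    def score(entry):
        # Vectors are unit length, so the dot product is the cosine similarity.
        return sum(a * b for a, b in zip(q, entry["vector"]))

    return sorted(index, key=score, reverse=True)[:k]

index = ingest([
    ("handbook/vacation.md", "Employees accrue vacation days every month"),
    ("handbook/expenses.md", "Submit expense reports with receipts attached"),
])
top = search(index, "How do vacation days accrue?", k=1)
print(top[0]["source"])
```

Binding the index, for example `lambda question, k: search(index, question, k)`, gives a callable with the `retrieve(question, k=...)` shape used in the chat endpoint later in this article.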
Ingesting your documents
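Chunk documents into passages of a few hundred words, with a slight overlap so context isn't cut mid-thought. Store each chunk's text and source alongside its embedding. Use any embedding model and vector database you like; the generation step is where Model Database comes in. For simplicity, the sketch below splits on characters rather than words (900 characters is roughly 150 English words), so raise `size` if you want longer passages.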
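def chunk(text, size=900, overlap=150):
    """Split text into windows of `size` characters that overlap by `overlap`."""
    if not 0 <= overlap < size:
        # Without this guard the loop below would never advance.
        raise ValueError("overlap must be non-negative and smaller than size")
    out, i = [], 0
    while i < len(text):
        out.append(text[i:i + size])
        i += size - overlap
    return out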
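If you would rather size passages in words, as the prose above suggests, a word-based variant might look like the following. The function names, the default sizes, and the record layout (`source` and `text` keys, matching the passages used in the next section) are assumptions for illustration.

```python
def chunk_words(text, size=300, overlap=50):
    """Split text into passages of `size` words, each sharing `overlap`
    words with the passage before it."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    words = text.split()
    step = size - overlap
    passages = []
    for start in range(0, len(words), step):
        passages.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # this window already reached the end of the text
    return passages

def to_records(source, text):
    """Pair each passage with its source so citations can point back to it."""
    return [{"source": source, "text": p} for p in chunk_words(text)]
```

Splitting on whitespace ignores sentence and heading boundaries; splitting on paragraphs or headings first and then applying a size limit is a common refinement.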
Retrieval and the grounded answer
Once you've retrieved the top chunks for a question, the prompt does the heavy lifting. The system message must force the model to stay inside the provided context and to admit when it doesn't know.
from openai import OpenAI

# The OpenAI client is pointed at the Model Database endpoint via base_url.
client = OpenAI(
    base_url="https://modeldatabase.com/v1",
    api_key="mdb_live_...",
)

SYS = """You answer employee questions from the provided context only.
Cite sources as [1], [2] matching the numbered passages.
If the context lacks the answer, say so and suggest who to ask.
Never invent policy, numbers, or links."""

def answer(question, passages):
    # Number each passage so the model can cite it as [1], [2], ...
    context = "\n\n".join(
        f"[{i+1}] ({p['source']}) {p['text']}"
        for i, p in enumerate(passages)
    )
    resp = client.chat.completions.create(
        model="anthropic/claude-sonnet-4-6",
        messages=[
            {"role": "system", "content": SYS},
            {"role": "user", "content":
                f"Context:\n{context}\n\nQuestion: {question}"},
        ],
        temperature=0.1,  # low temperature favors consistent, literal answers
    )
    return resp.choices[0].message.content
A low temperature and a strict prompt help keep answers faithful to the retrieved context, though neither guarantees it. The citation requirement lets employees verify each claim against its source, which is what makes the assistant trustworthy.
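Citations are only useful if they point at real passages, so it can help to check them before showing an answer. The sketch below is one possible approach: it assumes the `[n]` citation format requested in the system prompt and the `source` key used for passages above; the function name `cited_sources` is hypothetical.

```python
import re

def cited_sources(answer_text, passages):
    """Map [n] markers in an answer back to the passages they cite.

    Returns (sources, invalid): the cited sources in order of first
    appearance, and any citation numbers that match no passage.
    """
    sources, invalid = [], []
    for match in re.finditer(r"\[(\d+)\]", answer_text):
        n = int(match.group(1))
        if 1 <= n <= len(passages):
            source = passages[n - 1]["source"]
            if source not in sources:
                sources.append(source)
        elif n not in invalid:
            invalid.append(n)
    return sources, invalid
```

An answer with no valid citations, or with any invalid ones, is a reasonable candidate for a fallback message or for the log of unanswered questions.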
Wiring up a chat endpoint
Wrap retrieval and generation behind a simple endpoint, and stream the response so the UI feels instant.
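def build_prompt(question, passages):
    # Number the passages so the model can cite them as [1], [2], ...
    context = "\n\n".join(
        f"[{i+1}] ({p['source']}) {p['text']}"
        for i, p in enumerate(passages)
    )
    return f"Context:\n{context}\n\nQuestion: {question}"

def chat(question, retrieve):
    passages = retrieve(question, k=6)
    stream = client.chat.completions.create(
        model="anthropic/claude-sonnet-4-6",
        messages=[
            {"role": "system", "content": SYS},
            {"role": "user", "content": build_prompt(question, passages)},
        ],
        temperature=0.1,
        stream=True,
    )
    for event in stream:
        if not event.choices:
            continue  # some streamed chunks may carry no choices
        delta = event.choices[0].delta.content
        if delta:
            yield delta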
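One common way to deliver a generator like `chat` to a browser is Server-Sent Events. The following is a framework-agnostic sketch of the framing only; the `to_sse` name, the JSON payload shape, and the closing `done` event are assumptions, not part of Model Database.

```python
import json

def to_sse(deltas):
    """Wrap streamed text fragments as Server-Sent Events frames."""
    for delta in deltas:
        # JSON-encode so newlines inside a fragment cannot break the framing.
        payload = json.dumps({"delta": delta})
        yield f"data: {payload}\n\n"
    # A final named event tells the client the answer is complete
    # (a convention assumed for this sketch).
    yield "event: done\ndata: {}\n\n"
```

Your web framework would send these frames in a response with the `text/event-stream` content type, for example by passing `to_sse(chat(question, retrieve))` to its streaming response helper.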
Permissions and freshness
- Respect access control: filter retrieval by what the asking user is allowed to see. Never let the model surface a document the employee can't open. Do this at the retrieval layer, before anything reaches the prompt.
- Keep the index fresh: re-index changed documents on a schedule or via webhooks so answers reflect current policy.
- Show sources: render the cited passages with links so people can read the original.
- Capture gaps: log questions the assistant couldn't answer. That list is a roadmap for what documentation to write next.
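As an illustration of the access-control point, here is a minimal sketch of filtering at the retrieval layer. It assumes each passage carries a hypothetical `allowed_groups` list and that `retrieve` has the `retrieve(question, k=...)` shape used earlier; passages without an ACL are dropped, so the filter fails closed.

```python
def allowed_passages(passages, user_groups):
    """Keep only passages whose ACL shares at least one group with the user."""
    groups = set(user_groups)
    return [p for p in passages if groups & set(p.get("allowed_groups", ()))]

def retrieve_for_user(retrieve, question, user_groups, k=6, oversample=4):
    """Fetch extra candidates, drop anything the user cannot open,
    then keep the top k of what remains."""
    candidates = retrieve(question, k=k * oversample)
    return allowed_passages(candidates, user_groups)[:k]
```

Filtering after the search can return fewer than `k` passages. If your vector database supports metadata filters, applying the permission filter inside the query itself is usually the more robust choice.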
Choosing a model
Knowledge answers benefit from solid reasoning and instruction-following, so anthropic/claude-sonnet-4-6 is a sensible default. For very high volume or simpler FAQs, test openai/gpt-4o-mini and compare answer quality on a fixed set of real questions. Because both run through the same Model Database endpoint, switching models is a one-line change, and prepaid pay-as-you-go billing means you only pay for the questions people actually ask.
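A fixed-question comparison can be as simple as the sketch below. The `ask(model, question)` callable, the `must_include` field, and the phrase-matching score are all assumptions for illustration; phrase matching is a crude proxy, and human review of a sample remains worthwhile.

```python
def compare_models(ask, models, eval_set):
    """Score each model by the fraction of questions whose answer
    contains every expected phrase (case-insensitive)."""
    if not eval_set:
        raise ValueError("eval_set must contain at least one question")
    scores = {}
    for model in models:
        hits = 0
        for item in eval_set:
            reply = ask(model, item["question"]).lower()
            if all(phrase.lower() in reply for phrase in item["must_include"]):
                hits += 1
        scores[model] = hits / len(eval_set)
    return scores

# Stubbed usage: replace stub_ask with a function that retrieves passages
# and calls the chat completions endpoint with the given model name.
def stub_ask(model, question):
    return "Expense reports are due within 30 days [1]."

eval_set = [{"question": "When are expense reports due?", "must_include": ["30 days"]}]
print(compare_models(stub_ask, ["model-a", "model-b"], eval_set))
```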
Rolling it out
Start with one well-maintained documentation set, such as your engineering handbook or HR policies, before expanding. A narrow, accurate assistant earns trust; a broad, hallucinating one loses it immediately. Measure deflection (questions answered without a human) and correction rate (for example, answers that users flag as wrong) to know whether it's working.
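Both metrics can be computed from a simple interaction log. The sketch below assumes a hypothetical schema with two booleans per interaction, `escalated` (the user still needed a human) and `corrected` (someone flagged the answer as wrong); adapt the definitions to however your team tracks these.

```python
def rollout_metrics(interactions):
    """Compute deflection and correction rates from logged interactions."""
    total = len(interactions)
    if total == 0:
        return {"deflection_rate": 0.0, "correction_rate": 0.0}
    deflected = sum(1 for i in interactions if not i["escalated"])
    corrected = sum(1 for i in interactions if i["corrected"])
    return {
        "deflection_rate": deflected / total,
        "correction_rate": corrected / total,
    }
```

Tracking both together matters: a high deflection rate is only good news if the correction rate stays low.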
Get your API key and credit at your dashboard, and see streaming and model details in the docs.