Diego Duarte

Asking Our Documents the Right Questions — Locally

Nov 3 2025

9 min

Building a Private RAG Assistant for Company Knowledge

There’s a quiet thrill in asking an AI a question and getting a sharp, context-rich answer.

It’s even better when the AI knows your own company’s history — the old projects, the niche case studies, the little industry details that only live in your internal docs.

That was the spark behind this small proof of concept (PoC): a local AI that can answer questions about our own documents without sending a single byte to an external server.


Lessons First: Making Sense of the Terminology

Before diving into what we built, it’s worth untangling some of the terminology that can overwhelm anyone stepping into the AI space:

  - LLM (Large Language Model): the model that generates answers; in our case, small models run locally through Ollama.
  - Embedding: a numeric vector that captures the meaning of a piece of text, so semantically similar texts end up close together.
  - Vector database: a store (ChromaDB, in this PoC) built for finding the embeddings most similar to a query.
  - Chunking: splitting documents into smaller pieces so each one can be embedded and retrieved on its own.
  - RAG (Retrieval-Augmented Generation): retrieving the most relevant chunks at query time and handing them to the model as context.
  - Fine-tuning: retraining a model on your own data to change how it behaves, rather than what it can look up.

With these concepts in mind, the decisions we made will make more sense.


Why We Built It

Our motivation was straightforward: experiment, explore, and see how AI could help us in our workflow.

A practical case came up when preparing for meetings with potential leads. If we could surface old projects in the same industry quickly, we could bring success stories to the table at the right moment.

The problem: our documentation is extensive, and finding the right example in time isn’t always easy.

Hence, the idea of “asking” our documents directly.


Why Local and Why Ollama

We chose to keep the whole setup local. Not because cloud models aren’t good, but because sending internal documentation to an external API wasn’t something we wanted to do — even for a PoC. Local models gave us control, privacy, and independence.

For the runtime, we used Ollama because:

  - It installs in minutes and runs models locally with a single command.
  - Swapping models is as simple as another ollama pull, with no environment juggling.
  - It exposes a local HTTP API, so the rest of the PoC can talk to it with plain requests.
  - Small models like phi3 and tinyllama run comfortably on ordinary developer hardware.

For a PoC, that combination of simplicity and flexibility made Ollama the right choice.


The Data Flow: Simple by Design

One of the nicest things about this PoC is how little ceremony it requires:

  1. Export a .zip of your docs (in our case, Markdown files from Outline).
  2. Drop it into the program.
  3. Ask questions.

No manual tagging, no special formatting — just “export, drop, ask.” The system walks nested folders recursively, chunks documents into smaller pieces, embeds them, and stores them in ChromaDB. It even keeps a hash of the last indexed file so unchanged docs aren’t reprocessed.
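As a rough sketch of that indexing flow (the helper names and chunk sizes here are illustrative assumptions, not the PoC’s exact code):

import hashlib
import zipfile
from pathlib import Path

CHUNK_SIZE = 1000     # characters per chunk; assumed value for illustration
CHUNK_OVERLAP = 200   # overlap so a thought isn't cut off at a chunk boundary

def export_hash(zip_path: str) -> str:
    # Hash the whole export so an unchanged zip can be skipped on the next run.
    return hashlib.sha256(Path(zip_path).read_bytes()).hexdigest()

def chunk_text(text: str) -> list[str]:
    # Split a document into overlapping, fixed-size chunks.
    step = CHUNK_SIZE - CHUNK_OVERLAP
    return [text[i:i + CHUNK_SIZE] for i in range(0, len(text), step)]

def load_markdown_chunks(zip_path: str) -> dict[str, list[str]]:
    # Walk the zip (nested folders included) and chunk every Markdown file.
    docs = {}
    with zipfile.ZipFile(zip_path) as zf:
        for name in zf.namelist():
            if name.endswith(".md"):
                text = zf.read(name).decode("utf-8", errors="ignore")
                docs[name] = chunk_text(text)
    return docs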

That simplicity was intentional: it makes experimentation comfortable.


Why RAG

RAG and fine-tuning aren’t mutually exclusive. They tackle different layers of the problem: fine-tuning reshapes how the model thinks — adapting its reasoning, tone, and domain understanding — while RAG extends what the model knows by dynamically grounding its answers in the most current and relevant data available at query time.

In our case, the document export is about 25 MB and changes often. Fine-tuning would force retraining every time new data appeared — a slow and unnecessary loop for an evolving dataset. RAG, instead, lets us retrieve the latest content at query time, keeping responses accurate without touching the model’s weights.

For this proof-of-concept, that balance made sense: real-time knowledge, minimal overhead, and the freedom to iterate fast — without pretending the model needs to memorize what it can simply look up.


Playing With Models

Ollama made it easy to try different models:

  - phi3:latest
  - phi3:mini
  - tinyllama:latest

We’re also considering qwen3:4b and starcoder2:3b for further balance tests.


Where We’re Headed

Right now, the system runs in the terminal. The vision is to evolve it into a Slack bot so anyone in the company can query the documentation naturally.

That means tackling bigger questions:

  - Where should the bot and its index live, while keeping everything inside our own infrastructure?
  - How do we keep the index in sync as documents keep changing?
  - Who should be able to ask what, and how do we handle access control?
  - How fast does an answer need to arrive to feel natural in a chat?

But those are production challenges. For now, we’ve proven the core idea: a local AI can tap into company knowledge and provide answers without relying on the cloud.


How to Recreate the Setup

Here’s a short guide for replicating this PoC.

1. Install Ollama

Download from: https://ollama.com/download

Start the server:

ollama serve

Pull the models you want to test:

ollama pull phi3:latest
ollama pull phi3:mini
ollama pull tinyllama:latest
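Optionally, you can sanity-check that the server is reachable from Python before wiring up the rest (Ollama listens on port 11434 by default):

import requests

# One-off completion against the local Ollama server.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "phi3:latest", "prompt": "Say hello in one sentence.", "stream": False},
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["response"])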

2. Prepare Your Documents

This PoC expects:

  - A single .zip archive of your documentation.
  - Markdown (.md) files inside it; nested folders are fine.

Example:

my_docs.zip
   ├── project1.md
   ├── industry_case.md
   ├── subfolder/
   │   └── nested_doc.md
   └── ...

3. Install Dependencies

pip install chromadb sentence-transformers requests tqdm

4. Index Your Documents

python main.py --zip /path/to/my_docs.zip --model phi3:latest

What happens:

  - The zip is read and Markdown files are collected, nested folders included.
  - Each document is split into smaller chunks.
  - Chunks are embedded and stored in ChromaDB.
  - A hash of the last indexed file is kept, so unchanged docs aren’t reprocessed on the next run.
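For a sense of what the embedding-and-storage step can look like with the libraries installed above, here is a minimal sketch. The collection name and embedding model are assumptions, not necessarily what main.py uses:

import chromadb
from sentence_transformers import SentenceTransformer

# Assumed embedding model; any sentence-transformers model works the same way.
embedder = SentenceTransformer("all-MiniLM-L6-v2")

# Persist embeddings locally so reindexing isn't needed on every run.
client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection("company_docs")

def index_chunks(doc_name: str, chunks: list[str]) -> None:
    # Embed each chunk and store it with a stable id plus its source document.
    embeddings = embedder.encode(chunks).tolist()
    collection.add(
        ids=[f"{doc_name}-{i}" for i in range(len(chunks))],
        documents=chunks,
        embeddings=embeddings,
        metadatas=[{"source": doc_name}] * len(chunks),
    )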


5. Ask Questions

Question (or 'quit'): What projects have we done in the healthcare industry?

The system:

  - Embeds the question with the same model used for the documents.
  - Pulls the most similar chunks from ChromaDB.
  - Sends those chunks, plus the question, to the model through Ollama.
  - Returns an answer grounded in your own docs.
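A rough sketch of that query path, reusing the embedder and collection from the indexing sketch above (the prompt wording and helper name are illustrative assumptions):

import requests

def answer_question(question: str, n_results: int = 4) -> str:
    # Embed the question and pull the closest chunks out of ChromaDB.
    query_embedding = embedder.encode([question]).tolist()
    results = collection.query(query_embeddings=query_embedding, n_results=n_results)
    context = "\n\n".join(results["documents"][0])

    # Hand the retrieved context to the local model and let it answer from it.
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "phi3:latest", "prompt": prompt, "stream": False},
        timeout=300,
    )
    resp.raise_for_status()
    return resp.json()["response"]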


6. Iterate on Models

To compare models, re-run with one of the speed-oriented presets:

python main.py --zip my_docs.zip --fast
python main.py --zip my_docs.zip --ultra-fast

From here, turning it into a Slack bot is the natural next step: wrap the Q&A logic into a Slack app, run it on a server, and let the team start asking away.
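As a sketch of what that next step could look like with Slack’s Bolt library (the tokens, the event choice, and the answer_question helper from the sketches above are assumptions; we haven’t built this yet):

import os
from slack_bolt import App
from slack_bolt.adapter.socket_mode import SocketModeHandler

app = App(token=os.environ["SLACK_BOT_TOKEN"])

@app.event("app_mention")
def handle_mention(event, say):
    # Treat whatever follows the bot mention as the question.
    question = event["text"].split(">", 1)[-1].strip()
    say(answer_question(question))

if __name__ == "__main__":
    SocketModeHandler(app, os.environ["SLACK_APP_TOKEN"]).start()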