The folder pattern handles structured work cleanly: a CSV in, a CSV out, a log line, done. Unstructured work — dozens of supplier PDFs, a decade of internal notes, a binder of policy documents — is a different shape of problem. There is no spreadsheet to clean; the question is “what does this pile of paper actually say about X?” and the answer needs to come back with a citation. That is the job retrieval-augmented generation, or RAG, was built for. This post adds a small, local RAG layer to the folder pattern using LlamaIndex, the Nomic embeddings model running on Ollama, and a filesystem vector store that lives next to the documents it indexes.
The reference implementations sit under web-order-research/ and wo-research-tools/ on our own machines — the same shape we are about to describe, hardened for live use. If you have read the rest of the series, none of the following will feel unfamiliar: the filesystem is still the queue, the SOP still tells the agent what to do, and there is still nothing to log into.
What RAG Actually Is, In One Paragraph
An embedding model converts a chunk of text into a vector — a fixed-length list of numbers that captures the meaning of the text well enough that semantically similar passages end up near each other in vector space. Index a corpus of documents that way and you can answer “what does my paperwork say about thing X?” by embedding the question, finding the nearest few chunks, and feeding only those chunks to a language model alongside the question. The model never sees the rest of the corpus; the agent never needs to walk the filesystem. The retrieval step keeps the context window small and the answers tied to source material the user can actually cite.
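The retrieval step described above can be sketched in a few lines. This is a toy illustration, not the LlamaIndex implementation: the three-dimensional vectors and chunk texts are made up by hand, and `cosine` and `top_k` are hypothetical helper names. A real embedder such as nomic-embed-text produces much longer vectors, but the ranking idea is the same.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, chunks, k=2):
    """Return the k chunks whose vectors are closest to the query vector."""
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, c["vector"]), reverse=True)
    return ranked[:k]


# Hand-made vectors and invented texts, purely illustrative.
chunks = [
    {"text": "chunk about late returns", "vector": [0.9, 0.1, 0.0]},
    {"text": "chunk about invoice terms", "vector": [0.1, 0.9, 0.0]},
    {"text": "chunk about return windows", "vector": [0.8, 0.2, 0.1]},
]
query_vec = [1.0, 0.0, 0.0]  # stands in for the embedded question about returns

for chunk in top_k(query_vec, chunks, k=2):
    print(chunk["text"])
```

Only the two chunks about returns are selected; the invoice chunk never reaches the language model.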
The Architecture
The architecture is the folder pattern with two extra ingredients: a /documents drop folder and a /storage directory holding the vector store. Both are plain folders on disk; the vector store itself is four small JSON files.
Two scripts, one folder of source documents, one folder of vector data. The agent never needs read access to /documents or /storage; it talks only to the query endpoint. That last point is the whole reason the pattern is worth building — you get to hand the agent a large, searchable corpus while keeping the underlying files behind a much narrower interface than “here, have my filesystem”.
Why a Filesystem Vector Store
The natural temptation, the moment vectors enter the conversation, is to reach for a dedicated vector database — Pinecone, Weaviate, Qdrant, Chroma in server mode, the lot. For a small-business workload they are overkill in exactly the way Zapier was overkill for the folder workflow. LlamaIndex’s default storage backend writes the docstore, the vector store, and the index metadata to a handful of JSON files on disk. For a corpus of a few thousand chunks — which covers most of the document piles a small business actually owns — that is plenty. It is fast, it is free, it backs up with cp -r, and it lives in the same git repository as the SOP that describes the workflow.
| File | What it holds |
|---|---|
| docstore.json | The text of every chunk, plus its source metadata (file name, page number). |
| default__vector_store.json | The embedding vector for each chunk, keyed by chunk id. |
| index_store.json | The index structure that ties chunks to vectors. |
| graph_store.json | Optional graph relationships between nodes; usually empty for this style of index. |
Four files, all human-readable, all easy to inspect with jq when something looks odd. The same philosophy that makes folder workflows pleasant to debug applies to the index itself: nothing is hidden behind a service.
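As an alternative to jq, a short Python check can summarise the vector file. This sketch assumes the vectors sit under a top-level `embedding_dict` key in default__vector_store.json, which is how the store has looked in the versions we have inspected; verify against your own file before relying on it. `summarise_vector_store` is a hypothetical helper name.

```python
import json
from pathlib import Path


def summarise_vector_store(storage_dir):
    """Return (vector_count, sorted list of vector lengths seen) for the persisted store."""
    path = Path(storage_dir) / "default__vector_store.json"
    data = json.loads(path.read_text(encoding="utf-8"))
    # Assumption: vectors are stored under "embedding_dict", keyed by chunk id.
    embeddings = data.get("embedding_dict", {})
    dimensions = {len(vector) for vector in embeddings.values()}
    return len(embeddings), sorted(dimensions)
```

Calling `summarise_vector_store("storage")` should report a single vector length. More than one length is a strong hint that two different embedding models have been mixed in the same store.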
Ingestion: Trace Every Chunk Back to Its Source
The ingestion script is short. LlamaIndex’s SimpleDirectoryReader — or, for scanned PDFs, a small OCR-aware wrapper around it — turns each file in /documents into a list of Document objects, one per page. Each Document carries metadata that survives the entire pipeline: file_name, page_label, and anything else worth keeping. A node parser splits long pages into chunks, the embedder turns each chunk into a vector, and the index writes the lot to /storage. Re-running the script picks up only files that are not already indexed, by checking the existing docstore for known file_name values.
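    import os
    from pathlib import Path

    from llama_index.core import Settings, StorageContext, VectorStoreIndex, load_index_from_storage
    from llama_index.core.node_parser import SimpleNodeParser
    from llama_index.embeddings.ollama import OllamaEmbedding
    from llama_index.llms.ollama import Ollama

    # OCRPDFReader, get_existing_docs, documents_dir and storage_dir are defined elsewhere in the script.

    # Use the same local embedder and LLM for every run.
    Settings.embed_model = OllamaEmbedding(
        model_name="nomic-embed-text",
        request_timeout=120.0,
    )
    Settings.llm = Ollama(model="gemma4", request_timeout=120.0)

    # Load every PDF in the drop folder, one Document per page.
    reader = OCRPDFReader(use_ocr=True, ocr_languages="eng")
    all_documents = []
    for pdf_file in Path(documents_dir).glob("*.pdf"):
        all_documents.extend(reader.load_data(pdf_file))

    # Keep only documents whose file_name is not already in the docstore.
    existing = get_existing_docs(storage_dir)
    new_docs = [d for d in all_documents
                if d.metadata.get("file_name") not in existing]

    if os.path.exists(storage_dir):
        # Existing index: load it and append the new chunks.
        storage_context = StorageContext.from_defaults(persist_dir=storage_dir)
        index = load_index_from_storage(storage_context)
        nodes = SimpleNodeParser.from_defaults().get_nodes_from_documents(new_docs)
        index.insert_nodes(nodes)
    else:
        # First run: build the index from scratch.
        index = VectorStoreIndex.from_documents(new_docs)

    # Persist in both cases so inserted chunks are written back to disk.
    index.storage_context.persist(persist_dir=storage_dir)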
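The listing calls get_existing_docs without showing it. One way such a helper might look is sketched below. It is not the reference implementation: it assumes docstore.json keeps its chunks under a top-level `docstore/data` key, with each entry wrapping its fields in `__data__`, so check that layout against your own store.

```python
import json
from pathlib import Path


def get_existing_docs(storage_dir):
    """Return the set of file_name values already present in the docstore."""
    docstore_path = Path(storage_dir) / "docstore.json"
    if not docstore_path.exists():
        return set()  # nothing has been indexed yet

    data = json.loads(docstore_path.read_text(encoding="utf-8"))
    names = set()
    # Assumption: chunks live under "docstore/data", each wrapped in "__data__".
    for entry in data.get("docstore/data", {}).values():
        metadata = entry.get("__data__", {}).get("metadata", {})
        if "file_name" in metadata:
            names.add(metadata["file_name"])
    return names
```

Returning a set keeps the membership test in the list comprehension cheap, and returning an empty set when the file is missing lets the same call work on the very first run.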
The metadata is the part worth lingering on. Every chunk carries the file name and page label of the original document all the way through to the answer, which means the agent’s reply can be cited as “per 2025-supplier-terms.pdf, page 7” rather than the vague paraphrase you usually get from a chatbot. That citation is the difference between an answer the owner can act on and one they have to verify manually before doing anything with it.
Querying: A Narrow Door for the Agent
The query side is even smaller: load the index from /storage, embed the question, retrieve the top k chunks, hand them and the question to a local LLM, return the answer along with the source citations. Wrap that in a thin Flask endpoint — api-fast.py in our reference repo — and the agent has a single URL to call. No filesystem access, no embeddings library on the agent side, no vector database client to keep in sync.
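A minimal sketch of that endpoint follows. It is not the contents of api-fast.py: the `/query` route, the `question` field, the default of four retrieved chunks, and the helper names `format_answer` and `create_app` are all assumptions made for illustration. It also assumes Settings.embed_model and Settings.llm have been configured exactly as in the ingestion script before the app is created.

```python
def format_answer(response):
    """Turn a LlamaIndex-style response into a JSON-ready dict with citations."""
    citations = []
    for hit in response.source_nodes:
        meta = hit.node.metadata
        citations.append({
            "file_name": meta.get("file_name"),
            "page": meta.get("page_label"),
            "score": hit.score,
        })
    return {"answer": str(response), "citations": citations}


def create_app(storage_dir="storage", top_k=4):
    """Build a small Flask app exposing POST /query (route name is an assumption)."""
    # Imported lazily so format_answer can be reused without these packages installed.
    from flask import Flask, jsonify, request
    from llama_index.core import StorageContext, load_index_from_storage

    # Load the persisted index once, at start-up.
    storage_context = StorageContext.from_defaults(persist_dir=storage_dir)
    index = load_index_from_storage(storage_context)
    query_engine = index.as_query_engine(similarity_top_k=top_k)

    app = Flask(__name__)

    @app.post("/query")
    def query():
        payload = request.get_json(silent=True) or {}
        question = str(payload.get("question", "")).strip()
        if not question:
            return jsonify({"error": "question is required"}), 400
        return jsonify(format_answer(query_engine.query(question)))

    return app
```

Running it could be as simple as `create_app().run(host="127.0.0.1", port=5000)`. Binding to the loopback address keeps the endpoint off the network unless you decide otherwise.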
[Diagram: the agent sends a question to api-fast.py, which embeds it with Nomic embeddings, retrieves the top-k chunks from ./storage, passes the selected chunks (with file_name and page) to a local LLM via Ollama, and returns the answer plus citations to the agent.]
From the agent’s perspective the corpus may as well be infinite. It does not need a 200K-token context window; it needs the right four paragraphs out of the right two PDFs and the citations to prove they came from there. The retrieval step does that job in tens of milliseconds against a corpus of a few thousand chunks, on commodity hardware, with no network call leaving the building.
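On the agent side, the whole integration can be a single HTTP call. The sketch below uses only the standard library and rests on assumptions: that the endpoint listens at `http://127.0.0.1:5000/query`, accepts a JSON body with a `question` field, and returns JSON with `answer` and `citations` keys, where each citation has `file_name` and `page`. Adjust the names to whatever your endpoint actually exposes.

```python
import json
import urllib.request

DEFAULT_URL = "http://127.0.0.1:5000/query"  # assumed address of the local endpoint


def build_query_request(question, url=DEFAULT_URL):
    """Build the POST request carrying the question as JSON."""
    body = json.dumps({"question": question}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(question, url=DEFAULT_URL, timeout=120):
    """Send the question and return the decoded JSON reply."""
    with urllib.request.urlopen(build_query_request(question, url), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def cite(result):
    """Render citations as 'per <file>, page <n>' strings."""
    return [f"per {c['file_name']}, page {c['page']}" for c in result.get("citations", [])]
```

The agent needs no LlamaIndex install and no path to the documents; everything it knows about the corpus arrives in the reply.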
Where This Fits in the Folder Workflow
Add the RAG layer alongside an existing folder workflow rather than replacing it. The structured pipelines — postcode cleaners, invoice parsers, sales reports — keep doing what they do. The unstructured pile — supplier terms, internal SOPs, old quotes, contracts, manuals — goes into /documents, gets indexed nightly, and becomes a question-answerable resource for the agent on the next morning’s tasks. The same five-folder discipline still applies: there is one job, one folder, one SOP describing what belongs there and what comes back out.
| Folder | Drop in | Get out |
|---|---|---|
| supplier_docs_rag/ | Supplier T&Cs, MSAs, price-list PDFs | Cited answers to “what does Supplier X charge for late returns?” |
| policy_rag/ | Internal HR & ops SOPs | Cited answers for “what is our policy on Y?” without re-reading the binder |
| quotes_history_rag/ | Historic quotes & PO archive | “What did we charge this customer last time?” in one sentence |
Operational Notes
- Idempotent ingestion. The script checks the existing docstore for known file names and only embeds new documents. Re-running it on the same folder is safe and almost instant.
- Run nightly, not interactively. A launchd or cron sweep at 03:00 keeps the index in step with whatever lands in /documents during the day. The query endpoint stays up; only the data behind it is refreshed.
- Back the store up. The four JSON files in /storage are the entire index. Run cp -r storage storage.bak before any large re-ingest, and you can roll back in seconds if something goes sideways.
- Pick the embedder once and stick to it. Mixing embedding models in a single store will quietly ruin retrieval quality. nomic-embed-text is a sensible default; switching later means re-embedding the corpus.
- Chunk sizes matter more than the model. The defaults in SimpleNodeParser are reasonable for prose; long tables and code listings benefit from larger chunks. Tune once, write the chosen values into the SOP, leave them alone.
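The backup note above can also live in the ingestion script itself, so the copy is never forgotten. This is a sketch under assumptions: `backup_store` and `restore_store` are hypothetical helper names, and the timestamped folder naming is one choice among many.

```python
import shutil
from datetime import datetime
from pathlib import Path


def backup_store(storage_dir="storage"):
    """Copy the whole store to a timestamped sibling folder and return its path."""
    source = Path(storage_dir)
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    target = source.with_name(f"{source.name}.bak-{stamp}")
    shutil.copytree(source, target)
    return target


def restore_store(backup_dir, storage_dir="storage"):
    """Replace the live store with a previously taken backup."""
    target = Path(storage_dir)
    if target.exists():
        shutil.rmtree(target)  # drop the damaged store before copying back
    shutil.copytree(backup_dir, target)
```

Calling `backup_store()` at the top of a large re-ingest gives a rollback point, and `restore_store(path_returned_earlier)` puts the old index back. Restart or reload the query endpoint afterwards so it picks up the restored files.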
Closing Thought
RAG is often sold as a major piece of architecture — vector databases, dedicated infrastructure, a new platform to learn. Inside a small business it is none of those things. It is one Python script that ingests, one Python script that queries, one folder of source documents, and one folder of JSON. The agent gets a large, searchable, citable corpus through a single narrow endpoint; the owner keeps every document on their own machine; the cost of running it is the electricity for the embedding model. Same discipline, same tools, same boring layout — with a quietly powerful new question the agent can now answer about the business’s own paperwork.
If you would like the reference implementation cloned, customised against your own document pile, and left running on a machine of your choice with a one-line query endpoint your harness already knows how to call, get in touch.