17 May 2026

Local RAG for Folder Workflows: A Filesystem Vector Store with LlamaIndex & Nomic

AI

A practical follow-up to the folder workflows post — a fast local vector store that turns a drop folder of PDFs and notes into searchable, source-traceable context for an agent, without giving the agent the keys to the filesystem.

The folder pattern handles structured work cleanly: a CSV in, a CSV out, a log line, done. Unstructured work — dozens of supplier PDFs, a decade of internal notes, a binder of policy documents — is a different shape of problem. There is no spreadsheet to clean; the question is “what does this pile of paper actually say about X?” and the answer needs to come back with a citation. That is the job retrieval-augmented generation, or RAG, was built for. This post adds a small, local RAG layer to the folder pattern using LlamaIndex, the Nomic embeddings model running on Ollama, and a filesystem vector store that lives next to the documents it indexes.

The reference implementations sit under web-order-research/ and wo-research-tools/ on our own machines — the same shape we are about to describe, hardened for live use. If you have read the rest of the series, none of the following will feel unfamiliar: the filesystem is still the queue, the SOP still tells the agent what to do, and there is still nothing to log into.

What RAG Actually Is, In One Paragraph

An embedding model converts a chunk of text into a vector — a fixed-length list of numbers that captures the meaning of the text well enough that semantically similar passages end up near each other in vector space. Index a corpus of documents that way and you can answer “what does my paperwork say about thing X?” by embedding the question, finding the nearest few chunks, and feeding only those chunks to a language model alongside the question. The model never sees the rest of the corpus; the agent never needs to walk the filesystem. The retrieval step keeps the context window small and the answers tied to source material the user can actually cite.
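
As a rough illustration of the "find the nearest few chunks" step, the sketch below ranks chunks by cosine similarity to a question vector. The three-dimensional vectors and chunk ids are made up for the example; real embeddings come from the embedding model and have hundreds of dimensions, and LlamaIndex performs this ranking internally.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, chunk_vecs, k=2):
    """Return the ids of the k chunks most similar to the query, best first."""
    scored = sorted(
        chunk_vecs.items(),
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [chunk_id for chunk_id, _ in scored[:k]]


# Made-up toy "embeddings", purely for illustration.
chunks = {
    "returns-policy": [0.9, 0.1, 0.0],
    "late-fees": [0.8, 0.3, 0.1],
    "office-party": [0.0, 0.2, 0.9],
}
question = [1.0, 0.2, 0.0]

nearest = top_k(question, chunks, k=2)
print(nearest)
```

Only the chunks named in `nearest` would be passed to the language model; the unrelated `office-party` chunk never enters the context window.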

The Architecture

The architecture is the folder pattern with two extra ingredients: a /documents drop folder and a /storage directory holding the vector store. Both are plain folders on disk; the vector store itself is four small JSON files.

flowchart TB
    USER(["Owner / agent"]) --> DROP[/"./documents/<br/>dropped PDFs & notes"/]
    DROP --> INGEST[["ingest_fast.py<br/>LlamaIndex + OCR reader"]]
    INGEST -->|chunks + metadata| EMBED[["Nomic embeddings<br/>via Ollama (local)"]]
    EMBED --> STORE[/"./storage/<br/>docstore.json<br/>default__vector_store.json<br/>index_store.json"/]
    STORE --> API[["api-fast.py<br/>query endpoint"]]
    AGENT(["Agent harness"]) -->|natural language question| API
    API -->|answer + cited chunks| AGENT

Two scripts, one folder of source documents, one folder of vector data. The agent never needs read access to /documents or /storage; it talks only to the query endpoint. That last point is the whole reason the pattern is worth building — you get to hand the agent a large, searchable corpus while keeping the underlying files behind a much narrower interface than “here, have my filesystem”.

Why a Filesystem Vector Store

The natural temptation, the moment vectors enter the conversation, is to reach for a dedicated vector database — Pinecone, Weaviate, Qdrant, Chroma in server mode, the lot. For a small-business workload they are overkill in exactly the way Zapier was overkill for the folder workflow. LlamaIndex’s default storage backend writes the docstore, the vector store, and the index metadata to a handful of JSON files on disk. For a corpus of a few thousand chunks — which covers most of the document piles a small business actually owns — that is plenty. It is fast, it is free, it backs up with cp -r, and it lives in the same git repository as the SOP that describes the workflow.

| File | What it holds |
| --- | --- |
| docstore.json | The text of every chunk, plus its source metadata (file name, page number). |
| default__vector_store.json | The embedding vector for each chunk, keyed by chunk id. |
| index_store.json | The index structure that ties chunks to vectors. |
| graph_store.json | Optional graph relationships between nodes; usually empty for this style of index. |

Four files, all human-readable, all easy to inspect with jq when something looks odd. The same philosophy that makes folder workflows pleasant to debug applies to the index itself: nothing is hidden behind a service.
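
For anyone who prefers Python to jq, a minimal sketch of the same kind of inspection follows. It deliberately assumes nothing about the internal layout of the files: it only reports each JSON file's size and top-level keys. The function name `summarise_store` is hypothetical, not part of the reference implementation.

```python
import json
from pathlib import Path


def summarise_store(storage_dir):
    """Return {file name: (size in bytes, sorted top-level keys)} per JSON file."""
    summary = {}
    for path in sorted(Path(storage_dir).glob("*.json")):
        with path.open(encoding="utf-8") as handle:
            data = json.load(handle)
        # Only JSON objects have keys; anything else is reported with none.
        keys = sorted(data) if isinstance(data, dict) else []
        summary[path.name] = (path.stat().st_size, keys)
    return summary


def print_summary(storage_dir="./storage"):
    """Print one line per JSON file found in the storage folder."""
    for name, (size, keys) in summarise_store(storage_dir).items():
        print(f"{name}: {size} bytes, top-level keys: {keys}")
```

Calling `print_summary()` from the project root gives a quick sanity check that the files exist and are non-empty before digging deeper.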

Ingestion: Trace Every Chunk Back to Its Source

The ingestion script is short. LlamaIndex’s SimpleDirectoryReader — or, for scanned PDFs, a small OCR-aware wrapper around it — turns each file in /documents into a list of Document objects, one per page. Each Document carries metadata that survives the entire pipeline: file_name, page_label, and anything else worth keeping. A node parser splits long pages into chunks, the embedder turns each chunk into a vector, and the index writes the lot to /storage. Re-running the script picks up only files that are not already indexed, by checking the existing docstore for known file_name values.

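import os
from pathlib import Path

from llama_index.core import (
    Settings,
    StorageContext,
    VectorStoreIndex,
    load_index_from_storage,
)
from llama_index.core.node_parser import SimpleNodeParser
from llama_index.embeddings.ollama import OllamaEmbedding
from llama_index.llms.ollama import Ollama

# OCRPDFReader and get_existing_docs are helpers from the reference
# implementation and are not shown here.
documents_dir, storage_dir = "./documents", "./storage"

# Local models served by Ollama; nothing leaves the machine.
Settings.embed_model = OllamaEmbedding(
    model_name="nomic-embed-text",
    request_timeout=120.0,
)
Settings.llm = Ollama(model="gemma4", request_timeout=120.0)

# Read every PDF into per-page Document objects, with OCR for scans.
reader = OCRPDFReader(use_ocr=True, ocr_languages="eng")
all_documents = []
for pdf_file in Path(documents_dir).glob("*.pdf"):
    all_documents.extend(reader.load_data(pdf_file))

# Keep only documents whose file_name is not already in the docstore.
existing = get_existing_docs(storage_dir)
new_docs = [d for d in all_documents
            if d.metadata.get("file_name") not in existing]

# Check for a persisted docstore rather than just the folder, so that an
# empty ./storage directory is treated as a first run.
if os.path.exists(os.path.join(storage_dir, "docstore.json")):
    storage_context = StorageContext.from_defaults(persist_dir=storage_dir)
    index = load_index_from_storage(storage_context)
    nodes = SimpleNodeParser.from_defaults().get_nodes_from_documents(new_docs)
    index.insert_nodes(nodes)
else:
    index = VectorStoreIndex.from_documents(new_docs)

index.storage_context.persist(storage_dir)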
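
The post does not show `get_existing_docs`, so the following is only one plausible way to write it, not the reference implementation. It assumes chunk metadata is stored as nested JSON objects inside docstore.json, and it walks the whole file looking for `file_name` values rather than relying on a specific layout. The helper `collect_file_names` is a hypothetical name introduced for this sketch.

```python
import json
from pathlib import Path


def collect_file_names(node, found):
    """Recursively gather every string stored under a "file_name" key."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "file_name" and isinstance(value, str):
                found.add(value)
            else:
                collect_file_names(value, found)
    elif isinstance(node, list):
        for item in node:
            collect_file_names(item, found)


def get_existing_docs(storage_dir):
    """Return the set of file names already present in docstore.json."""
    docstore_path = Path(storage_dir) / "docstore.json"
    if not docstore_path.exists():
        return set()  # first run: nothing has been indexed yet
    with docstore_path.open(encoding="utf-8") as handle:
        data = json.load(handle)
    found = set()
    collect_file_names(data, found)
    return found
```

Returning a set keeps the `not in existing` membership test cheap even when the docstore holds thousands of chunks.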

The metadata is the part worth lingering on. Every chunk carries the file name and page label of the original document all the way through to the answer, which means the agent’s reply can be cited as “per 2025-supplier-terms.pdf, page 7” rather than the vague paraphrase you usually get from a chatbot. That citation is the difference between an answer the owner can act on and one they have to verify manually before doing anything with it.
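
A small sketch of how that metadata might be turned into citation strings is shown below. The keys `file_name` and `page_label` are the ones named above; the function names and the fallback wording are assumptions made for the example.

```python
def format_citation(metadata):
    """Build a human-readable citation from one chunk's metadata."""
    file_name = metadata.get("file_name", "unknown source")
    page = metadata.get("page_label")
    if page is None:
        return f"per {file_name}"
    return f"per {file_name}, page {page}"


def format_citations(chunks_metadata):
    """Format every chunk's citation, dropping repeats but keeping order."""
    seen = []
    for metadata in chunks_metadata:
        citation = format_citation(metadata)
        if citation not in seen:
            seen.append(citation)
    return seen


example = format_citation(
    {"file_name": "2025-supplier-terms.pdf", "page_label": "7"}
)
print(example)
```

Deduplicating matters because a single page is often split into several chunks, and two of them may be retrieved for the same question.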

Querying: A Narrow Door for the Agent

The query side is even smaller: load the index from /storage, embed the question, retrieve the top k chunks, hand them and the question to a local LLM, return the answer along with the source citations. Wrap that in a thin Flask endpoint — api-fast.py in our reference repo — and the agent has a single URL to call. No filesystem access, no embeddings library on the agent side, no vector database client to keep in sync.

flowchart TB
    AGENT(["Agent"]) -->|POST /query<br/>question| API[["api-fast.py"]]
    API -->|embed| EMB[["Nomic embeddings"]]
    EMB --> RETR{"Top-k retrieval<br/>from ./storage"}
    RETR --> CTX["Selected chunks<br/>+ file_name + page"]
    CTX --> LLM[["Local LLM<br/>via Ollama"]]
    LLM -->|answer + citations| AGENT
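
The flow above could be sketched as a small Flask app along the following lines. This is an illustration, not the contents of api-fast.py: the request and response shapes (`question`, `answer`, `sources`), the `top_k` default of 4, and the function names are all assumptions. The Flask and LlamaIndex imports sit inside `create_app` so the payload helper can be used without them installed.

```python
from typing import Any


def build_payload(answer: str, source_metadata: list[dict[str, Any]]) -> dict[str, Any]:
    """Shape the JSON reply: the answer plus one citation entry per chunk."""
    sources = [
        {"file_name": m.get("file_name"), "page": m.get("page_label")}
        for m in source_metadata
    ]
    return {"answer": answer, "sources": sources}


def create_app(storage_dir: str = "./storage", top_k: int = 4):
    """Load the persisted index once and expose it behind POST /query."""
    from flask import Flask, jsonify, request
    from llama_index.core import Settings, StorageContext, load_index_from_storage
    from llama_index.embeddings.ollama import OllamaEmbedding
    from llama_index.llms.ollama import Ollama

    # The embedding model must match the one used at ingestion time.
    Settings.embed_model = OllamaEmbedding(model_name="nomic-embed-text")
    Settings.llm = Ollama(model="gemma4", request_timeout=120.0)

    storage_context = StorageContext.from_defaults(persist_dir=storage_dir)
    index = load_index_from_storage(storage_context)
    query_engine = index.as_query_engine(similarity_top_k=top_k)

    app = Flask(__name__)

    @app.post("/query")
    def query():
        body = request.get_json(silent=True) or {}
        question = str(body.get("question", "")).strip()
        if not question:
            return jsonify({"error": "missing 'question'"}), 400
        response = query_engine.query(question)
        # Each retrieved chunk carries the metadata attached at ingestion.
        metadata = [hit.node.metadata for hit in response.source_nodes]
        return jsonify(build_payload(str(response), metadata))

    return app
```

Running it with something like `create_app().run(host="127.0.0.1", port=5000)` binds the endpoint to localhost only, which keeps the narrow door narrow.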

From the agent’s perspective the corpus may as well be infinite. It does not need a 200K-token context window; it needs the right four paragraphs out of the right two PDFs and the citations to prove they came from there. The retrieval step does that job in tens of milliseconds against a corpus of a few thousand chunks, on commodity hardware, with no network call leaving the building.
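
On the agent side, the whole integration can be a single HTTP call using only the standard library. The sketch below assumes the endpoint listens at a hypothetical local address and accepts a JSON body with a `question` field; adjust both to whatever the real endpoint expects.

```python
import json
import urllib.request

# Hypothetical address; the post only specifies the POST /query route.
QUERY_URL = "http://127.0.0.1:5000/query"


def build_query_request(question, url=QUERY_URL):
    """Build the POST request without sending it (handy for inspection)."""
    payload = json.dumps({"question": question}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(question, url=QUERY_URL, timeout=120):
    """Send the question to the endpoint and return the decoded JSON reply."""
    request = build_query_request(question, url)
    with urllib.request.urlopen(request, timeout=timeout) as reply:
        return json.loads(reply.read().decode("utf-8"))
```

The generous timeout reflects that a local LLM may take a while to write the answer, even though retrieval itself is fast.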

Where This Fits in the Folder Workflow

Add the RAG layer alongside an existing folder workflow rather than replacing it. The structured pipelines — postcode cleaners, invoice parsers, sales reports — keep doing what they do. The unstructured pile — supplier terms, internal SOPs, old quotes, contracts, manuals — goes into /documents, gets indexed nightly, and becomes a question-answerable resource for the agent on the next morning’s tasks. The same five-folder discipline still applies: there is one job, one folder, one SOP describing what belongs there and what comes back out.

| Folder | Drop in | Get out |
| --- | --- | --- |
| supplier_docs_rag/ | Supplier T&Cs, MSAs, price-list PDFs | Cited answers to “what does Supplier X charge for late returns?” |
| policy_rag/ | Internal HR & ops SOPs | Cited answers for “what is our policy on Y?” without re-reading the binder |
| quotes_history_rag/ | Historic quotes & PO archive | “What did we charge this customer last time?” in one sentence |

Operational Notes

  • Idempotent ingestion. The script checks the existing docstore for known file names and only embeds new documents. Re-running it on the same folder is safe and almost instant.
  • Run nightly, not interactively. A launchd or cron sweep at 03:00 keeps the index in step with whatever lands in /documents during the day. The query endpoint stays up; only the data behind it is refreshed.
  • Back the store up. The four JSON files in /storage are the entire index. cp -r storage storage.bak before any large re-ingest, and you can roll back in seconds if something goes sideways.
  • Pick the embedder once and stick to it. Mixing embedding models in a single store will quietly ruin retrieval quality. nomic-embed-text is a sensible default; switching later means re-embedding the corpus.
  • Chunk sizes matter more than the model. The defaults in SimpleNodeParser are reasonable for prose; long tables and code listings benefit from larger chunks. Tune once, write the chosen values into the SOP, leave them alone.
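
The backup note above can also be scripted so it runs before every re-ingest. This is a minimal sketch: the `backup_store` name, the `./storage_backups` location, and the timestamp format are assumptions, and it keeps every backup rather than pruning old ones.

```python
import shutil
from datetime import datetime
from pathlib import Path


def backup_store(storage_dir="./storage", backup_root="./storage_backups"):
    """Copy the storage folder to a timestamped sibling; return the new path."""
    source = Path(storage_dir)
    if not source.is_dir():
        return None  # nothing to back up yet (first ingest)
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    target = Path(backup_root) / f"{source.name}-{stamp}"
    # copytree creates backup_root if needed and fails if target exists,
    # so a backup is never silently overwritten.
    shutil.copytree(source, target)
    return target
```

Rolling back is then a matter of replacing the contents of the storage folder with the most recent timestamped copy.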

Closing Thought

RAG is often sold as a major piece of architecture — vector databases, dedicated infrastructure, a new platform to learn. Inside a small business it is none of those things. It is one Python script that ingests, one Python script that queries, one folder of source documents, and one folder of JSON. The agent gets a large, searchable, citable corpus through a single narrow endpoint; the owner keeps every document on their own machine; the cost of running it is the electricity for the embedding model. Same discipline, same tools, same boring layout — with a quietly powerful new question the agent can now answer about the business’s own paperwork.

If you would like the reference implementation cloned, customised against your own document pile, and left running on a machine of your choice with a one-line query endpoint your harness already knows how to call, get in touch.

Blog post by
Ilya Titov
