If your team already digs through documents all day (manuals, contracts, policies, product sheets) and you want an assistant that answers questions about them without making things up, RAG is the technique and n8n is the fastest way to build it without writing a backend. A RAG agent in n8n is built from three pieces: a vector store that holds your documents as embeddings, the OpenAI Embeddings node that turns them into vectors, and the AI Agent node that, for each question, retrieves the relevant chunks and answers with that real information. You build the flow by dragging nodes, indexing 500 pages costs pennies, and each query with gpt-4o-mini runs USD 0.0003 to 0.002. This guide walks through the full flow with concrete business examples and costs worked out.
What RAG is and why it matters
RAG stands for Retrieval Augmented Generation: instead of asking the AI to answer from memory (where it hallucinates), it first searches the relevant chunks in your documents and injects them as context into the prompt. The model only drafts the answer from that material.
The problem it solves is concrete. A model like gpt-4o-mini doesn't know your warranty manual, your internal pricing, or the supplier contract you signed in March. Ask it, and it invents something plausible. With RAG, before answering, the system retrieves the two or three sections of your documents that cover the topic, and the model replies with real data, pointing at where it came from.
For a small business that's the difference between a chatbot that gives generic answers and an assistant that says "per the returns policy in effect since April, you have 30 days" because it actually read your PDF.
The pieces of the flow in n8n
n8n ships every node natively. No separate backend required:
| Piece | n8n node | What it does |
|---|---|---|
| Embeddings | OpenAI Embeddings (text-embedding-3-small) | Turns text into numeric vectors |
| Vector store | Pinecone / Qdrant / Supabase / PGVector | Stores and searches those vectors by similarity |
| Loader | Default Data Loader + Text Splitter | Splits documents into chunks |
| Brain | AI Agent + OpenAI Chat Model | Retrieves chunks and drafts the answer |
| Memory | Window Buffer Memory (optional) | Keeps the thread of a conversation |
Step 1: index your documents (the ingestion flow)
This flow runs once (or whenever documents change). The goal is to fill the vector store.
- Trigger: a manual node, or a Google Drive trigger that fires when you drop a PDF into a folder.
- Extract text: the Extract from File node pulls text from the PDF, DOCX or TXT.
- Text Splitter: split the text into chunks of roughly 500-1000 characters with 100 of overlap. Chunking is critical: chunks too large dilute the search, too small lose context.
- OpenAI Embeddings: convert each chunk into a vector. Use
text-embedding-3-small, which costs USD 0.02 per million tokens. - Vector Store (Insert): store each vector with its original text as metadata.
Indexing a 500-page manual (~250,000 words, ~330,000 tokens) costs about USD 0.007 in embeddings. That's not a typo: it's under a cent.
Step 2: the agent that answers (the query flow)
This is the flow your team uses day to day, wired to a form, WhatsApp, or an internal chat.
- Trigger: webhook, chat, or a WhatsApp/Telegram node.
- AI Agent: the central node. You attach the Vector Store (Retrieve) as a tool, which for each question fetches the top-k chunks (typically 3-4) that match best.
- OpenAI Chat Model: gpt-4o-mini for 80% of cases; bump to gpt-4o only when answers need finer reasoning.
- Memory (optional): for a conversational chat, add Window Buffer Memory so it remembers earlier questions.
- Response: the agent returns the text, ideally citing the source document.
The trick is the agent's system prompt: tell it explicitly "answer only with the information from the retrieved documents; if it's not there, say you don't know." That cuts hallucinations to nearly zero.
Want an assistant that answers about your manuals and contracts without inventing, live this week? Book an intro call and we'll show you a RAG running on your own documents.
A real example: a distributor's support
An auto-parts distributor had a technical catalog of 1,200 sheets (compatibilities, dimensions, equivalents). The sales team lost 10-15 minutes per query digging through PDFs.
We built a RAG in n8n wired to WhatsApp:
- Indexing: the 1,200 sheets in Qdrant, embedding cost USD 0.03 once.
- Query: the rep asks "which oil filter fits a 2018 diesel Hilux?" and the agent retrieves the right sheet and answers in 4 seconds.
- Cost per query: ~USD 0.001 with gpt-4o-mini.
- Result: 200 daily queries resolved without opening a PDF, monthly AI cost near USD 6.
This kind of integration usually pairs with AI chatbots or a broader AI automation flow, depending on where the questions live.
End-to-end real costs
| Item | Model / service | Estimated cost |
|---|---|---|
| Index 500 pages | text-embedding-3-small | ~USD 0.007 (once) |
| Embedding each question | text-embedding-3-small | ~USD 0.000002 |
| Answer per query | gpt-4o-mini | ~USD 0.0003 to 0.002 |
| 1,000 queries/month | gpt-4o-mini | ~USD 1 to 3 |
| Managed vector store | Qdrant Cloud / Pinecone free | USD 0 to start |
The dominant cost isn't the AI: it's the time to get chunking and the system prompt right. Once calibrated, the RAG runs for pennies.
When a RAG in n8n does NOT make sense
Let's be honest, RAG isn't the answer to everything:
- Few documents: if you have 3 short PDFs that fit in the model's context, paste them straight into the prompt. RAG adds needless complexity.
- Data that changes every minute: the vector store needs reindexing when documents change. If the source updates constantly, a direct query to your database or API is better.
- Exact calculations: "how much did we sell in April?" is SQL over structured data, not semantic search. RAG is for text and qualitative information.
- You need 100% legal precision: RAG reduces hallucinations but doesn't eliminate them. For answers with legal consequences, always keep a human in the loop.
If your case calls for something more bespoke (business logic, permissions, complex integrations), it's probably not just an n8n workflow but a piece of custom software or a production AI agent with its own backend.
Common mistakes that wreck a RAG
- Badly defined chunks: mistake number one. Try different sizes and measure what retrieves best.
- Top-k too low or too high: with 1 chunk you lose context, with 15 you add noise and burn tokens. Start with 3-4.
- Mixing languages silently: if your documents are in Spanish and people ask in English, flag it in the prompt.
- Not storing the source: without document metadata, you can't verify where an answer came from.
- Defaulting to the priciest model: gpt-4o-mini handles most. Upgrade only when you notice it failing.
How we ship it in production
A demo RAG is built in an afternoon. A production one needs: automatic reindexing when documents change, control over who can ask what, query logs to audit answers, and a clean fallback when the agent finds nothing. That jump from "works on my screen" to "the whole team uses it without surprises" is where most projects stall.
At Deepyze we build RAG agents over each company's real documents: manuals, contracts, catalogs, knowledge bases. We wire it to WhatsApp, your CRM, or an internal chat, with controlled costs and verifiable answers. Start your project and on one call we'll tell you whether RAG is right for you or whether something else fits better, no hype.
Frequently asked questions
What is a RAG agent in n8n?+
RAG (Retrieval Augmented Generation) is a technique where the AI model, before answering, searches relevant chunks inside your own documents and uses them as context. In n8n it's built with three pieces: a vector store that holds your documents as embeddings, an OpenAI Embeddings node, and the AI Agent node that retrieves and answers. That way the model replies with your real information instead of making things up.
Do I need to code to build RAG in n8n?+
No. n8n ships native nodes for vector stores (Pinecone, Qdrant, Supabase, PGVector) and for OpenAI embeddings. You build the flow by dragging nodes. You should understand the concepts (chunking, embeddings, top-k) so answers stay accurate, but you won't write code beyond occasional data cleanup.
How much does running a RAG agent with OpenAI in n8n cost?+
Two costs: indexing your documents once (embeddings with text-embedding-3-small cost about USD 0.02 per million tokens, so 500 pages cost pennies) and each query (embedding the question plus the model's answer). With gpt-4o-mini, a typical RAG query costs between USD 0.0003 and USD 0.002. A thousand queries a month land around USD 1 to 3.
Which vector store should I start with?+
If you already use Postgres or Supabase, PGVector or Supabase Vector is simplest because you don't add another service. If you want something managed and free to start, Qdrant Cloud or Pinecone have free tiers that cover thousands of documents. For fully self-hosted, run Qdrant in Docker next to your n8n.
When does RAG in n8n NOT make sense?+
When you have few documents that fit inside the model's context (just paste them in the prompt), when the information changes every minute and you can't reindex fast enough, or when you need exact calculations over structured data (that calls for a database and SQL, not a vector store).
Does a RAG agent in n8n leak confidential information?+
The vector store holds chunks of your documents, so confidentiality depends on where you host it. With self-hosted Qdrant or PGVector, data never leaves your server except the text sent to OpenAI per query. If you need zero data to third parties, you can run an open-source embedding model and LLM on-premise.
Want this working in your company?
At Deepyze we turn manual processes into systems that work on their own: AI automation, web and mobile apps, and custom software. Tell us your case and you will have a concrete proposal within 24 hours.
Sin compromiso · Respuesta en 24 hs · Equipo en tu mismo huso horario