May 16, 2026 · 5 min read

RAG - Complete Practical Guide

RAG, an LLM upgrade.

Introduction

Retrieval Augmented Generation (RAG) is one of the biggest pillars in today's AI field, used heavily by large companies for better internal management and retrieval of documents. In this article I will explain core RAG concepts with code snippets for a better grasp, discuss some common problems I faced when implementing my own RAG system, and present solutions along the way.

What is RAG?

RAG (Retrieval Augmented Generation) is a system design pattern that combines:

  • Information retrieval (finding relevant knowledge)
  • Large Language Models (LLMs) (generating responses)

Instead of relying only on what the model learned during training, a RAG system retrieves external knowledge and injects it into the prompt.

Traditional LLM

Question
   ↓
Model Memory (Training Data)
   ↓
Answer

Problems:

  • knowledge can be outdated
  • hallucinations happen
  • cannot access private company data

RAG based LLM

Question
   ↓
Retrieve Relevant Knowledge
   ↓
Add Context to Prompt
   ↓
LLM Generates Grounded Answer

This makes answers:

  • more accurate
  • grounded in documents
  • customizable
  • domain-specific

Why RAG?

LLMs are powerful but limited.

Common problems:

1. Hallucinations

The model invents facts.

Example:

Question:
Who founded Company X?

Answer:
John Smith.

Even if John Smith never existed.

2. Knowledge Cutoff

Models only know what they were trained on.

They do not automatically know:

  • your PDFs
  • internal documentation
  • GitHub repositories
  • recent updates

3. Private Data

Businesses need AI over:

  • internal docs
  • policies
  • tickets
  • codebases

RAG solves this.

Core Architecture

A RAG system usually contains:

  1. Documents
  2. Chunking system
  3. Embedding model
  4. Vector database
  5. Retriever
  6. Prompt constructor
  7. LLM

Architecture:

Documents
   ↓
Chunking
   ↓
Embeddings
   ↓
Vector Database

User Question
   ↓
Question Embedding
   ↓
Similarity Search
   ↓
Relevant Chunks
   ↓
Prompt Construction
   ↓
LLM
   ↓
Answer

How RAG Works Step by Step

1. Documents

The system starts with raw documents.

Examples:

  • TXT files
  • PDFs
  • Markdown files
  • HTML pages
  • GitHub repos

Example text:

RAG systems use vector databases to retrieve
relevant information for LLMs.

2. Chunking

Documents are split into smaller sections.

Why?

Embedding an entire book as a single vector is ineffective: the meaning gets diluted and retrieval becomes imprecise.

Instead:

Large Document
   ↓
Small Chunks

Example:

Chunk 1 → Intro
Chunk 2 → Embeddings
Chunk 3 → Pinecone

3. Embeddings

Every chunk becomes a vector.

Example:

"RAG systems use retrieval"

becomes:

[0.12, -0.77, 0.48, ...]

4. Store in Vector Database

Vectors are stored in:

  • Pinecone
  • Weaviate
  • Qdrant
  • Chroma
  • FAISS

5. User Question

Example:

What are embeddings?

Question becomes a vector too.

6. Similarity Search

The vector database finds:

Most similar chunks

based on mathematical similarity.

7. Prompt Construction

Retrieved chunks are injected into prompt.

Example:

Context:
Embeddings are vector representations.

Question:
What are embeddings?

8. LLM Generation

The LLM generates an answer using retrieved context.

Key Concepts and Definitions

1. Embedding

A numerical semantic representation of text.

Example:

"Machine learning"
↓
[0.12, -0.34, ...]

Purpose:

  • semantic understanding
  • similarity search

2. Vector

An ordered list of numbers.

Example:

[0.12, -0.55, 0.91]

3. Dimension

The number of values inside a vector.

Example:

768-dimensional vector

means:

768 numbers

Why it matters:

Your vector DB dimension must match embedding dimension.

Example:

nomic-embed-text → 768
Pinecone index → must be 768
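
One quick sanity check is to measure the dimension empirically before creating your index; a minimal sketch using the ollama client:

import ollama

# Check the embedding dimension before creating the index
response = ollama.embeddings(model="nomic-embed-text", prompt="test")
print(len(response["embedding"]))  # 768 → the Pinecone index must use dimension=768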

4. Semantic Search

Search by meaning.

Not exact keywords.

Example:

Question:

How does memory work?

Can retrieve:

Agents retain context using memory systems.

5. Similarity Score

Measures closeness between vectors.

Higher score:

More relevant

Top-K

How many results to retrieve.

Example:

top_k=5

Means:

Return best 5 chunks

6. Metadata

Extra information attached to vectors.

Example:

{
  "text": "Embeddings are vectors",
  "source": "notes.txt",
  "topic": "rag"
}

Embeddings Explained

Embeddings convert text into mathematical meaning.

Texts with similar meanings end up close together.

Example:

"How to build AI agents"

and

"Creating autonomous agents"

become nearby vectors.

Generating Embeddings with Ollama

import ollama


def generate_embedding(text):
    # Ask the local embedding model for a vector
    # representation of the text
    response = ollama.embeddings(
        model="nomic-embed-text",
        prompt=text
    )

    return response["embedding"]

Test:

embedding = generate_embedding(
    "What is RAG?"
)

print(len(embedding))
print(embedding[:10])

The code snippets above are from a RAG project I implemented; you can view the source code here

Vector Databases

A vector database stores embeddings.

Traditional DB:

Search by exact values

Vector DB:

Search by similarity

Common vector DBs:

  • Pinecone
  • Qdrant
  • Weaviate
  • Chroma
  • FAISS

Chunking

Chunking is splitting documents.

1. Why Chunking Matters

Bad chunking = bad retrieval.

Example problem:

Chunk 1:
RAG systems use semantic

Chunk 2:
search through vectors

Meaning gets broken.

2. Character-Based Chunking

def chunk_text(text,
               chunk_size=800,
               overlap=150):

    chunks = []
    start = 0

    while start < len(text):

        end = start + chunk_size

        chunk = text[start:end]
        chunks.append(chunk)

        # Step forward by less than chunk_size so
        # consecutive chunks share `overlap` characters
        start += chunk_size - overlap

    return chunks
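
A quick sanity check on the function above, using a made-up sample text:

text = "RAG systems use semantic search through vectors. " * 50

chunks = chunk_text(text)

print(len(chunks))                          # number of chunks produced
print(chunks[0][-150:] == chunks[1][:150])  # True: consecutive chunks share 150 characters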

3. Overlap

Preserves context.

Example:

Chunk 1 → 0-800
Chunk 2 → 650-1450

Overlap:

150 characters

Similarity Metric

Pinecone compares vectors.

Usually using:

Cosine Similarity

Measures angle similarity.

Similar meaning:

High cosine score
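
Pinecone computes this for you, but a small pure-Python version makes the idea concrete:

import math

def cosine_similarity(a, b):
    # Angle-based closeness: 1.0 = same direction, 0 = unrelated, -1.0 = opposite
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([0.12, -0.55, 0.91], [0.10, -0.50, 0.88]))   # ≈ 1.0 (very similar)
print(cosine_similarity([0.12, -0.55, 0.91], [-0.90, 0.20, 0.05]))   # negative (unrelated)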

Retrieval Pipeline

Example retrieval:

query_embedding = generate_embedding(query)

results = index.query(
    vector=query_embedding,
    top_k=5,
    include_metadata=True
)

Explanation:

vector=query_embedding

Search using question vector.

top_k=5

Retrieve top 5 results.

include_metadata=True

Return original chunk text.
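
To inspect what came back, loop over the matches; the response shape below follows the query call above:

for match in results["matches"]:
    print(match["score"])             # similarity score, higher = more relevant
    print(match["metadata"]["text"])  # the original chunk text stored as metadata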

Prompt Augmentation

This is the "augmentation" in RAG.

We inject context.

Example:

context = "\n\n".join(
    match["metadata"]["text"]
    for match in results["matches"]
)

Prompt Example

prompt = f"""
You are a helpful assistant.

Answer ONLY using the context.

Context:
{context}

Question:
{query}

Answer:
"""

Generation Phase

Send the prompt to the LLM. In my case, I used Mistral running locally through Ollama:

response = ollama.chat(
    model="mistral",
    messages=[
        {
            "role": "user",
            "content": prompt
        }
    ]
)

print(response["message"]["content"])
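
Putting the pieces together, here is a minimal end-to-end sketch built from the snippets above (ask is just a hypothetical name; it assumes the generate_embedding function and Pinecone index from earlier):

def ask(query):
    # 1. Retrieve: embed the question and search the vector database
    query_embedding = generate_embedding(query)
    results = index.query(
        vector=query_embedding,
        top_k=5,
        include_metadata=True
    )

    # 2. Augment: inject the retrieved chunks into the prompt
    context = "\n\n".join(
        match["metadata"]["text"]
        for match in results["matches"]
    )

    prompt = f"""
You are a helpful assistant.

Answer ONLY using the context.

Context:
{context}

Question:
{query}

Answer:
"""

    # 3. Generate: let the LLM answer from the grounded context
    response = ollama.chat(
        model="mistral",
        messages=[{"role": "user", "content": prompt}]
    )

    return response["message"]["content"]

print(ask("What are embeddings?"))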

Pinecone Concepts

Below are some Pinecone concepts I used that you might find helpful.

1. Index

Container of vectors.

Equivalent to:

Database table

2. Creating Index

from pinecone import Pinecone

pc = Pinecone(api_key=API_KEY)

pc.create_index(
    name="rag-demo",
    dimension=768,
    metric="cosine",
    spec={
        "serverless": {
            "cloud": "aws",
            "region": "us-east-1"
        }
    }
)
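
Once the index exists, you connect to it before upserting or querying:

index = pc.Index("rag-demo")

print(index.describe_index_stats())  # vector count, dimension, etc.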

3. Upsert

Insert/update vectors.

index.upsert(vectors=vectors)
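
Here, vectors is a list of records, each with an id, the embedding values, and optional metadata. A minimal sketch, assuming the chunks list from the chunking section (the ids are hypothetical):

vectors = [
    {
        "id": f"chunk-{i}",                                 # hypothetical id scheme
        "values": generate_embedding(chunk),                # 768 floats
        "metadata": {"text": chunk, "source": "notes.txt"}
    }
    for i, chunk in enumerate(chunks)
]

index.upsert(vectors=vectors)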

4. Query

Search vectors.

index.query(...)

5. Delete

Delete vectors.

index.delete(delete_all=True)

Metadata in RAG

Store useful context.

Example:

metadata={
    "text": chunk,
    "source": "notes.txt",
    "section": "embeddings"
}

Useful later for:

  • filtering
  • citations
  • debugging
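
For example, Pinecone supports metadata filters at query time, so you can restrict the search to one source document:

results = index.query(
    vector=query_embedding,
    top_k=5,
    include_metadata=True,
    filter={"source": {"$eq": "notes.txt"}}   # only consider chunks from this file
)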

Best Practices

These are some best practices to follow when building your RAG system:

  1. Retrieval quality > model quality
  2. Use metadata
  3. Keep chunks meaningful
  4. Avoid tiny chunks
  5. Re-index after document updates
  6. Use overlap
  7. Start simple before frameworks
  8. Debug retrieval separately from generation

However, there are some considerations: real production RAG systems often add features that my simple personal RAG system lacks, such as:

  • authentication
  • streaming
  • caching
  • citations
  • reranking
  • hybrid search
  • observability
  • evaluation pipelines
  • vector versioning
  • document syncing

Glossary

RAG: Retrieval-Augmented Generation
Embedding: Numerical representation of text
Vector: Ordered list of numbers
Dimension: Number of values in a vector
Chunk: Small document section
Metadata: Extra vector information
Top-K: Number of retrieved results
Similarity Search: Finding closest vectors
Cosine Similarity: Vector closeness metric
Index: Pinecone vector collection
Upsert: Insert/update a vector
Retrieval: Finding relevant knowledge
Generation: Producing the final answer
Hallucination: Fabricated answer
Reranking: Reordering retrieved chunks
Hybrid Search: Semantic + keyword retrieval

Conclusion

Dear reader, I hope my point of view on RAG helped you, even a little, to understand how these systems work under the hood: from embedding, to retrieval, to generating the proper response. That is the essence of a RAG system.
