technical-blog · retrieval-augmented generation

large language models are great at pattern matching, but they hallucinate when a question calls for facts they never learned. retrieval-augmented generation (rag) fixes this by letting the model look things up in an external knowledge base before answering.

in this post, we'll:

  • build a tiny rag demo that runs in your browser,
  • walk through the core architecture (indexing → retrieval → generation), and
  • sketch a python version you can scale up in a colab notebook.

rag in one picture

conceptually, rag is just:

user question
     │
     ▼
 ┌───────────┐       ┌────────────────────┐
 │ embedder  │──────▶│  vector database   │
 └───────────┘       └────────────────────┘
                                │
                      retrieve top-k chunks
                                │
                                ▼
                          ┌────────────┐
                          │    llm     │
                          └────────────┘
                                │
                                ▼
                        grounded answer ✔
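
in code, the same picture is just a small function. everything in this snippet is a placeholder you can swap out; the python sketch further down fills in real pieces:

    # conceptual shape only: embed, search, and generate are whatever you plug in
    def rag_pipeline(question, embed, search, generate, k=4):
        query_vector = embed(question)          # embedder
        chunks = search(query_vector, k=k)      # vector database → top-k chunks
        context = "\n\n".join(chunks)
        prompt = f"use only this context to answer.\n\n{context}\n\nquestion: {question}"
        return generate(prompt)                 # llm → grounded answer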

browser demo: keyword-based "rag"

a full rag system needs embeddings + a vector database. for a quick interactive demo, we can approximate retrieval with a simple keyword overlap score over a tiny "corpus" of notes.
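
here's a minimal sketch of that scoring idea, in python for readability (the browser widget runs the same logic in javascript; the notes below are just illustrative):

    # toy corpus; a note's score is the number of words it shares with the question
    notes = [
        "rag separates retrieval from generation",
        "a vector database stores document embeddings for similarity search",
        "grounding the llm in retrieved context reduces hallucination",
    ]

    def keyword_score(question: str, note: str) -> int:
        return len(set(question.lower().split()) & set(note.lower().split()))

    def retrieve(question: str, k: int = 2) -> list[str]:
        # rank notes by overlap with the question and keep the top k
        ranked = sorted(notes, key=lambda n: keyword_score(question, n), reverse=True)
        return ranked[:k]

    print(retrieve("how does a vector database help rag?"))

real rag replaces keyword overlap with similarity search over embedding vectors, which is what the python sketch further down does with FAISS.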

try it

ask a question about rag, and i'll show you which notes get "retrieved" and a toy answer.

python sketch: real rag with embeddings

below is a minimal python sketch you can adapt in a colab notebook.

    from langchain_community.vectorstores import FAISS
    from langchain.text_splitter import RecursiveCharacterTextSplitter
    from langchain_openai import OpenAIEmbeddings, ChatOpenAI
    
    # 1. load & split documents
    raw_docs = [
        "rag separates retrieval from generation...",
        "use a vector database to store document embeddings...",
        # add your own notes or PDFs here
    ]
    
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=512,
        chunk_overlap=64,
    )
    docs = splitter.create_documents(raw_docs)
    
    # 2. build the vector index
    embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
    db = FAISS.from_documents(docs, embeddings)
    
    # 3. rag chain (retrieve + generate)
    llm = ChatOpenAI(model="gpt-4.1-mini")
    
    def rag_answer(question: str):
        retrieved_docs = db.similarity_search(question, k=4)
        context = "\n\n".join([d.page_content for d in retrieved_docs])
    
        prompt = f"""
        you are a helpful tutor. use ONLY the context below to answer.
        if something isn't in the context, say you don't know.
    
        context:
        {context}
    
        question: {question}
        """
    
        response = llm.invoke(prompt)
        return response.content, retrieved_docs
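
to try it out (this assumes OPENAI_API_KEY is set in your environment; the question is just an example):

    answer, sources = rag_answer("why does rag reduce hallucinations?")
    print(answer)
    for doc in sources:
        print("-", doc.page_content[:80])   # peek at what was retrieved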

where to go next

  • swap the toy browser demo with real embeddings via an api.
  • index your own notes, pdfs, or blog posts.
  • add evaluation: log retrieved chunks + answers and inspect failure modes (see the sketch below).
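
a minimal version of that logging step, assuming the rag_answer function from the sketch above (the jsonl filename is arbitrary):

    import json

    def log_run(question: str, path: str = "rag_runs.jsonl") -> str:
        # run the chain, then append question, retrieved chunks, and answer as one jsonl record
        answer, sources = rag_answer(question)
        record = {
            "question": question,
            "retrieved": [d.page_content for d in sources],
            "answer": answer,
        }
        with open(path, "a") as f:
            f.write(json.dumps(record) + "\n")
        return answer

reading a handful of these records usually makes it clear whether failures come from retrieval (the wrong chunks) or from generation (the model ignoring the context).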