When people hear that large language models (LLMs) like GPT-4o can answer questions based on external documents, they often assume those documents were somehow "trained into" the model. But in most modern AI applications — especially those using Retrieval-Augmented Generation (RAG) — that's not the case.
So, how does a language model retrieve accurate, relevant information it wasn’t trained on?
The answer is: it searches, using a powerful combination of vector embeddings and pgvector, a vector-search extension layered on top of PostgreSQL.
Let’s break down exactly what happens when you ask a question.
Step 1: Your Question Becomes a Vector
The first step is to convert your natural language query into a dense vector — a numerical representation of its meaning. This is done using a specialized embedding model like text-embedding-3-small.
Example:
“How do I reduce my credit card debt?”
→ [0.034, -0.273, 0.891, ..., -0.116]
(a 1536-dimensional vector)
This isn't magic — it's math trained on language patterns.
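For example, with the OpenAI Python SDK, that embedding call might look like the sketch below (the client setup and model name are one common choice, not the only one):

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Embed the user's question; the result is a 1536-dimensional list of floats
response = client.embeddings.create(
    model="text-embedding-3-small",
    input="How do I reduce my credit card debt?",
)
query_vector = response.data[0].embedding
print(len(query_vector))  # 1536

Whatever embedding model you pick, the stored document vectors must come from the same model, otherwise the distances in the next step aren't meaningful.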
🔎 Step 2: pgvector Searches for Similar Meanings
Your system then issues a SQL query that uses pgvector, a PostgreSQL extension for storing and comparing vectors.
SELECT content
FROM documents
ORDER BY embedding <=> '[query_vector]'
LIMIT 5;
That <=> operator tells pgvector to compute the cosine distance between your query vector and every stored document vector (pgvector also offers <-> for Euclidean distance and <#> for inner product).
Smaller distance = more semantically similar.
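From application code, the same query might be issued like this. This is a minimal sketch that assumes psycopg 3 and the pgvector Python helper, reuses the query_vector from Step 1, and uses a placeholder connection string and table name:

import numpy as np
import psycopg
from pgvector.psycopg import register_vector

# Placeholder connection string; point it at your own database
conn = psycopg.connect("dbname=rag_demo")
register_vector(conn)  # teaches psycopg how to send/receive vector values

# query_vector is the embedding produced in Step 1
rows = conn.execute(
    """
    SELECT content
    FROM documents
    ORDER BY embedding <=> %s
    LIMIT 5
    """,
    (np.array(query_vector),),
).fetchall()

top_chunks = [content for (content,) in rows]

The <=> comparison runs inside PostgreSQL, so only the five best-matching rows ever leave the database.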
Step 3: pgvector Ranks the Most Relevant Content
Internally, pgvector:
Iterates through all stored embeddings (or uses an index if you're at scale)
Computes the similarity between your query and each document
Ranks the results
Returns the top-k most relevant chunks
These chunks aren’t just raw strings — they’re sections of your documents (usually 256–512 tokens each) that were split and embedded ahead of time, during ingestion.
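Ingestion is where those chunks come from. Here is a simplified sketch of that pipeline; the 400-token chunk size, 50-token overlap, and the chunk_text / ingest helper names are illustrative choices, not part of pgvector itself:

import numpy as np
import tiktoken
from openai import OpenAI

client = OpenAI()
encoding = tiktoken.get_encoding("cl100k_base")

def chunk_text(text: str, max_tokens: int = 400, overlap: int = 50) -> list[str]:
    """Split a document into overlapping chunks of roughly max_tokens tokens."""
    tokens = encoding.encode(text)
    step = max_tokens - overlap
    return [
        encoding.decode(tokens[start:start + max_tokens])
        for start in range(0, len(tokens), step)
    ]

def ingest(conn, document_text: str) -> None:
    """Embed each chunk and store it alongside its text in the documents table."""
    for chunk in chunk_text(document_text):
        embedding = client.embeddings.create(
            model="text-embedding-3-small",
            input=chunk,
        ).data[0].embedding
        conn.execute(
            "INSERT INTO documents (content, embedding) VALUES (%s, %s)",
            (chunk, np.array(embedding)),
        )
    conn.commit()

The documents table and embedding column here are the same ones the Step 2 query searches.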
🧠 Step 4: The LLM Uses That Retrieved Context
Now that pgvector has served up the top matching documents, your system feeds them into an LLM like GPT-4o, using a prompt template like:
Answer the question below using the following context:
[Chunk 1: ...]
[Chunk 2: ...]
[Chunk 3: ...]
Q: How do I reduce my credit card debt?
The LLM uses this context, the list of retrieved chunks, to generate its final answer.
That’s the core of RAG: retrieval first, generation second.
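Putting those two halves together in code might look like the sketch below, again assuming the OpenAI Python SDK; the prompt wording simply mirrors the template above:

from openai import OpenAI

client = OpenAI()

def answer(question: str, top_chunks: list[str]) -> str:
    """Build a RAG prompt from the retrieved chunks and ask the model."""
    context = "\n".join(
        f"[Chunk {i + 1}: {chunk}]" for i, chunk in enumerate(top_chunks)
    )
    prompt = (
        "Answer the question below using the following context:\n"
        f"{context}\n"
        f"Q: {question}"
    )
    completion = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return completion.choices[0].message.content

# Example usage with the chunks retrieved in Step 2:
# print(answer("How do I reduce my credit card debt?", top_chunks))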
🚀 Bonus: Fast, Scalable, and Flexible
pgvector supports approximate nearest neighbor (ANN) search using IVFFlat indexes, which lets you scale to millions of documents without scanning every stored vector on each query.
Example:
CREATE INDEX ON documents USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100);
SET ivfflat.probes = 10;
Here, lists controls how many clusters the index partitions your vectors into, and probes controls how many of those clusters are searched at query time (more probes means better recall but slower queries). With this setup, retrieval stays fast and cheap as your corpus grows, especially compared to hitting a traditional search API or re-embedding your documents on every request.
🔚 Final Thoughts
If you're building with LLMs and not using RAG, you're limiting your model to what it already "knows." But with a pgvector-based RAG stack, you can give your AI real-time access to a library of custom, private, or fast-changing information.
No fine-tuning required. Far fewer hallucinations from outdated data. Just accurate, explainable, and context-rich responses — on demand.