When people hear that large language models (LLMs) like GPT-4o can answer questions based on external documents, they often assume those documents were somehow "trained into" the model. But in most modern AI applications — especially those using Retrieval-Augmented Generation (RAG) — that's not the case.
So, how does a language model retrieve accurate, relevant information it wasn’t trained on?
The answer is: it searches, using a powerful combination of vector embeddings and pgvector, a vector-search extension layered on top of PostgreSQL.
Let’s break down exactly what happens when you ask a question.
Step 1: Your Question Becomes a Vector
The first step is to convert your natural language query into a dense vector — a numerical representation of its meaning. This is done using a specialized embedding model like text-embedding-3-small.
Example:
“How do I reduce my credit card debt?”
→ [0.034, -0.273, 0.891, ..., -0.116]
(a 1536-dimensional vector)
This isn't magic — it's math trained on language patterns.
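For example, with the OpenAI Python SDK, that embedding call might look like the sketch below (the client setup and model name are one common choice, not the only one):

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Embed the user's question; the result is a 1536-dimensional list of floats
response = client.embeddings.create(
    model="text-embedding-3-small",
    input="How do I reduce my credit card debt?",
)
query_vector = response.data[0].embedding
print(len(query_vector))  # 1536

Whatever embedding model you pick, the stored document vectors must come from the same model, otherwise the distances in the next step aren't meaningful.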
🔎 Step 2: pgvector Searches for Similar Meanings
Your system then issues a SQL query that uses pgvector, a PostgreSQL extension for storing and comparing vectors.
SELECT content
FROM documents
ORDER BY embedding <=> '[query_vector]'
LIMIT 5;
That <=> operator tells pgvector to compute the cosine distance between your query vector and every stored document vector (pgvector also offers <-> for Euclidean distance and <#> for inner product).
Smaller distance = more semantically similar.
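From application code, the same query might be issued like this. This is a minimal sketch that assumes psycopg 3 and the pgvector Python helper, reuses the query_vector from Step 1, and uses a placeholder connection string and table name:

import numpy as np
import psycopg
from pgvector.psycopg import register_vector

# Placeholder connection string; point it at your own database
conn = psycopg.connect("dbname=rag_demo")
register_vector(conn)  # teaches psycopg how to send/receive vector values

# query_vector is the embedding produced in Step 1
rows = conn.execute(
    """
    SELECT content
    FROM documents
    ORDER BY embedding <=> %s
    LIMIT 5
    """,
    (np.array(query_vector),),
).fetchall()

top_chunks = [content for (content,) in rows]

The <=> comparison runs inside PostgreSQL, so only the five best-matching rows ever leave the database.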
Step 3: pgvector Ranks the Most Relevant Content
Internally, pgvector:
Iterates through all stored embeddings (or uses an index if you're at scale)
Computes the similarity between your query and each document
Ranks the results
Returns the top-k most relevant chunks
These chunks aren’t just raw strings — they’re sections of your documents (usually 256–512 tokens each) that were split and embedded ahead of time, during ingestion.
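Ingestion is where those chunks come from. Here is a simplified sketch of that pipeline; the 400-token chunk size, 50-token overlap, and the chunk_text / ingest helper names are illustrative choices, not part of pgvector itself:

import numpy as np
import tiktoken
from openai import OpenAI

client = OpenAI()
encoding = tiktoken.get_encoding("cl100k_base")

def chunk_text(text: str, max_tokens: int = 400, overlap: int = 50) -> list[str]:
    """Split a document into overlapping chunks of roughly max_tokens tokens."""
    tokens = encoding.encode(text)
    step = max_tokens - overlap
    return [
        encoding.decode(tokens[start:start + max_tokens])
        for start in range(0, len(tokens), step)
    ]

def ingest(conn, document_text: str) -> None:
    """Embed each chunk and store it alongside its text in the documents table."""
    for chunk in chunk_text(document_text):
        embedding = client.embeddings.create(
            model="text-embedding-3-small",
            input=chunk,
        ).data[0].embedding
        conn.execute(
            "INSERT INTO documents (content, embedding) VALUES (%s, %s)",
            (chunk, np.array(embedding)),
        )
    conn.commit()

The documents table and embedding column here are the same ones the Step 2 query searches.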
🧠 Step 4: The LLM Uses That Retrieved Context
Now that pgvector has served up the top matching documents, your system feeds them into an LLM like GPT-4o, using a prompt template like:
Answer the question below using the following context:
[Chunk 1: ...]
[Chunk 2: ...]
[Chunk 3: ...]
Q: How do I reduce my credit card debt?
The LLM uses this context, the list of retrieved chunks, to generate its final answer.
That’s the core of RAG: retrieval first, generation second.
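Putting those two halves together in code might look like the sketch below, again assuming the OpenAI Python SDK; the prompt wording simply mirrors the template above:

from openai import OpenAI

client = OpenAI()

def answer(question: str, top_chunks: list[str]) -> str:
    """Build a RAG prompt from the retrieved chunks and ask the model."""
    context = "\n".join(
        f"[Chunk {i + 1}: {chunk}]" for i, chunk in enumerate(top_chunks)
    )
    prompt = (
        "Answer the question below using the following context:\n"
        f"{context}\n"
        f"Q: {question}"
    )
    completion = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return completion.choices[0].message.content

# Example usage with the chunks retrieved in Step 2:
# print(answer("How do I reduce my credit card debt?", top_chunks))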
🚀 Bonus: Fast, Scalable, and Flexible
pgvector supports approximate nearest neighbor (ANN) search using IVFFlat indexes, which lets you scale to millions of documents without scanning every stored vector on each query.
Example:
CREATE INDEX ON documents USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100);
SET ivfflat.probes = 10;
Here, lists controls how many clusters the index partitions your vectors into, and probes controls how many of those clusters are searched at query time (more probes means better recall but slower queries). With this setup, retrieval stays fast and cheap as your corpus grows, especially compared to hitting a traditional search API or re-embedding your documents on every request.
🔚 Final Thoughts
If you're building with LLMs and not using RAG, you're limiting your model to what it already "knows." But with a pgvector-based RAG stack, you can give your AI real-time access to a library of custom, private, or fast-changing information.
No fine-tuning required. Far fewer hallucinations from outdated data. Just accurate, explainable, and context-rich responses — on demand.