Semantic search and RAG¶
This section started with some slides and background on embeddings.
One of the most popular applications of LLMs is to build “ask questions of my own documents” systems.
You do not need to fine-tune a model for this. Instead, you can use a pattern called retrieval-augmented generation (RAG).
The key trick to RAG is simple: try to figure out the most relevant documents for the user’s question and stuff as many of them as possible into the prompt.
Long context models make this even more effective. “Reasoning” models may actually be less effective here.
Semantic search¶
If somebody asks “can my pup eat brassicas”, how can we ensure we pull in documents that talk about dogs eating sprouts?
One solution is to build semantic search on top of vector embeddings.
I wrote about those in Embeddings: What they are and why they matter.
Generating embeddings with LLM¶
LLM includes a suite of tools for working with embeddings, plus various plugins that add new embedding models.
We’ll start with the OpenAI hosted embedding model.
llm embed -m text-embedding-3-small-512 -c "can my pup eat brassicas?"
This returns a 512-dimensional vector of floats.
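The command outputs the vector as a plain JSON array of floats, so if you have jq installed you can confirm the length yourself:
llm embed -m text-embedding-3-small-512 -c "can my pup eat brassicas?" | jq length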
This becomes more useful once we have stored some embeddings to search against. Let’s embed the title and description of every CMSC course:
uv run llm embed-multi courses \
-d cmsc_course_embeddings.db \
--attach umd umd-202508.db \
--sql "select course_id, title, description from umd.course_info where department='CMSC'" \
--store \
-m text-embedding-3-small-512
There’s a lot going on there. We’re using the 512-dimensional text-embedding-3-small-512 model, saving the embeddings to a courses collection in a cmsc_course_embeddings.db SQLite database, embedding the title and description of each CMSC course, and (thanks to --store) keeping those titles and descriptions alongside their vectors.
Confirmed with:
llm collections -d cmsc_course_embeddings.db
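Under the hood LLM stores the vectors in an embeddings table inside that SQLite file, so if you have sqlite-utils installed you can also peek at the raw rows, for example to count how many courses were embedded:
sqlite-utils cmsc_course_embeddings.db "select count(*) from embeddings"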
And now we can search that collection for items similar to a term using:
llm similar -c 'artificial intelligence' -d cmsc_course_embeddings.db courses | jq
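Each result comes back as a JSON object with an id, a similarity score and (because we used --store) the embedded content. To take just the top five matches and list their course ids, something like this works (-n limits the number of results):
llm similar -c 'artificial intelligence' -d cmsc_course_embeddings.db courses -n 5 | jq -r '.id'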
Here’s a SQLite database with ALL of the course descriptions, so that our class doesn’t burn through my API credits with everyone embedding the same data!
https://www.cs.umd.edu/class/fall2025/cmsc398z/files/course_embeddings.db (16 MB)
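You can fetch it from the command line with curl:
curl -O https://www.cs.umd.edu/class/fall2025/cmsc398z/files/course_embeddings.db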
Answering questions against those course descriptions¶
This time we’ll build a bash script:
llm '
Build me a bash script like this:
./umd-qa.sh "What courses discuss artificial intelligence?"
It should first run:
llm similar -c $question -d course_embeddings.db courses
Then it should pipe the output from that to:
llm -s "Answer the question: $question" -m gpt-5-mini
That last command should run so the output is visible as it runs.
' -x > umd-qa.sh
chmod +x umd-qa.sh
./umd-qa.sh "What do string templates look like?"
I got this.
Usage example: ./umd-qa.sh "What courses discuss artificial intelligence?"
Notes:
The script joins all provided arguments into the question (so you can pass a multiword question without extra quoting issues).
For the best streaming behavior, install coreutils (for stdbuf) or expect (for unbuffer). On some systems the script command is already available and will also work.
Running code that an LLM has generated without first reviewing it is generally a terrible idea!
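For reference, the core of the generated script should look roughly like this sketch (your generated version will differ, and may add the stdbuf/unbuffer handling mentioned in the notes above):
#!/usr/bin/env bash
# Rough sketch of umd-qa.sh -- review your own generated version before running it.
set -euo pipefail

# Join every argument into a single question string
question="$*"

# Retrieve the most relevant course descriptions from the embeddings collection
context=$(llm similar -c "$question" -d course_embeddings.db courses)

# Hand the retrieved context to the model; the answer streams as it runs
echo "$context" | llm -s "Answer the question: $question" -m gpt-5-mini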
If you want to port the above to Python you should consult the Working with collections section of LLM’s Python API documentation.
RAG does not have to use semantic search!¶
Many people associate RAG with embedding-based semantic search, but it’s not actually a requirement for the pattern.
A lot of the most successful RAG systems out there use traditional keyword search instead. Models are very good at taking a user’s question and turning it into one or more search queries.
Don’t get hung up on embeddings!
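As a rough sketch of the keyword flavour (assuming sqlite-utils is installed, reusing the umd-202508.db database from earlier, and with a keyword-extraction prompt that is purely illustrative): ask the model for a search term, run a plain substring search over the course descriptions, and hand the matches back to the model:
question="What courses discuss artificial intelligence?"
# Ask the model to turn the question into a single search keyword
keyword=$(llm "Reply with one lowercase keyword for searching course descriptions to answer: $question")
# Plain substring search over the descriptions -- no embeddings involved --
# then feed the matching courses back to the model to answer the question
sqlite-utils umd-202508.db \
  "select course_id, title, description from course_info where description like '%' || :kw || '%' limit 20" \
  -p kw "$keyword" \
  | llm -s "Answer the question: $question" -m gpt-5-mini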
RAG is dead?¶
Every time a new long-context model comes out, someone will declare the death of RAG.
I think classic RAG is dead for a different reason: it turns out arming an LLM with search tools is a much better way to achieve the same goal.
Which brings us to our next topic: Tool usage.