
Building a RAG system to chat with your data: Points to consider

Sep 2, 2024
|
José Luis Marina

It’s not as easy as it seems

At taniwa we work on many AI and machine-learning projects. What matters most is always the data and the care you put into the details. In other words, technology is rarely the main problem; the real challenge is how you organize yourselves, how you take care of the data, how you obtain it, and how you present it.

Building a demo of a RAG (Retrieval Augmented Generation) system with company data is straightforward; it takes 5 minutes:

  • Install LangChain or LlamaIndex.
  • Chop your documents into paragraphs (chunks) and convert them into vectors.
  • Put the vectors into a vector index database.
  • Prepare a prompt and start chatting.
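The steps above can be sketched end to end in a few lines. This is a dependency-free toy, with a bag-of-words counter standing in for a real embedding model (an assumption for illustration; LangChain or LlamaIndex would handle each step with real models and a real vector store):

```python
import math
from collections import Counter

def chunk(text, size=40):
    """Chop a document into overlapping character windows."""
    step = size // 2
    return [text[i:i + size] for i in range(0, len(text), step)]

def embed(text):
    """Stand-in embedding: lowercase word counts."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# "Vector index": a list of (chunk, vector) pairs.
docs = ["Refunds are processed within 14 days of the request.",
        "Shipping to Europe takes three to five business days."]
index = [(c, embed(c)) for d in docs for c in chunk(d)]

def retrieve(query, k=2):
    q = embed(query)
    return [c for c, v in sorted(index, key=lambda p: -cosine(q, p[1]))[:k]]

context = retrieve("How long do refunds take?")
prompt = "Answer using only this context:\n" + "\n".join(context)
```

Every weakness discussed below is visible here: naive character chunking, a crude embedding, no reranking, and a bare-bones prompt.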

But in the real world, that demo is useless. Let’s see why:

The data and its format

You always have to understand the data you are going to index. In the 5-minute example above, we don’t know whether we are dealing with plain text, PDFs, tables, images, audio, video, Python code, or something else.

  • If they are images or videos, we will probably pass them through a model that converts them into text.
  • The same goes for tables and audio files.
  • If it is code, the way to index it is different.
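A common way to handle this is a small dispatch step that routes every source to the converter that turns it into indexable text. The converter functions here are placeholder stubs (assumptions for illustration); in practice they would call a vision model, a speech-to-text model, a code parser, and so on:

```python
# Placeholder converters: stubs standing in for real models/parsers.
def caption_image(path):    return f"[caption of {path}]"
def transcribe_audio(path): return f"[transcript of {path}]"
def parse_code(path):       return f"[symbols of {path}]"
def read_text(path):        return f"[contents of {path}]"

def to_indexable_text(path):
    """Route each file to the converter that turns it into text."""
    if path.endswith((".png", ".jpg", ".mp4")):
        return caption_image(path)       # images/video -> text
    if path.endswith((".mp3", ".wav")):
        return transcribe_audio(path)    # audio -> text
    if path.endswith(".py"):
        return parse_code(path)          # code: its own pipeline
    return read_text(path)
```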

The chunks matter

The “chunking” process is not trivial. Chopping a text into paragraphs is not the same as chopping it into sentences or into words. In the example, the documents are chopped into overlapping windows of text so that the system can find the information you need, but that approach leaves a lot of room for improvement.

  • If you have documents with sections, titles, subtitles, and paragraphs, it is important to use that information for chunking.
  • If you have code, use the functions, comments, variables, packages, or libraries.
  • It rarely makes sense to chunk a table inside a document; it usually makes sense to index it whole.
  • In some cases, you don’t need to chunk anything at all. Think of tweets, for example.
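Using document structure for chunking can be as simple as splitting on headings so that each chunk keeps its section title as context. A minimal sketch, assuming markdown-style `#` headings:

```python
def chunk_by_sections(text):
    """Split a document at headings; each chunk keeps its own title."""
    chunks, current = [], []
    for line in text.splitlines():
        if line.startswith("#") and current:
            chunks.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current))
    return chunks

doc = "# Refunds\nWithin 14 days.\n# Shipping\nThree to five days."
sections = chunk_by_sections(doc)
# Each element now starts with its own heading.
```

Keeping the title inside the chunk means the retriever can match on the section name even when the body uses different wording.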

Preparing the prompt

Before the RAG system answers users’ questions, those questions can be preprocessed to improve the quality of the answers.

  • You can use an LLM that rewrites the question to make it clearer or more precise for the domain (legal documents, for example).
  • You can have an LLM chat with the user to clarify the question.
  • With HyDE (Hypothetical Document Embeddings), you ask an LLM to generate a hypothetical document answering the query; the embeddings of these “fake” documents are then used to search for neighbors with the dense retriever.
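The HyDE idea fits in a few lines: embed a hypothetical answer instead of the raw question. Here `fake_llm` is a stub standing in for a real model call (an assumption for illustration):

```python
def fake_llm(prompt):
    # Stub: a real implementation would call an LLM here.
    return "Refunds are normally processed within two weeks."

def hyde_query(question, llm=fake_llm):
    """Return the text whose embedding drives the retrieval step."""
    hypothetical = llm(f"Write a short passage answering: {question}")
    return hypothetical  # embed this, then search for its neighbors

q = hyde_query("How long do refunds take?")
```

The intuition: a hypothetical answer tends to live closer, in embedding space, to the real answer documents than the question itself does.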

Improving the part that retrieves the chunks

If the chunks are not good, the answer will not be good.

  • Hybrid Search, which combines search by nearby vectors and keyword search (ElasticSearch for example), can be a good option.
  • Filter the chunks by title or section first.
  • Try different embedding models to see which one works best (BERT, GPT, etc.).
  • You can fine-tune your own embedding model on your data.
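One common way to combine the two rankings of a hybrid search is reciprocal rank fusion (RRF). The two input rankings are hardcoded here as an assumption for illustration; in practice one would come from the vector store and the other from a keyword engine such as ElasticSearch:

```python
def rrf(rankings, k=60):
    """Fuse several ranked lists of document ids into one ranking."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            # Each list contributes 1/(k + position); high ranks count more.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits  = ["d3", "d1", "d2"]   # nearest-neighbor order
keyword_hits = ["d1", "d4", "d3"]   # keyword-match order
fused = rrf([vector_hits, keyword_hits])
```

Documents that appear high in both lists (here `d1`) rise to the top, without having to normalize the two engines’ incomparable scores.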

Improving the answers

The answers can be post-processed and corrected before they are given to the user.

  • Reranking: You can use a classification model to reorder the results. Free models are available on HuggingFace.
  • Use an LLM to eliminate chunks that contribute nothing to the answer.
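Both ideas reduce to the same shape: score each retrieved chunk against the query, reorder, and drop the low scorers. The word-overlap scorer below is a stand-in for a real cross-encoder (e.g. one of the free reranking models on HuggingFace):

```python
def score(query, chunk):
    """Toy relevance score: fraction of query words present in the chunk."""
    q, c = set(query.lower().split()), set(chunk.lower().split())
    return len(q & c) / len(q)

def rerank(query, chunks, threshold=0.2):
    """Reorder chunks by relevance and drop those below the threshold."""
    scored = [(score(query, c), c) for c in chunks]
    return [c for s, c in sorted(scored, reverse=True) if s >= threshold]

chunks = ["Refunds take 14 days.", "Our office dog is named Rex."]
kept = rerank("how long do refunds take", chunks)
```

With a real cross-encoder, only `score` changes; the reorder-and-filter logic stays the same.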

Test and improve

Prepare a set of queries and expected answers to test the system whenever it evolves.

  • You can use a classification model to evaluate the quality of the answers.
  • You can use a text generation model to score the answers.
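A minimal regression-test harness for this can be a fixed set of queries with expected key phrases, re-run after every change. The `answer` function is a stub (an assumption for illustration); the real pipeline would be plugged in instead:

```python
def answer(query):
    # Stub: replace with a call to the real RAG pipeline.
    return "Refunds are processed within 14 days."

test_set = [
    ("How long do refunds take?", "14 days"),
    ("When is a refund processed?", "14 days"),
]

def evaluate(test_set, answer_fn=answer):
    """Fraction of test queries whose answer contains the expected phrase."""
    hits = sum(expected.lower() in answer_fn(q).lower()
               for q, expected in test_set)
    return hits / len(test_set)

accuracy = evaluate(test_set)
```

Phrase matching is crude; the LLM-based and classifier-based scoring mentioned above would replace the `in` check, but the harness around it stays the same.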

Architecture

Let’s not forget the architecture:

  • How the answers will be served: REST API, GraphQL, WebSockets, etc.
  • How the data will be stored.
  • Security
  • Scalability

Conclusion

Without care, there is no quality.


Photo by Dainis Graveris

  • AI
  • GenAI
  • RAG
  • Data Coach