
Introduction

There are many ways to improve the performance of your RAG application. In this article, we'll explore several different approaches.

Prompt Engineering

One of the best-known ways to get better responses from an LLM is prompt engineering. You can apply it through techniques such as Few-Shot Prompting, Self-Consistency, and Chain of Thought (CoT), for example.

Let's analyze each of these techniques in the following prompts.

1. Few-Shot Prompting

This technique aims to produce better responses by giving the model examples: you provide a few demonstrations so the model can generate its answer based on what you ask for and on the patterns in the given examples.

Input
A "whatpu" is a small, furry animal native to Tanzania. An example of a sentence that uses the word whatpu is:
We were traveling in Africa and we saw these very cute whatpus.
 
To do a "farduddle" means to jump up and down really fast. An example of a sentence that uses the word farduddle is:
Output
When we won the game, we all started to farduddle in celebration.
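
To make this concrete, here is a minimal sketch of how you could send that few-shot prompt programmatically. It assumes Amazon Bedrock's Converse API (to stay consistent with the AWS stack used later in this article); the model ID is a placeholder you would swap for whatever model your account has enabled.

# Minimal sketch of few-shot prompting via Amazon Bedrock's Converse API.
# Assumption: boto3 is configured with Bedrock access; the model ID is a placeholder.
import boto3

bedrock_runtime = boto3.client("bedrock-runtime")

FEW_SHOT_PROMPT = """\
A "whatpu" is a small, furry animal native to Tanzania. An example of a sentence that uses the word whatpu is:
We were traveling in Africa and we saw these very cute whatpus.

To do a "farduddle" means to jump up and down really fast. An example of a sentence that uses the word farduddle is:"""

response = bedrock_runtime.converse(
    modelId="anthropic.claude-3-haiku-20240307-v1:0",  # placeholder model ID
    messages=[{"role": "user", "content": [{"text": FEW_SHOT_PROMPT}]}],
    inferenceConfig={"maxTokens": 200, "temperature": 0.2},
)

print(response["output"]["message"]["content"][0]["text"])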

2. Self-Consistency

Perhaps one of the more advanced techniques out there for prompt engineering is self-consistency. Proposed by Wang et al. (2022), self-consistency aims "to replace the naive greedy decoding used in chain-of-thought prompting". The idea is to sample multiple, diverse reasoning paths through few-shot CoT, and use the generations to select the most consistent answer. This helps to boost the performance of CoT prompting on tasks involving arithmetic and commonsense reasoning.

Input
Q: There are 15 trees in the grove. Grove workers will plant trees in the grove today. After they are done,
there will be 21 trees. How many trees did the grove workers plant today?
A: We start with 15 trees. Later we have 21 trees. The difference must be the number of trees they planted.
So, they must have planted 21 - 15 = 6 trees. The answer is 6.

Q: If there are 3 cars in the parking lot and 2 more cars arrive, how many cars are in the parking lot?
A: There are 3 cars in the parking lot already. 2 more arrive. Now there are 3 + 2 = 5 cars. The answer is 5.

Q: Leah had 32 chocolates and her sister had 42. If they ate 35, how many pieces do they have left in total?
A: Leah had 32 chocolates and Leah’s sister had 42. That means there were originally 32 + 42 = 74
chocolates. 35 have been eaten. So in total they still have 74 - 35 = 39 chocolates. The answer is 39.

Q: Jason had 20 lollipops. He gave Denny some lollipops. Now Jason has 12 lollipops. How many lollipops
did Jason give to Denny?
A: Jason had 20 lollipops. Since he only has 12 now, he must have given the rest to Denny. The number of
lollipops he has given to Denny must have been 20 - 12 = 8 lollipops. The answer is 8.

Q: Shawn has five toys. For Christmas, he got two toys each from his mom and dad. How many toys does
he have now?
A: He has 5 toys. He got 2 from mom, so after that he has 5 + 2 = 7 toys. Then he got 2 more from dad, so
in total he has 7 + 2 = 9 toys. The answer is 9.

Q: There were nine computers in the server room. Five more computers were installed each day, from
monday to thursday. How many computers are now in the server room?
A: There are 4 days from monday to thursday. 5 computers were added each day. That means in total 4 * 5 =
20 computers were added. There were 9 computers in the beginning, so now there are 9 + 20 = 29 computers.
The answer is 29.

Q: Michael had 58 golf balls. On tuesday, he lost 23 golf balls. On wednesday, he lost 2 more. How many
golf balls did he have at the end of wednesday?
A: Michael initially had 58 balls. He lost 23 on Tuesday, so after that he has 58 - 23 = 35 balls. On
Wednesday he lost 2 more so now he has 35 - 2 = 33 balls. The answer is 33.

Q: Olivia has $23. She bought five bagels for $3 each. How much money does she have left?
A: She bought 5 bagels for $3 each. This means she spent $15. She has $8 left.

Q: When I was 6 my sister was half my age. Now I’m 70 how old is my sister?
A:
Output
When the narrator was 6, his sister was half his age, which is 3. Now that the narrator is 70, his sister would be 70 - 3 = 67 years old. The answer is 67.
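
In code, self-consistency boils down to sampling the same CoT prompt several times with a non-zero temperature and keeping the most frequent final answer. The sketch below reuses the Bedrock Converse client from the few-shot example; the answer-extraction step is a deliberate simplification that just grabs the last number in each generation.

# Sketch of self-consistency: sample several reasoning paths, then majority-vote the answer.
# Assumes the same boto3 "bedrock-runtime" client as the few-shot example above.
import re
from collections import Counter

def sample_answers(client, model_id, cot_prompt, n_samples=5):
    answers = []
    for _ in range(n_samples):
        response = client.converse(
            modelId=model_id,
            messages=[{"role": "user", "content": [{"text": cot_prompt}]}],
            inferenceConfig={"maxTokens": 300, "temperature": 0.7},  # encourage diverse paths
        )
        text = response["output"]["message"]["content"][0]["text"]
        numbers = re.findall(r"-?\d+", text)
        if numbers:
            answers.append(numbers[-1])  # treat the last number as the final answer
    return answers

def self_consistent_answer(client, model_id, cot_prompt, n_samples=5):
    answers = sample_answers(client, model_id, cot_prompt, n_samples)
    # The most common answer across the sampled reasoning paths wins.
    return Counter(answers).most_common(1)[0][0] if answers else None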

3. Chain of Thought

Chain-of-thought (CoT) prompting enables complex reasoning capabilities through intermediate reasoning steps. You can combine it with few-shot prompting to get better results on more complex tasks that require reasoning before responding.

Input
The odd numbers in this group add up to an even number: 4, 8, 9, 15, 12, 2, 1.
A: Adding all the odd numbers (9, 15, 1) gives 25. The answer is False.

The odd numbers in this group add up to an even number: 17,  10, 19, 4, 8, 12, 24.
A: Adding all the odd numbers (17, 19) gives 36. The answer is True.

The odd numbers in this group add up to an even number: 16,  11, 14, 4, 8, 13, 24.
A: Adding all the odd numbers (11, 13) gives 24. The answer is True.

The odd numbers in this group add up to an even number: 17,  9, 10, 12, 13, 4, 2.
A: Adding all the odd numbers (17, 9, 13) gives 39. The answer is False.

The odd numbers in this group add up to an even number: 15, 32, 5, 13, 82, 7, 1. 
A:
Output
Adding all the odd numbers (15, 5, 13, 7, 1) gives 41. The answer is False.
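
Since the exemplars are plain text, a few-shot CoT prompt like the one above can also be assembled programmatically. The short sketch below reuses two exemplars from the prompt above in a hypothetical build_cot_prompt helper; the key point is that each exemplar shows the reasoning before the final answer.

# Sketch: building a few-shot CoT prompt from exemplars that include their reasoning.
COT_EXEMPLARS = [
    ("The odd numbers in this group add up to an even number: 4, 8, 9, 15, 12, 2, 1.",
     "Adding all the odd numbers (9, 15, 1) gives 25. The answer is False."),
    ("The odd numbers in this group add up to an even number: 17, 10, 19, 4, 8, 12, 24.",
     "Adding all the odd numbers (17, 19) gives 36. The answer is True."),
]

def build_cot_prompt(question: str) -> str:
    # Each exemplar pairs the task with its reasoning, so the model imitates the steps.
    blocks = [f"{q}\nA: {a}" for q, a in COT_EXEMPLARS]
    blocks.append(f"{question}\nA:")
    return "\n\n".join(blocks)

print(build_cot_prompt(
    "The odd numbers in this group add up to an even number: 15, 32, 5, 13, 82, 7, 1."
))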

Filtered Search

Filtered search mainly aims to reduce the retrieval time of your RAG application. A RAG application basically has two models: one that generates the answer and one that searches for the relevant docs in the vector store.

This can be done in multiple ways; the approach I will cover in this article focuses on AWS RAG applications using Bedrock and OpenSearch, but it can be applied on other cloud platforms as well. In Amazon OpenSearch it's possible to create a metadata file for each file in your vector database. These metadata files will help you filter which documents the LLM needs to fetch for a given user question.

The metadata files contain the attributes that you want the AI to use to perform the search. The file format is the following:

{
  "metadataAttributes": {
    "${attribute1}": "${value1}",
    "${attribute2}": "${value2}",
    ...
  }
}
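
For example, a hypothetical quarterly report could ship with a metadata file like the one below (the attribute names are invented for illustration). With Bedrock Knowledge Bases, this file usually sits next to the source document in S3 and is named after it with a .metadata.json suffix (e.g., report.pdf.metadata.json).

{
  "metadataAttributes": {
    "department": "finance",
    "year": 2023
  }
}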

Each file in your vector database needs its own metadata file containing the attributes the AI will use to fetch relevant docs. This technique can noticeably decrease response time and even improve the model's accuracy, since the model no longer has to search the entire file base.

The following example code shows how, given a question, the model searches ONLY for the docs that match the filter attributes defined in the code:

import pprint
import boto3

# Bedrock Agent Runtime client used to query the Knowledge Base
bedrock_agent_client = boto3.client("bedrock-agent-runtime")
pp = pprint.PrettyPrinter(indent=2)

# Example filter values and Knowledge Base ID (replace with your own)
x = "value1"
y = 100
kb_id = "YOUR_KB_ID"

def retrieve(query, kbId, numberOfResults=5):
    # Retrieve only the docs whose metadata matches BOTH filter conditions
    return bedrock_agent_client.retrieve(
        retrievalQuery={"text": query},
        knowledgeBaseId=kbId,
        retrievalConfiguration={
            "vectorSearchConfiguration": {
                "numberOfResults": numberOfResults,
                "filter": {
                    "andAll": [
                        {"equals": {"key": "attribute1", "value": x}},
                        {"lessThan": {"key": "attribute2", "value": y}},
                    ]
                },
            }
        },
    )

query = "E.G. question"
response = retrieve(query, kb_id, 2)
retrievalResults = response["retrievalResults"]
pp.pprint(retrievalResults)

So, the code above receives a question and, based on it, searches for relevant docs, but only the docs whose attribute1 and attribute2 have the specified values; docs with different values for those attributes are not considered in the search.

Score Extractor

The main framework for building RAG applications is definitely LangChain. Whatever project you are building, at some point you will probably use it. And a nice thing that LangChain provides is the retriever score.

What is this?

The retriever score is basically a score the embedding model uses to rank each of the relevant docs retrieved in a search. Let's say I ask an LLM a question and it answers me. To build this answer, the model had to search and read the content of 3 different docs. A similarity score is computed for each doc, and the more relevant the doc, the higher the score.

Usually, good answers have a score of at least 0.45. When the score is lower than 0.45, the AI will probably hallucinate while generating the answer. To avoid that, you can create a rule that returns a generic answer whenever the score falls below 0.45, as sketched below.
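
Here is a minimal sketch of that rule using LangChain. It assumes a FAISS vector store that has already been populated with OpenAI embeddings; similarity_search_with_relevance_scores returns scores normalized to [0, 1] on most vector store integrations, and the 0.45 threshold is just the rule of thumb from above, which you should tune for your own stack.

# Sketch: guard against low-similarity retrievals before generating an answer.
# Assumptions: a populated FAISS index on disk and OpenAI embeddings; both are placeholders.
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings

SCORE_THRESHOLD = 0.45
GENERIC_ANSWER = "Sorry, I couldn't find enough relevant information to answer that."

vectorstore = FAISS.load_local(
    "my_index", OpenAIEmbeddings(), allow_dangerous_deserialization=True
)  # placeholder index path and embedding model

def retrieve_with_guardrail(question: str, k: int = 3):
    # Scores are normalized to [0, 1]; higher means more relevant.
    docs_and_scores = vectorstore.similarity_search_with_relevance_scores(question, k=k)
    relevant = [(doc, score) for doc, score in docs_and_scores if score >= SCORE_THRESHOLD]
    if not relevant:
        # Below the threshold, fall back to a generic answer instead of risking a hallucination.
        return GENERIC_ANSWER, []
    # Otherwise, pass the surviving docs to the generation model as context.
    return None, relevant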