RAG Evaluation: Necessity and Challenge

This Blog is Recommended For

People who are developing applications using RAG (Retrieval Augmented Generation) and are interested in evaluation
Those interested in evaluating hallucinations of LLMs (Large Language Models) and RAG
Those interested in the behavior of Ragas (Retrieval augmented generation assessment; library for evaluating RAG)

Hello, my name is Hirokazu Suzuki, and I assist Beatrust with ML. In this three-part blog series, I will discuss the necessity of and approaches to evaluating RAG (Retrieval Augmented Generation), a particularly notable application of LLMs. I will also discuss experiments on the usefulness of Ragas, a specialized evaluation library for RAG.

In the experiments, I manually created the entire evaluation dataset and compared my intuitive scores with the scores from Ragas to evaluate its usefulness. This series covers the following:

RAG Evaluation: Necessity and Challenge (This Part)
RAG Evaluation: RAG Metrics and Calculation Methods in Ragas
RAG Evaluation: Assessing the Usefulness of Ragas

Through these experiments, I found that Ragas, which can separately evaluate the generation and retrieval parts, is suitable for RAG evaluation. However, the accuracy was not as high as expected.

This blog and experiments were conducted under the guidance of ML engineer Yongtae Hwang at Beatrust.

What is RAG?

In general, LLMs, including ChatGPT, do not know the latest information or internal company information that is not in their training data, so they tend to generate outdated or generic answers. One way to create a customised LLM application is to fine-tune the LLM for one's own use, but this is not only costly but may also degrade the precision of the well-balanced base model.

Therefore, there is a method called RAG (Retrieval Augmented Generation), which provides reference information to the LLM when it generates answers. This method is becoming mainstream for customised LLM applications due to its cost-effectiveness. Its applications are very broad, including customer support chatbots, internal document FAQs, and research based on large-scale data. Examples include McKinsey's Lilli, Moody's Research Assistant, and even governmental deployments like the UK Government's GOV.UK Chat.

RAG combines a search and retrieval technique with an LLM: relevant information is retrieved from an external database, and this information is used to generate answers. Figure 1-1 shows a typical example of a RAG pipeline.

Figure 1-1 consists of the following elements:

Database construction part
1. Extract sentences from documents
2. Split sentences into appropriate lengths
3. Vectorise the split sentences
4. Store the vectorised data in a database
Generation part
1. Vectorise the question
2. Retrieve information related to the vectorised question from the vector database
3. Input the related information and query into LLM
4. LLM generates an answer
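
As a rough illustration, the two parts above can be sketched in Python. This is a toy sketch under loud assumptions, not a production pipeline: `embed` is a bag-of-words count vector standing in for a real embedding model, the "vector database" is a plain list, and the function names (`build_index`, `retrieve`, `build_prompt`) are hypothetical. The final step, sending the prompt to an LLM, is left out because it depends on whichever client you use.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy 'embedding': a bag-of-words count vector (stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def build_index(documents, chunk_size=20):
    """Database construction: split each document, vectorise, and store."""
    index = []
    for doc in documents:
        words = doc.split()
        for start in range(0, len(words), chunk_size):
            chunk = " ".join(words[start:start + chunk_size])
            index.append((chunk, embed(chunk)))
    return index


def retrieve(index, question, k=2):
    """Generation steps 1-2: vectorise the question and fetch the top-k chunks."""
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


def build_prompt(question, contexts):
    """Generation step 3: combine the retrieved chunks and the question."""
    context_block = "\n".join(f"- {c}" for c in contexts)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context_block}\n\n"
        f"Question: {question}"
    )


documents = [
    "Employees receive twenty days of paid vacation per year.",
    "The office cafeteria opens at eight in the morning.",
]
index = build_index(documents)
question = "How many days of paid vacation do employees receive?"
prompt = build_prompt(question, retrieve(index, question, k=1))
# Step 4 would pass `prompt` to an LLM of your choice.
```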

Barriers to the Adoption of RAG

The sentences generated by LLMs are so fluent that they read as if they were written by a human. However, there is no guarantee that the generated sentences are accurate, and "they might answer incorrectly as if they are correct." Inappropriate answers from conventional chatbots were clearly wrong because the content was entirely different, but LLMs state even wrong content naturally, as if it were correct, which can lead users to mistake incorrect information for correct information.

This phenomenon is known as Hallucination and poses a significant concern for the adoption of LLMs, representing one of the major hurdles to overcome.

The Need for RAG Evaluation

To address the issue of hallucination and ensure RAG can withstand real-world operation, objective and quantitative evaluation of RAG responses is crucial. The ability to quantitatively assess the accuracy of RAG responses allows for model improvements based on these metrics.

While the evaluation of LLMs themselves is a focus of cutting-edge research, RAG evaluation has only recently started to be discussed in various blogs and surveys as of early 2024. This is because the combination of retrieval and generation technologies in RAG makes objective accuracy assessment challenging.

Currently, RAG improvements tend to be based on qualitative evaluations where humans visually inspect and assess results. Below are the main points to consider for improving RAG performance, which may be of interest to those concerned.

Points for RAG Improvement

Data Format and Quality
Just like supervised learning in traditional machine learning, RAG performance heavily depends on the underlying document data. If the document content is inaccurate or biased, the RAG answers will naturally be inaccurate. The format, extraction method, and preprocessing of document data also significantly impact performance.

Chunking
Chunking involves dividing text data read from raw documents into chunks. It is common to divide text into pre-set token counts like 300 or 500, but if the chunk size is too small, a chunk fails to reflect the document's context adequately. Conversely, too large a chunk size may include too much irrelevant context, so a delicate balance is required. The extent to which duplication is allowed is also a critical point. Duplication here refers to intentionally overlapping tokens in adjacent chunks, such as setting a 50-token overlap so that the first 50 tokens of a chunk match the last 50 tokens of the previous chunk. Although not often considered, chunking has a more significant impact on performance than embedding, according to a survey, and is an important point to consider.
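
To make the overlap idea concrete, here is a minimal sketch of fixed-size chunking with overlap. It assumes the text has already been tokenised into a list; whitespace splitting is used below only as a stand-in for a real model tokenizer, and `chunk_tokens` is a hypothetical helper name.

```python
def chunk_tokens(tokens, chunk_size, overlap):
    """Split a token list into chunks of `chunk_size` that share `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the text
    return chunks


# Whitespace tokens stand in for model tokens in this example.
text = "one two three four five six seven eight nine ten"
chunks = chunk_tokens(text.split(), chunk_size=4, overlap=2)
```

With real documents, the same function could be called with values like `chunk_size=300, overlap=50`; which values work best is exactly the kind of question that quantitative evaluation is meant to answer.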

Embedding Model
Embedding involves converting text into vector representations. In RAG, vector-transformed text data is used to search using the similarly vector-transformed input query, making the choice of embedding model extremely important. There are large-scale models like OpenAI text-embedding-ada-002 that are developed to be versatile, but if the target domain for RAG is narrow, domain-specific models may perform better. Since domain-specific models are often cheaper than large-scale models, carefully considering the choice of embedding model according to the application is advisable.

Vector Search
RAG typically employs vector search to retrieve information from databases. For instance, LangChain implements cosine similarity and MMR as search algorithms, and the two behave quite differently. Additionally, the number of retrieved chunks passed to the LLM to generate a response impacts RAG performance. Depending on the domain targeted by RAG, a hybrid retrieval method that also uses term-based search techniques like BM25 might be considered, so the strategy should be based on the domain and data characteristics.
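
The difference between plain cosine-similarity ranking and MMR (maximal marginal relevance) can be seen in a small sketch. This is a generic illustration of the two ideas written from scratch, not LangChain's implementation; the function names and the `lambda_mult` parameter name are assumptions made for this example.

```python
import math


def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k_cosine(query, vectors, k):
    """Return indices of the k vectors most similar to the query."""
    order = sorted(
        range(len(vectors)),
        key=lambda i: cosine(query, vectors[i]),
        reverse=True,
    )
    return order[:k]


def mmr(query, vectors, k, lambda_mult=0.5):
    """Greedy MMR: trade off relevance to the query against redundancy
    with the items that have already been selected."""
    selected = []
    candidates = list(range(len(vectors)))
    while candidates and len(selected) < k:
        def score(i):
            relevance = cosine(query, vectors[i])
            redundancy = max(
                (cosine(vectors[i], vectors[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


query = [1.0, 0.0]
vectors = [
    [1.0, 0.10],   # highly relevant
    [1.0, 0.12],   # near-duplicate of the first vector
    [0.6, -0.80],  # less relevant, but different from the first vector
]
by_similarity = top_k_cosine(query, vectors, k=2)
by_mmr = mmr(query, vectors, k=2)
```

In this toy data, similarity ranking picks the two near-duplicates, while MMR replaces the second one with the more distinct vector, which is why the two algorithms can feed quite different context to the LLM.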

LLM Model
Naturally, the LLM model is also a critical point. In particular, aspects such as response time and cost are crucial for assessing suitability for RAG design.

Model Architecture
It is well-known that simple vector searches do not yield high accuracy, and innovations in the entire RAG workflow are actively being pursued. This article suggests a workflow combining vector search with SQL data retrieval, which is expected to improve response speed and accuracy.

However, qualitative evaluations have their limits. Especially when humans conduct the reviews, there are inherent limits to the number of evaluations possible, and there is a tendency to focus on positive results due to bias. After releasing a RAG-based product or service, various requests from customers may arise, and having an established objective evaluation makes it possible to carry out appropriate improvements step by step.

Quantitative Evaluation of RAG

Quantitative evaluation of RAG can consider:

Human-based evaluation
Direct evaluation by LLMs (such as GPT-4)
RAG-specific evaluation (such as Ragas)

Let's discuss the characteristics, advantages, and disadvantages of each approach.

1. Human-based Evaluation

This method involves mobilising many people to manually check the accuracy of RAG responses.

Advantages
- It is feasible with sufficient funding. Since it involves human evaluation, it most reflects the actual user experience, directly linking to UX improvements for products and services.
Disadvantages
- Mobilising many people can be too costly, and it may be challenging to mobilise individuals with specific expert knowledge in certain fields. Additionally, the speed of human evaluation is significantly slower than machines, and factors like fatigue can lead to inconsistent evaluation standards.

2. Direct Evaluation by LLMs (GPT-4)

This method involves prompting an LLM like ChatGPT to directly evaluate RAG responses. Here, we consider using GPT-4 for direct evaluation with a prompt like the following:

You are a specialist in quantitatively evaluating responses. Please rate the accuracy of the following answer on a scale of 0 to 100.

# Question
<Insert Question>

# Answer
<Insert Answer>

# Reference Text Used in the Answer
<Insert Reference Text>

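As a sketch of how this prompt could be used programmatically, the code below fills in the template and extracts a score from the model's reply. The call to the LLM itself is deliberately omitted because it depends on the client library. `build_eval_prompt` and `parse_score` are hypothetical helper names, and the parser is deliberately naive: it assumes the reply leads with the score.

```python
import re

EVAL_TEMPLATE = """You are a specialist in quantitatively evaluating responses. Please rate the accuracy of the following answer on a scale of 0 to 100.

# Question
{question}

# Answer
{answer}

# Reference Text Used in the Answer
{reference}
"""


def build_eval_prompt(question, answer, reference):
    """Fill the evaluation template with one question/answer/reference triple."""
    return EVAL_TEMPLATE.format(
        question=question, answer=answer, reference=reference
    )


def parse_score(reply):
    """Return the first integer between 0 and 100 in the reply, or None.

    Naive by design: a reply such as "on a scale of 0 to 100, I give 70"
    would be parsed as 0, so real use needs a stricter output format.
    """
    for match in re.finditer(r"\d+", reply):
        value = int(match.group())
        if 0 <= value <= 100:
            return value
    return None


prompt = build_eval_prompt(
    question="How many days of paid vacation do employees receive?",
    answer="Employees receive twenty days.",
    reference="Employees receive twenty days of paid vacation per year.",
)
# `prompt` would be sent to the evaluating LLM; its reply goes to parse_score.
```

Asking the model to return only a number, or a structured format such as JSON, would make parsing more reliable, at the cost of the additional prompt tuning discussed below.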
Advantages
- GPT-4 can process large volumes automatically, and unlike humans, it does not suffer from fatigue or variability in evaluation standards.
Disadvantages
- Bias: GPT-4 tends to favour its own generated answers, known as self-reinforcement bias. My experience has shown that when comparing GPT-4 generated answers with ground truths, GPT-4 often claims that the ground truth is baseless and wrong, while its own answer is correct, which is nonsensical. Other biases include a preference for lengthy responses.
- Prompt Tuning: Direct evaluation by GPT involves sensitivity to prompts, and the cost of redesigning prompts for specific tasks cannot be ignored.
- Difficulty in Relative Evaluation: GPT-4 struggles with relative evaluation within a task. For instance, if the prompt is "Rate the appropriateness of this sentence from 0 to 100," GPT-4, given only that data, cannot base its evaluation on scores from other data, resulting in non-specific responses like always scoring 80 for accurate and 20 for inaccurate answers. It is difficult for GPT-4 to perform relative evaluations, such as scoring text B 60 because text A was 50.

Various methods have been developed to address these disadvantages. For example, SelfCheckGPT operates on the assumption that factual answers are more stable than fictional ones. After obtaining a response from GPT-4, it generates multiple samples, measuring the degree of agreement between the response and the samples. In the context of comparing answers, gpt-prompt-engineer searches for effective prompts by comparing two responses.
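
The sampling idea can be illustrated with a very rough sketch. This is not SelfCheckGPT's actual scoring: plain token overlap is used here as a crude stand-in of my own choosing for the stronger comparison methods the real tool relies on, and the sampled answers are hard-coded instead of being drawn from an LLM. All function names are hypothetical.

```python
import re


def tokens(text):
    """Lowercased word tokens as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def support(sentence, sample):
    """Fraction of the sentence's tokens that also appear in one sample."""
    sentence_tokens = tokens(sentence)
    if not sentence_tokens:
        return 0.0
    return len(sentence_tokens & tokens(sample)) / len(sentence_tokens)


def consistency_score(response, samples):
    """Average support of each response sentence across all samples (0 to 1).

    A low score means the samples disagree with the response, which is
    treated as a hint that the response may be hallucinated.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", response.strip()) if s]
    if not sentences or not samples:
        return 0.0
    per_sentence = [
        sum(support(s, sample) for sample in samples) / len(samples)
        for s in sentences
    ]
    return sum(per_sentence) / len(per_sentence)


# Hard-coded stand-ins for answers sampled repeatedly from the same LLM.
samples = [
    "The capital of France is Paris.",
    "Paris is the capital city of France.",
]
stable = consistency_score("Paris is the capital of France.", samples)
unstable = consistency_score("The capital of France is Lyon.", samples)
```

Token overlap cannot tell a paraphrase from a contradiction, so this only conveys the shape of the approach: generate several samples, compare, and flag low agreement.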

3. RAG-Specific Evaluation (Ragas etc.)

While we have discussed human and LLM-based evaluations, they both remain end-to-end evaluations and are not necessarily suitable for RAG, which combines both search and generation technologies. For instance, even if the generation part is perfect, poor search can still lead to hallucinations, and conversely, perfect search with poor generation cannot produce appropriate answers.

Therefore, RAG-specific evaluation methods such as Ragas and Anyscale's two-stage approach have been proposed. Both evaluate search and retrieval (blue part in Figure 1-2) and generation (orange part in Figure 1-2) separately.
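
To show why separate scores help, here is a toy sketch that scores retrieval and generation independently and reports which part looks weaker. These are not the Ragas metrics (those are covered in the next part); `retrieval_recall`, `groundedness`, `diagnose`, and the 0.5 threshold are illustrative assumptions, and token overlap is only a crude proxy for whether an answer is supported by its context.

```python
import re


def _tokens(text):
    """Lowercased word tokens as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieval_recall(retrieved_ids, relevant_ids):
    """Share of the known-relevant chunks that the retriever returned."""
    relevant = set(relevant_ids)
    if not relevant:
        return 1.0  # nothing to find, so nothing was missed
    return len(relevant & set(retrieved_ids)) / len(relevant)


def groundedness(answer, contexts):
    """Share of answer tokens that appear somewhere in the retrieved contexts."""
    answer_tokens = _tokens(answer)
    if not answer_tokens:
        return 0.0
    context_tokens = set()
    for context in contexts:
        context_tokens |= _tokens(context)
    return len(answer_tokens & context_tokens) / len(answer_tokens)


def diagnose(retrieval_score, generation_score, threshold=0.5):
    """Name the first stage whose score falls below the threshold."""
    if retrieval_score < threshold:
        return "retrieval"
    if generation_score < threshold:
        return "generation"
    return "ok"


# Case: the retriever missed the relevant chunk entirely.
r_score = retrieval_recall(retrieved_ids=["d3", "d7"], relevant_ids=["d1"])
g_score = groundedness(
    "Vacation is twenty days",
    ["Employees receive twenty days of paid vacation"],
)
verdict = diagnose(r_score, g_score)
```

An end-to-end score would only say that this answer is unreliable; the split scores point at the retriever as the part to fix first.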

This blog focuses particularly on Ragas, discussing its advantages and points to consider.

Advantages
- Allows the retrieval part and the generation part to be evaluated separately, making it easier to identify which one needs improvement. While further details will be explained in subsequent parts, the evaluation algorithms are broken into stages, allowing for numerical assessment.

However, there are still unexplored aspects, which were investigated through experiments in this blog:

Is Ragas less affected by the precision of the LLM itself?
Is there less variability in accuracy across different languages?
Is the evaluation algorithm so refined that it outperforms an LLM-based evaluation?

Having introduced the necessity of RAG evaluation, the next part will cover: