In RAG (retrieval-augmented generation, or "question and answer") applications, the high-level goal is:
Given a question, generate an answer that adheres to knowledge in some corpus
However, this is easier said than done. Data is typically collected at various steps of the RAG process to help evaluate which steps are performing poorly or unexpectedly. This data can help answer the following questions:
What question was asked?
Which documents / chunks (ids) were retrieved?
What was the text of those retrieved documents / chunks?
From the retrieved documents, what was the top-ranked document and its id?
What is the expected answer?
What is the expected document id and text that contains the answer to the question?
What was the generated answer?
Having data that answers some or all of these questions makes it possible to run evaluations that produce metrics highlighting which parts of the RAG system are behaving unexpectedly.
The short example below demonstrates what a dataframe with rich contextual data looks like and how to use dbnl.eval to generate relevant metrics.
import dbnl
import os
import pandas as pd
from openai import OpenAI
from dbnl.eval.llm import OpenAILLMClient
from dbnl.eval.embedding_clients import OpenAIEmbeddingClient
from dbnl.eval import evaluate
# 1. create client to power LLM-as-judge and embedding metrics [optional]
base_oai_client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
eval_llm_client = OpenAILLMClient.from_existing_client(base_oai_client, llm_model="gpt-3.5-turbo-0125")
eval_embd_client = OpenAIEmbeddingClient.from_existing_client(base_oai_client, embedding_model="text-embedding-ada-002")
eval_df = pd.DataFrame(
    [
        {
            "question_text": "Is the protein Cathepsin secreted?",
            "top_k_retrieved_doc_texts": ["Some irrelevant document that the rag system retrieved"],
            "top_k_retrieved_doc_ids": ["4123"],
            "top_retrieved_doc_text": "Some irrelevant document that the rag system retrieved",
            "gt_reference_doc_id": "1099",
            "gt_reference_doc_text": "The protein Cathepsin is known to be secreted",
            "ground_truth_answer": "Yes, Cathepsin is a secreted protein",
            "generated_answer": "I have no relevant knowledge",
        },
        {
            "question_text": "Is the protein Cathepsin secreted?",
            "top_k_retrieved_doc_texts": [
                "Some irrelevant document that the rag system retrieved",
                "Many proteins are secreted such as hormones, enzymes, toxins",
                "The protein Cathepsin is known to be secreted",
            ],
            "top_k_retrieved_doc_ids": ["4123", "21", "1099"],
            "top_retrieved_doc_text": "Some irrelevant document that the rag system retrieved",
            "gt_reference_doc_id": "1099",
            "gt_reference_doc_text": "The protein Cathepsin is known to be secreted",
            "ground_truth_answer": "Yes, Cathepsin is a secreted protein",
            "generated_answer": "Many proteins are known to be secreted",
        },
        {
            "question_text": "Is the protein Cathepsin secreted?",
            "top_k_retrieved_doc_texts": [
                "The protein Cathepsin is known to be secreted",
                "Some irrelevant document that the rag system retrieved",
            ],
            "top_k_retrieved_doc_ids": ["1099", "4123"],
            "top_retrieved_doc_text": "The protein Cathepsin is known to be secreted",
            "gt_reference_doc_id": "1099",
            "gt_reference_doc_text": "The protein Cathepsin is known to be secreted",
            "ground_truth_answer": "Yes, Cathepsin is a secreted protein",
            "generated_answer": "Yes, cathepsin is a secreted protein",
        },
    ]
    * 4  # repeat the three example rows to simulate a larger run
)
# 2. get text metrics appropriate for RAG / QA systems
qa_text_metrics = dbnl.eval.metrics.question_and_answer_metrics(
    prediction="generated_answer", input="question_text", target="ground_truth_answer",
    context="top_k_retrieved_doc_texts", top_retrieved_document_text="top_retrieved_doc_text",
    retrieved_document_ids="top_k_retrieved_doc_ids", ground_truth_document_id="gt_reference_doc_id",
    eval_llm_client=eval_llm_client, eval_embedding_client=eval_embd_client,
)
# 3. run qa text metrics
aug_eval_df = evaluate(eval_df, qa_text_metrics)
# 4. publish to DBNL
dbnl.login()
project = dbnl.get_or_create_project(name="RAG_demo")
cols = dbnl.experimental.get_column_schemas_from_dataframe(aug_eval_df)
run_config = dbnl.create_run_config(project=project, columns=cols)
run = dbnl.create_run(project=project, run_config=run_config)
dbnl.report_results(run=run, data=aug_eval_df)
dbnl.close_run(run=run)
You can inspect a subset of the aug_eval_df rows and examine, for example, the metrics related to retrieval and answer similarity.
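The snippet below is one way to do that. It assumes the retrieval and similarity metrics land in columns named mrr and answer_similarity, matching the values discussed next; the exact set of generated columns depends on which inputs were provided to the metric helper.
# look at the first occurrence of each of the three distinct examples;
# guard the column selection in case the generated metric names differ
cols_of_interest = ["question_text", "generated_answer", "mrr", "answer_similarity"]
print(aug_eval_df.loc[[0, 1, 2], [c for c in cols_of_interest if c in aug_eval_df.columns]])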
We can see that the first result (idx = 0) represents a complete failure of the RAG system: the relevant document was not retrieved (mrr = 0.0), and the generated answer is very dissimilar to the expected answer (answer_similarity = 1).
The second result (idx = 1) represents a better response from the RAG system: the relevant document was retrieved, but ranked lower (mrr = 0.33333), and the answer is somewhat similar to the expected answer (answer_similarity = 3).
The final result (idx = 2) represents a strong response from the RAG system: the relevant document was retrieved and ranked first (mrr = 1.0), and the generated answer is very similar to the expected answer (answer_similarity = 5).
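For intuition, the mrr values above follow directly from where the ground-truth document id appears in the list of retrieved ids (the reciprocal of its rank, or 0.0 if it was never retrieved). The function below is a simplified illustration of that calculation for a single row, not the dbnl implementation; answer_similarity, by contrast, is an LLM-as-judge score and has no closed-form equivalent.
def reciprocal_rank(retrieved_ids: list[str], gt_doc_id: str) -> float:
    # 1 / (1-based position of the ground-truth document), or 0.0 if absent
    if gt_doc_id not in retrieved_ids:
        return 0.0
    return 1.0 / (retrieved_ids.index(gt_doc_id) + 1)

print(reciprocal_rank(["4123"], "1099"))                # 0.0     (idx = 0)
print(reciprocal_rank(["4123", "21", "1099"], "1099"))  # 0.33333 (idx = 1)
print(reciprocal_rank(["1099", "4123"], "1099"))        # 1.0     (idx = 2)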
The signature for question_and_answer_metrics() highlights its adaptability: the optional arguments can be omitted, and the helper returns only the metrics that can be computed from the information provided (see the sketch after the signature).
def question_and_answer_metrics(
    prediction: str,
    target: Optional[str] = None,
    input: Optional[str] = None,
    context: Optional[str] = None,
    ground_truth_document_id: Optional[str] = None,
    retrieved_document_ids: Optional[str] = None,
    ground_truth_document_text: Optional[str] = None,
    top_retrieved_document_text: Optional[str] = None,
    eval_llm_client: Optional[LLMClient] = None,
    eval_embedding_client: Optional[EmbeddingClient] = None,
) -> list[Metric]:
    """
    Returns a set of metrics relevant for a question and answer task.

    :param prediction: prediction column name (i.e. generated answer)
    :param target: target column name (i.e. expected answer)
    :param input: input column name (i.e. question)
    :param context: context column name (i.e. document or set of documents retrieved)
    :param ground_truth_document_id: ground_truth_document_id containing the information in the target
    :param retrieved_document_ids: retrieved_document_ids containing the full context
    :param ground_truth_document_text: text containing the information in the target
        (ideal is for this to be the top retrieved document)
    :param top_retrieved_document_text: text of the top retrieved document
    :param eval_llm_client: eval_llm_client
    :param eval_embedding_client: eval_embedding_client
    :return: list of metrics
    """