1 of 10

Eval Module

Many generative AI applications focus on text generation. It can be challenging to create metrics for insights into expected performance when dealing with unstructured text.

dbnl.eval is a special module designed for evaluating unstructured text. This module currently includes:

Adaptive metric sets for generic text and RAG applications
12+ simple statistical local library powered text metrics
15+ LLM-as-judge and embedding powered text metrics
Support for user-defined custom LLM-as-judge metrics
LLM-as-judge metrics compatible with OpenAI, Azure OpenAI

Building dbnl tests on these evaluation metrics can then drive rich insights into an AI application's stability and performance.

Quick Start

To use dbnl.eval, you will need to install the extra 'eval' package as described in these instructions.

Create a client to power LLM-as-judge text metrics [optional]
Generate a list of metrics suitable for comparing text_A to reference text_B
Use dbnl.eval to evaluate to compute the list metrics.
Publish the augmented dataframe and new metric quantities to DBNL

import dbnl
import os
import pandas as pd
from openai import OpenAI
from dbnl.eval.llm import OpenAILLMClient
from dbnl.eval import evaluate

# 1. create client to power LLM-as-judge metrics
base_oai_client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
oai_client = OpenAILLMClient.from_existing_client(base_oai_client, llm_model="gpt-3.5-turbo-0125")

eval_df = pd.DataFrame(
    [
        { "prediction":"France has no capital",
          "ground_truth": "The capital of France is Paris",},
        { "prediction":"The capital of France is Toronto",
          "ground_truth": "The capital of France is Paris",},
        { "prediction":"Paris is the capital",
          "ground_truth": "The capital of France is Paris",},
    ] * 4
)

# 2. get text metrics that use target (ground_truth) and LLM-as-judge metrics
text_metrics = dbnl.eval.metrics.text_metrics(
    prediction="prediction", target="ground_truth", eval_llm_client=oai_client
)
# 3. run text metrics that use target (ground_truth) and LLM-as-judge metrics
aug_eval_df = evaluate(eval_df, text_metrics)

# 4. publish to DBNL
dbnl.login(api_token=os.environ["DBNL_API_TOKEN"])
project = dbnl.get_or_create_project(name="DEAL_testing")
cols = dbnl.experimental.get_column_schemas_from_dataframe(aug_eval_df)
run_config = dbnl.create_run_config(project=project, columns=cols)
run = dbnl.create_run(project=project, run_config=run_config)
dbnl.report_results(run=run, data=aug_eval_df)
dbnl.close_run(run=run)

You can inspect a subset of the the aug_eval_df rows and for example, one of the columns created by one of the metrics in the text_metrics list : llm_text_similarity_v0

idx

prediction

ground_truth

llm_text_similarity_v0__prediction__ground_truth

France has no capital

The capital of France is Paris

The capital of France is Toronto

The capital of France is Paris

Paris is the capital

The capital of France is Paris

The values of llm_text_similarity_v0qualitatively match our expectations on semantic similarity between the prediction and ground_truth

The call to evaluate() takes a dataframe and metric list as input and returns a dataframe with extra columns. Each new column holds the value of a metric computation for that row

def evaluate(df: pd.DataFrame, metrics: Sequence[Metric], inplace: bool = False) -> pd.DataFrame:
    """
    Evaluates a set of metrics on a dataframe, returning an augmented dataframe.

    :param df: input dataframe
    :param metrics: metrics to compute
    :param inplace: whether to modify the input dataframe in place
    :return: input dataframe augmented with metrics
    """

The column names of the metrics in the returned dataframe include the metric name and the columns that were used in that metrics computation

For example the metric named llm_text_similarity_v0 becomes llm_text_similarity_v0__prediction__ground_truth because it takes as input both the column named prediction and the column named ground_truth

Application Metric Sets

The metric set helpers return an adaptive list of metrics, relevant to the application type

`text_metrics()`

Basic metrics for generic text comparison and monitoring

token_count
word_count
flesch_kincaid_grade
automated_readability_index
bleu
levenshtein
rouge1
rouge2
rougeL
rougeLsum
llm_text_toxicity_v0
llm_sentiment_assessment_v0
llm_reading_complexity_v0
llm_grammar_accuracy_v0
inner_product
llm_text_similarity_v0

`question_and_answer_metrics()`

Basic metrics for RAG / question answering

llm_accuracy_v0
llm_completeness_v0
answer_similarity_v0
faithfulness_v0
mrr
context_hit

The metric set helpers are adaptive in that :

The metrics returned encode which columns of the dataframe are input to the metric computation e.g., rougeL_prediction__ground_truth is the rougeL metric run with both the column named prediction and the column named ground_truth as input
The metrics returned support any additional optional column info and LLM-as-judge or embedding model clients. If any of this optional info is not provided, the metric set will exclude any metrics that depend on that information

def text_metrics(
    prediction: str,
    target: Optional[str] = None,
    eval_llm_client: Optional[LLMClient] = None,
    eval_embedding_client: Optional[EmbeddingClient] = None,
) -> list[Metric]:
    """
    Returns a set of metrics relevant for a generic text application

    :param prediction: prediction column name (i.e. generated text)
    :param target: target column name (i.e. expected text)
    :return: list of metrics
    """

See the How-To section for concrete examples of adaptive text_metrics() usage

See the RAG example for question_and_answer_metrics() usage

How-To / FAQ

What if I do not have an LLM service to run LLM-as-judge metrics?

No problem, just don’t include an eval_llm_client or an eval_embedding_client argument in the call(s) to the evaluation helpers. The helpers will automatically exclude any metrics that depend on them.

# BEFORE : default text metrics including those requiring target (ground_truth) and LLM-as-judge
text_metrics = dbnl.eval.metrics.text_metrics(
    prediction="prediction", target="ground_truth", eval_llm_client=oai_client
)

# AFTER : remove the eval_llm_client to exclude LLM-as-judge metrics
text_metrics = dbnl.eval.metrics.text_metrics(
    prediction="prediction", target="ground_truth"
)

aug_eval_df = evaluate(eval_df, text_metrics)

What if I do not have ground-truth available?

No problem. You can simply remove the target argument from the helper. The metric set helper will automatically exclude any metrics that depend on the target column being specified.

# BEFORE : default text metrics, including those requiring target (ground_truth) and LLM-as-judge
text_metrics = dbnl.eval.metrics.text_metrics(
    prediction="prediction", target="ground_truth", eval_llm_client=oai_client
)

# AFTER : remove the target to remove metrics that depend on that value being specified
text_metrics = dbnl.eval.metrics.text_metrics(
    prediction="prediction", eval_llm_client=oai_client
)

aug_eval_df = evaluate(eval_df, text_metrics)

There is an additional helper that can generate a list of generic metrics appropriate for “monitoring” unstructured text columns : text_monitor_metrics(). Simply provide a list of text column names and optionally an eval_llm_client for LLM-as-judge metrics.

# get text metrics for each column in list
monitor_metrics = dbnl.eval.metrics.text_monitor_metrics(
  ["prediction", "input"], eval_llm_client=oai_client
)

aug_eval_df = evaluate(eval_df, monitor_metrics)

How do I create a custom LLM-as-judge metric?

You can write your own LLM-as-judge metric that uses your custom prompt. The example below defines a custom LLM-as-judge metric and runs it on an example dataframe.

import dbnl
import os
import pandas as pd
from openai import OpenAI
from dbnl.eval.llm import OpenAILLMClient
from dbnl.eval import evaluate
from dbnl.eval.metrics.mlflow import MLFlowGenAIFromPromptEvaluationMetric
from dbnl.eval.metrics.metric import Metric
from dbnl.eval.llm.client import LLMClient

# 1. create client to power LLM-as-judge metrics
base_oai_client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
oai_client = OpenAILLMClient.from_existing_client(base_oai_client, llm_model="gpt-3.5-turbo-0125")

eval_df = pd.DataFrame(
    [
        { "prediction":"France has no capital",
          "ground_truth": "The capital of France is Paris",},
        { "prediction":"The capital of France is Toronto",
          "ground_truth": "The capital of France is Paris",},
        { "prediction":"Paris is the capital",
          "ground_truth": "The capital of France is Paris",},
    ] * 4
)

# 2. define a custom LLM-as-judge metric
def custom_text_similarity(prediction: str, target: str, eval_llm_client: LLMClient) -> Metric:
    custom_prompt_v0 = """
      Given the generated text : {prediction}, score the semantic similarity to the reference text : {target}. 

      Rate the semantic similarity from 1 (completely different meaning and facts between the generated and reference texts) to 5 (nearly the exact same semantic meaning and facts present in the generated and reference texts).

      Example output, make certain that 'score:' and 'justification:' text is present in output:
      score: 4
      justification: XYZ
    """
    
    return MLFlowGenAIFromPromptEvaluationMetric(
        name="custom_text_similarity",
        judge_prompt=custom_prompt_v0,
        prediction=prediction,
        target=target,
        eval_llm_client=eval_llm_client,
        version="v0",
    )

# 3. instantiate the custom LLM-as-judge metric
c_metric = custom_text_similarity(
  prediction='prediction', target='ground_truth', eval_llm_client=oai_client
)
# 4. run only the custom LLM-as-judge metric
aug_eval_df = evaluate(eval_df, [c_metric])

You can also write a metric that includes only the prediction column specified and reference only {prediction} in the custom prompt. An example is below:

def custom_text_simplicity(prediction: str, target: str, eval_llm_client: LLMClient) -> Metric:
    custom_prompt_v0 = """
      Given the generated text : {prediction}, score the text from 1 to 5 based on whether it is written in simple, easy to understand english 

      Rate the generated text from 5 (completely simple english, very commonly used words, easy to explain vocabulary) to 1 (complex english, uncommon words, difficult to explain vocabulary).

      Example output, make certain that 'score:' and 'justification:' text is present in output:
      score: 4
      justification: XYZ
    """
    
    return MLFlowGenAIFromPromptEvaluationMetric(
        name="custom_text_simplicity",
        judge_prompt=custom_prompt_v0,
        prediction=prediction,
        target=target,
        eval_llm_client=eval_llm_client,
        version="v0",
    )

LLM-as-judge and Embedding Metrics

A common strategy for evaluating unstructured text application is to use other LLMs and text embedding models to drive metrics of interest.

Supported LLM and model services

The LLM-as-judge text metrics in dbnl.eval support OpenAI, Azure OpenAI and any other third-party LLM / embedding model provider that is compatible with the OpenAI python client. Specifically, third-party endpoints should (mostly) adhere to the schema of:

v1/chat/completions endpoint for LLMs
v1/embeddings endpoint for embedding models

The following examples show how to initialize an llm_eval_client and an eval_embedding_client under different providers.

OpenAI

from openai import OpenAI
from dbnl.eval.llm import OpenAILLMClient
from dbnl.eval.embedding_clients import OpenAIEmbeddingClient

# create client for LLM-as-judge metrics
base_oai_client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
eval_llm_client = OpenAILLMClient.from_existing_client(
    base_oai_client, llm_model="gpt-3.5-turbo-0125"
)

embd_client = OpenAIEmbeddingClient.from_existing_client(
    base_oai_client, embedding_model="text-embedding-ada-002"
)

Azure OpenAI

from openai import AzureOpenAI
from dbnl.eval.llm import AzureOpenAILLMClient
from dbnl.eval.embedding_clients import AzureOpenAIEmbeddingClient

base_azure_oai_client = AzureOpenAI(
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version=os.environ["OPENAI_API_VERSION"], # eg 2023-12-01-preview
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"] # eg https://resource-name.openai.azure.com
)
eval_llm_client = AzureOpenAILLMClient.from_existing_client(
    base_azure_oai_client, llm_model="gpt-35-turbo-16k"
)
embd_client = AzureOpenAIEmbeddingClient.from_existing_client(
    base_azure_oai_client, embedding_model="text-embedding-ada-002"
)

TogetherAI (or other OpenAI compatible service / endpoints)

from openai import OpenAI
from dbnl.eval.llm import OpenAILLMClient
base_oai_client = OpenAI(
    api_key=os.environ["TOGETHERAI_API_KEY"],
    base_url="https://api.together.xyz/v1",
)

eval_llm_client = OpenAILLMClient.from_existing_client(
    base_oai_client, llm_model='meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo'
)

Missing Metric Values

It is possible for some of the LLM-as-judge metrics to occasionally return values that are unable to be parsed. These metrics values will surface as None

Distributional is able to accept dataframes including None values. The platform will intelligently filter them when applicable.

Throughput and Rate Limits

LLM service providers often impose request rate limits and token throughput caps. Some example errors that one might encounter are shown below:

{'code': '429', 'message': 'Requests to the Embeddings_Create Operation under 
  Azure OpenAI API version XXXX have exceeded call rate limit of your current 
  OpenAI pricing tier. Please retry after 86400 seconds. 
  Please go here: https://aka.ms/oai/quotaincrease if you would 
  like to further increase the default rate limit.'}

{'message': 'You have been rate limited. Your rate limit is YYY queries per
minute. Please navigate to https://www.together.ai/forms/rate-limit-increase 
to request a rate limit increase.', 'type': 'credit_limit', 
'param': None, 'code': None}

{'message': 'Rate limit reached for gpt-4 in organization XXXX on 
tokens per min (TPM): Limit WWWWW, Used YYYY, Requested ZZZZ. 
Please try again in 1.866s. Visit https://platform.openai.com/account/rate-limits 
to learn more.', 'type': 'tokens', 'param': None, 'code': 'rate_limit_exceeded'}

In the event you experience these errors, please work with your LLM service provider to adjust your limits. Additionally, feel free to reach out to Distributional support with the issue you are seeing.

RAG / Question Answer Example

In RAG (retrieval-augmented generation or "question and answer") applications, the high level goal is:

Given a question, generate an answer that adheres to knowledge in some corpus

However, this is easier said than done. Data is often collected at various steps in the RAG process to help evaluate which steps might be performing poorly or not as expected. This data can help understand the following:

What question was asked?
Which documents / chunks (ids) were retrieved?
What was the text of those retrieved documents / chunks?
From the retrieved documents, what was the top-ranked document and its id?
What is the expected answer?
What is the expected document id and text that contains the answer to the question?
What was the generated answer?

Having data that answers some or all of these questions allows for evaluations to run, producing metrics that can highlight what part of the RAG system is performing in unexpected ways.

The short example below demonstrates what a dataframe with rich contextual data would look like for and how to use dbnl.eval to generate relevant metrics

import dbnl
import os
import pandas as pd
from openai import OpenAI
from dbnl.eval.llm import OpenAILLMClient
from dbnl.eval.embedding_clients import OpenAIEmbeddingClient
from dbnl.eval import evaluate

# 1. create client to power LLM-as-judge and embedding metrics [optional]
base_oai_client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
eval_llm_client = OpenAILLMClient.from_existing_client(base_oai_client, llm_model="gpt-3.5-turbo-0125")
eval_embd_client = OpenAIEmbeddingClient.from_existing_client(base_oai_client, embedding_model="text-embedding-ada-002")

eval_df = pd.DataFrame(
    [
        {
         "question_text": "Is the protein Cathepsin secreted?",
         "top_k_retrieved_doc_texts": ["Some irrelevant document that the rag system retrieved"],    
         "top_k_retrieved_doc_ids":   ["4123"],
         "top_retrieved_doc_text":    "Some irrelevant document that the rag system retrieved",
         "gt_reference_doc_id": "1099",
         "gt_reference_doc_text": "The protein Cathepsin is known to be secreted",
         "ground_truth_answer": "Yes, Cathepsin is a secreted protein",
         "generated_answer":"I have no relevant knowledge",},
        {
         "question_text": "Is the protein Cathepsin secreted?",
         "top_k_retrieved_doc_texts": ["Some irrelevant document that the rag system retrieved", 
                                       "Many proteins are secreted such as hormones, enzymes, toxins",
                                       "The protein Cathepsin is known to be secreted"],    
         "top_k_retrieved_doc_ids":   ["4123","21","1099"],
         "top_retrieved_doc_text":    "Some irrelevant document that the rag system retrieved",
         "gt_reference_doc_id": "1099",
         "gt_reference_doc_text": "The protein Capilin is known to be secreted",
         "ground_truth_answer": "Yes, Cathepsin is a secreted protein",
         "generated_answer":"Many proteins are known to be secreted",},
        {
         "question_text": "Is the protein Cathepsin secreted?",
         "top_k_retrieved_doc_texts": ["The protein Cathepsin is known to be secreted", 
                                       "Some irrelevant document that the rag system retrieved"],    
         "top_k_retrieved_doc_ids":   ["1099","4123"],
         "top_retrieved_doc_text":    "The protein Cathepsin is known to be secreted",
         "gt_reference_doc_id": "1099",
         "gt_reference_doc_text": "The protein Cathepsin is known to be secreted",
         "ground_truth_answer": "Yes, Cathepsin is a secreted protein",
         "generated_answer":"Yes, cathepsin is a secreted protein",},
    ] * 4
)

# 2. get text metrics appropriate for RAG / QA systems
qa_text_metrics = dbnl.eval.metrics.question_and_answer_metrics(
  prediction="generated_answer", input="question_text", target="ground_truth_answer",
  context="top_k_retrieved_doc_texts", top_retrieved_document_text="top_retrieved_doc_text",
  retrieved_document_ids="top_k_retrieved_doc_ids", ground_truth_document_id="gt_reference_doc_id",
  eval_llm_client=eval_llm_client, eval_embedding_client=eval_embd_client
)
# 3. run qa text metrics
aug_eval_df = evaluate(eval_df, qa_text_metrics)

# 4. publish to DBNL
dbnl.login()
project = dbnl.get_or_create_project(name="RAG_demo")
cols = dbnl.experimental.get_column_schemas_from_dataframe(aug_eval_df)
run_config = dbnl.create_run_config(project=project, columns=cols)
run = dbnl.create_run(project=project, run_config=run_config)
dbnl.report_results(run=run, data=aug_eval_df)
dbnl.close_run(run=run)

You can inspect a subset of the the aug_eval_df rows and examine, for example, the metrics related to retrieval and answer similarity

We can see the first result (idx = 0) represents a complete failure of the RAG system. The relevant documents were not retrieved (mrr = 0.0) and the generated answer is very dissimilar from the expected answer (answer_similarity = 1).

The second result (idx = 1) represents a better response from the RAG system. The relevant document was retrieved, but ranked lower (mrr = 0.33333) and the answer is somewhat similar to the expected answer (answer_similarity = 3)

The final result (idx = 2) represents a strong response from the RAG system. The relevant document was retrieved and top ranked (mrr = 1.0) and the generated answer is very similar to the expected answer (answer_similarity = 5)

The signature for question_and_answer_metrics() highlights its adaptability. Again, the optional arguments are not required and the helper will intelligently return only the metrics that depend on the info that is provided.

def question_and_answer_metrics(
    prediction: str,
    target: Optional[str] = None,
    input: Optional[str] = None,
    context: Optional[str] = None,
    ground_truth_document_id: Optional[str] = None,
    retrieved_document_ids: Optional[str] = None,
    ground_truth_document_text: Optional[str] = None,
    top_retrieved_document_text: Optional[str] = None,
    eval_llm_client: Optional[LLMClient] = None,
    eval_embedding_client: Optional[EmbeddingClient] = None,
) -> list[Metric]:
    """
    Returns a set of metrics relevant for a question and answer task.

    :param prediction: prediction column name (i.e. generated answer)
    :param target: target column name (i.e. expected answer)
    :param input: input column name (i.e. question)
    :param context: context column name (i.e. document or set of documents retrieved)
    :param ground_truth_document_id: ground_truth_document_id containing the information in the target
    :param retrieved_document_ids: retrieved_document_ids containing the full context
    :param ground_truth_document_text: text containing the information in the target 
                                       (ideal is for this to be the top retrieved document)
    :param top_retrieved_document_text: text of the top retrieved document
    :param eval_llm_client: eval_llm_client
    :param eval_embedding_client: eval_embedding_client
    :return: list of metrics
    """

Eval Module Functions

Index of functions

eval
eval.metrics

eval

Functions in the dbnl.eval module.

dbnl.eval.evaluate(df: DataFrame, metrics: Sequence[Metric], inplace: bool = False) → DataFrame

Evaluates a set of metrics on a dataframe, returning an augmented dataframe.

Parameters:
- df – input dataframe
- metrics – metrics to compute
- inplace – whether to modify the input dataframe in place
Returns: input dataframe augmented with metrics

dbnl.eval.get_column_schemas_from_dataframe_and_metrics(df: DataFrame, metrics: list[Metric]) → list[ColumnSchema]

Get the run config column schemas for a dataframe that was augmented with a list of metrics.

Parameters:
- df – Dataframe to get column schemas from
- metrics – list of metrics added to the dataframe
Returns: list of columns schemas for dataframe and metrics

dbnl.eval.get_column_schemas_from_metrics(metrics: list[Metric]) -> list[ColumnSchema]

Get the run config column schemas from a list of metrics.

Parameters: metrics – list of metrics to get column schemas from
Returns: list of column schemas for metrics

eval.metrics

Classes and methods in dbnl.eval.metrics.

class dbnl.eval.metrics.Metric

column_schema() → ColumnSchema

Returns the column schema for the metric to be used in a run config.

Returns: _description_
Return type: ColumnSchema

description() → str | None

Returns the description of the metric.

Returns: Description of the metric.

abstract evaluate(df: pd.DataFrame) → pd.Series[Any]

Evaluates the metric over the provided dataframe.

Parameters: df – Input data from which to compute metric.
Returns: Metric values.

abstract expression() → str

Returns the expression representing the metric (e.g. rouge1(prediction, target)).

Returns: Metric expression.

greater_is_better() → bool | None

If true, larger values are assumed to be directionally better than smaller once. If false, smaller values are assumged to be directionally better than larger one. If None, assumes nothing.

Returns: True if greater is better, False if smaller is better, otherwise None.

abstract metric() → str

Returns: Metric name (e.g. rouge1).

abstract name() → str

Returns the fully qualified name of the metric (e.g. rouge1__prediction__target).

Returns: Metric name.

abstract type() → Type

Returns the type of the metric (e.g. float)

Returns: Metric type.

class dbnl.eval.metrics.RougeScoreType(value, 
                                       names=None, *, 
                                       module=None, 
                                       qualname=None, 
                                       type=None, 
                                       start=1, 
                                       boundary=None)

dbnl.eval.metrics.answer_quality_llm_accuracy(input: str, context: str, prediction: str, eval_llm_client: LLMClient) → Metric

Computes the accuracy of the answer by evaluating the accuracy score of the answer using a language model.

This metric is generated by an LLM using a specific prompt named llm_accuracy available in dbnl.eval.metrics.prompts.

Parameters:
- input – input column name
- context – context column name
- prediction – prediction column name
- eval_llm_client – eval_llm_client
Returns: accuracy metric

dbnl.eval.metrics.answer_quality_llm_answer_correctness(input: str, prediction: str, target: str, eval_llm_client: LLMClient) → Metric

Returns the answer correctness metric.

This metric is generated by an LLM using a specific prompt named llm_answer_correctness available in dbnl.eval.metrics.prompts.

Parameters:
- input – input column name
- prediction – prediction column name
- target – target column name
- eval_llm_client – eval LLM client
Returns: answer correctness metric

dbnl.eval.metrics.answer_quality_llm_answer_similarity(input: str, prediction: str, target: str, eval_llm_client: LLMClient) → Metric

Returns answer similarity metric.

This metric is generated by an LLM using a specific prompt named llm_answer_similarity available in dbnl.eval.metrics.prompts.

Parameters:
- input – input column name
- prediction – prediction column name
- target – target column name
- eval_llm_client – eval_llm_client
Returns: answer similarity metric

dbnl.eval.metrics.answer_quality_llm_coherence(prediction: str, eval_llm_client: LLMClient) → Metric

Computes the coherence of the answer by evaluating the coherence score of the answer using a language model.

This metric is generated by an LLM using a specific prompt named llm_coherence available in dbnl.eval.metrics.prompts.

Parameters:
- prediction – prediction column name
- eval_llm_client – eval_llm_client
Returns: coherence metric

dbnl.eval.metrics.answer_quality_llm_commital(prediction: str, eval_llm_client: LLMClient) → Metric

Computes the commital of the answer by evaluating the commital score of the answer using a language model.

This metric is generated by an LLM using a specific prompt named llm_commital available in dbnl.eval.metrics.prompts.

Parameters:
- prediction – prediction column name
- eval_llm_client – eval_llm_client
Returns: commital metric

dbnl.eval.metrics.answer_quality_llm_completeness(input: str, prediction: str, eval_llm_client: LLMClient) → Metrics

Computes the completeness of the answer by evaluating the completeness score of the answer using a language model.

This metric is generated by an LLM using a specific prompt named llm_completeness available in dbnl.eval.metrics.prompts.

Parameters:
- input – input column name
- prediction – prediction column
- eval_llm_client – eval_llm_client
Returns: completeness metric

dbnl.eval.metrics.answer_quality_llm_contextual_relevance(input: str, context: str, eval_llm_client: LLMClient) → Metric

Computes the contextual relevance of the answer by evaluating the contextual relevance score of the answer using a language model.

This metric is generated by an LLM using a specific prompt named llm_contextual_relevance available in dbnl.eval.metrics.prompts.

Parameters:
- input – input column name
- context – context column name
- eval_llm_client – eval_llm_client
Returns: contextual relevance metric

dbnl.eval.metrics.answer_quality_llm_faithfulness(input: str, context: str, prediction: str, eval_llm_client: LLMClient) → Metric

Returns the faithfulness metric.

This metric is generated by an LLM using a specific prompt named llm_faithfulness available in dbnl.eval.metrics.prompts.

Parameters:
- input – input column name
- context – context column name
- prediction – prediction column name
- eval_llm_client – eval_llm_client
Returns: faithfulness metric

dbnl.eval.metrics.answer_quality_llm_grammar_accuracy(prediction: str, eval_llm_client: LLMClient) → Metric

Computes the grammar accuracy of the answer by evaluating the grammar accuracy score of the answer using a language model.

This metric is generated by an LLM using a specific prompt named llm_grammar_accuracy available in dbnl.eval.metrics.prompts.

Parameters:
- prediction – prediction column name
- eval_llm_client – eval_llm_client
Returns: grammar accuracy metric

dbnl.eval.metrics.answer_quality_llm_metrics(input: str | None, prediction: str, context: str | None, target: str | None, eval_llm_client: LLMClient) → list[Metric]

Returns a set of metrics which evaluate the quality of the generated answer. This does not include metrics that require a ground truth.

Parameters:
- input – input column name (i.e. question)
- prediction – prediction column name (i.e. generated answer)
- context – context column name (i.e. document or set of documents retrieved)
- eval_llm_client – eval_llm_client
Returns: list of metrics

dbnl.eval.metrics.answer_quality_llm_originality(prediction: str, eval_llm_client: LLMClient) → Metric

Computes the originality of the answer by evaluating the originality score of the answer using a language model.

This metric is generated by an LLM using a specific prompt named llm_originality available in dbnl.eval.metrics.prompts.

Parameters:
- prediction – prediction column name
- eval_llm_client – eval_llm_client
Returns: originality metric

dbnl.eval.metrics.answer_quality_llm_relevance(input: str, context: str, prediction: str, eval_llm_client: LLMClient) → Metric

Returns relevance metric with context.

This metric is generated by an LLM using a specific prompt named llm_relevance available in dbnl.eval.metrics.prompts.

Parameters:
- input – input column name
- context – context column name
- prediction – prediction column name
- eval_llm_client – eval_llm_client
Returns: answer relevance metric with context

dbnl.eval.metrics.answer_viability_llm_metrics(prediction: str, eval_llm_client: LLMClient) → list[Metric]

Returns a list of metrics relevant for a question and answer task.

Parameters:
- prediction – prediction column name (i.e. generated answer)
- eval_llm_client – eval_llm_client
Returns: list of metrics

dbnl.eval.metrics.answer_viability_llm_reading_complexity(prediction: str, eval_llm_client: LLMClient) → Metric

Computes the reading complexity of the answer by evaluating the reading complexity score of the answer using a language model.

This metric is generated by an LLM using a specific prompt named llm_reading_complexity available in dbnl.eval.metrics.prompts.

Parameters:
- prediction – prediction column name
- eval_llm_client – eval_llm_client
Returns: reading complexity metric

dbnl.eval.metrics.answer_viability_llm_sentiment_assessment(prediction: str, eval_llm_client: LLMClient) → Metric

Computes the sentiment of the answer by evaluating the sentiment assessment score of the answer using a language model.

This metric is generated by an LLM using a specific prompt named llm_sentiment_assessment available in dbnl.eval.metrics.prompts.

Parameters:
- prediction – prediction column name
- eval_llm_client – eval_llm_client
Returns: sentiment assessment metric

dbnl.eval.metrics.answer_viability_llm_text_fluency(prediction: str, eval_llm_client: LLMClient) → Metric

Computes the text fluency of the answer by evaluating the perplexity of the answer using a language model.

This metric is generated by an LLM using a specific prompt named llm_text_fluency available in dbnl.eval.metrics.prompts.

Parameters:
- prediction – prediction column name
- eval_llm_client – eval_llm_client
Returns: text fluency metric

dbnl.eval.metrics.answer_viability_llm_text_toxicity(prediction: str, eval_llm_client: LLMClient) → Metric

Computes the toxicity of the answer by evaluating the toxicity score of the answer using a language model.

This metric is generated by an LLM using a specific prompt named llm_text_toxicity available in dbnl.eval.metrics.prompts.

Parameters:
- prediction – prediction column name
- eval_llm_client – eval_llm_client
Returns: toxicity metric

dbnl.eval.metrics.automated_readability_index(text_col_name: str) → Metric

Returns the Automated Readability Index metric for the text_col_name column.

Calculates the Automated Readability Index (ARI) for a given text. ARI is a readability metric that estimates the U.S. school grade level necessary to understand the text, based on the number of characters per word and words per sentence.

Parameters: text_col_name – text column name
Returns: automated_readability_index metric:

dbnl.eval.metrics.bleu(prediction: str, target: str) → Metric

Returns the bleu metric between the prediction and target columns.

The BLEU score is a metric for evaluating a generated sentence to a reference sentence. The BLEU score is a number between 0 and 1, where 1 means that the generated sentence is identical to the reference sentence.

Parameters:
- prediction – prediction column name
- target – target column name
Returns: bleu metric

dbnl.eval.metrics.character_count(text_col_name: str) → Metric

Returns the character count metric for the text_col_name column.

Parameters: text_col_name – text column name
Returns: character_count metric

dbnl.eval.metrics.context_hit(ground_truth_document_id: str, retrieved_document_ids: str) → Metric

Returns the context hit metric.

This boolean-valued metric is used to evaluate whether the ground truth document is present in the list of retrieved documents. The context hit metric is 1 if the ground truth document is present in the list of retrieved documents, and 0 otherwise.

Parameters:
- ground_truth_document_id – ground_truth_document_id column name
- retrieved_document_ids – retrieved_document_ids column name
Returns: context hit metric

dbnl.eval.metrics.count_metrics(text_col_name: str) → list[Metric]

Returns a set of metrics relevant for a question and answer task.

Parameters: text_col_name – text column name
Returns: list of metrics

dbnl.eval.metrics.flesch_kincaid_grade(text_col_name: str) → Metric

Returns the Flesch-Kincaid Grade metric for the text_col_name column.

Calculates the Flesch-Kincaid Grade Level for a given text. The Flesch-Kincaid Grade Level is a readability metric that estimates the U.S. school grade level required to understand the text. It is based on the average number of syllables per word and words per sentence.

Parameters: text_col_name – text column name
Returns: flesch_kincaid_grade metric

dbnl.eval.metrics.ground_truth_non_llm_answer_metrics(prediction: str, target: str) → list[Metric]

Returns a set of metrics relevant for a question and answer task.

Parameters:
- prediction – prediction column name (i.e. generated answer)
- target – target column name (i.e. expected answer)
Returns: list of metrics

dbnl.eval.metrics.ground_truth_non_llm_retrieval_metrics(ground_truth_document_id: str, retrieved_document_ids: str) → list[Metric]

Returns a set of metrics relevant for a question and answer task.

Parameters:
- ground_truth_document_id – ground_truth_document_id column name
- retrieved_document_ids – retrieved_document_ids column name
Returns: list of metrics

dbnl.eval.metrics.inner_product_retrieval(ground_truth_document_text: str, top_retrieved_document_text: str, eval_embedding_client: EmbeddingClient) → Metric

Returns the inner product metric between the ground_truth_document_text and top_retrieved_document_text columns.

This metric is used to evaluate the similarity between the ground truth document and the top retrieved document using the inner product of their embeddings. The embedding client is used to retrieve the embeddings for the ground truth document and the top retrieved document. An embedding is a high-dimensional vector representation of a string of text.

Parameters:
- ground_truth_document_text – ground_truth_document_text column name
- top_retrieved_document_text – top_retrieved_document_text column name
- embedding_client – embedding client
Returns: inner product metric

dbnl.eval.metrics.inner_product_target_prediction(prediction: str, target: str, eval_embedding_client: EmbeddingClient) → Metric

Returns the inner product metric between the prediction and target columns.

This metric is used to evaluate the similarity between the prediction and target columns using the inner product of their embeddings. The embedding client is used to retrieve the embeddings for the prediction and target columns. An embedding is a high-dimensional vector representation of a string of text.

Parameters:
- prediction – prediction column name
- target – target column name
- embedding_client – embedding client
Returns: inner product metric

dbnl.eval.metrics.levenshtein(prediction: str, target: str) → Metric

Returns the levenshtein metric between the prediction and target columns.

The Levenshtein distance is a metric for evaluating the similarity between two strings. The Levenshtein distance is an integer value, where 0 means that the two strings are identical, and a higher value returns the number of edits required to transform one string into the other.

Parameters:
- prediction – prediction column name
- target – target column name
Returns: levenshtein metric

dbnl.eval.metrics.mrr(ground_truth_document_id: str, retrieved_document_ids: str) → Metric

Returns the mean reciprocal rank (MRR) metric.

This metric is used to evaluate the quality of a ranked list of documents. The MRR score is a number between 0 and 1, where 1 means that the ground truth document is ranked first in the list. The MRR score is calculated by taking the reciprocal of the rank of the first relevant document in the list.

Parameters:
- ground_truth_document_id – ground_truth_document_id column name
- retrieved_document_ids – retrieved_document_ids column name
Returns: mrr metric

dbnl.eval.metrics.non_llm_non_ground_truth_metrics(prediction: str) → list[Metric]

Returns a set of metrics relevant for a question and answer task.

Parameters: prediction – prediction column name (i.e. generated answer)
Returns: list of metrics

dbnl.eval.metrics.question_and_answer_metrics(prediction: str, target: str | None = None, input: str | None = None, context: str | None = None, ground_truth_document_id: str | None = None, retrieved_document_ids: str | None = None, ground_truth_document_text: str | None = None, top_retrieved_document_text: str | None = None, eval_llm_client: LLMClient | None = None, eval_embedding_client: EmbeddingClient | None = None) → list[Metric]

Returns a set of metrics relevant for a question and answer task.

Parameters:
- prediction – prediction column name (i.e. generated answer)
- target – target column name (i.e. expected answer)
- input – input column name (i.e. question)
- context – context column name (i.e. document or set of documents retrieved)
- ground_truth_document_id – ground_truth_document_id containing the information in the target
- retrieved_document_ids – retrieved_document_ids containing the full context
- ground_truth_document_text – text containing the information in the target (ideal is for this to be the top retrieved document)
- top_retrieved_document_text – text of the top retrieved document
- eval_llm_client – eval_llm_client
- eval_embedding_client – eval_embedding_client
Returns: list of metrics

dbnl.eval.metrics.question_and_answer_metrics_extended(prediction: str, target: str | None = None, input: str | None = None, context: str | None = None, ground_truth_document_id: str | None = None, retrieved_document_ids: str | None = None, ground_truth_document_text: str | None = None, top_retrieved_document_text: str | None = None, eval_llm_client: LLMClient | None = None, eval_embedding_client: EmbeddingClient | None = None) → list[Metric]

Returns a set of all metrics relevant for a question and answer task.

Parameters:
- prediction – prediction column name (i.e. generated answer)
- target – target column name (i.e. expected answer)
- input – input column name (i.e. question)
- context – context column name (i.e. document or set of documents retrieved)
- ground_truth_document_id – ground_truth_document_id containing the information in the target
- retrieved_document_ids – retrieved_document_ids containing the full context
- ground_truth_document_text – text containing the information in the target (ideal is for this to be the top retrieved document)
- top_retrieved_document_text – text of the top retrieved document
- eval_llm_client – eval_llm_client
- eval_embedding_client – eval_embedding_client
Returns: list of metrics

dbnl.eval.metrics.text_metrics(prediction: str, target: str | None = None, eval_llm_client: LLMClient | None = None, eval_embedding_client: EmbeddingClient | None = None) → list[Metric]

Returns a set metrics relevant for generic text applications

Parameters:
- prediction – prediction column name (i.e. generated answer)
- target – target column name (i.e. expected answer)
- eval_llm_client – eval_llm_client
- eval_embedding_client – eval_embedding_client
Returns: list of metrics

dbnl.eval.metrics.rouge1(prediction: str, target: str, score_type: RougeScoreType = RougeScoreType.FMEASURE) → Metric

Returns the rouge1 metric between the prediction and target columns.

ROUGE-1 is a recall-oriented metric that calculates the overlap of unigrams (individual words) between the predicted/generated summary and the reference summary. It measures how many single words from the reference summary appear in the predicted summary. ROUGE-1 focuses on basic word-level similarity and is used to evaluate the content coverage.

Parameters:
- prediction – prediction column name
- target – target column name
Returns: rouge1 metric

dbnl.eval.metrics.rouge2(prediction: str, target: str, score_type: RougeScoreType = RougeScoreType.FMEASURE) → Metric

Returns the rouge2 metric between the prediction and target columns.

ROUGE-2 is a recall-oriented metric that calculates the overlap of bigrams (pairs of words) between the predicted/generated summary and the reference summary. It measures how many pairs of words from the reference summary appear in the predicted summary. ROUGE-2 focuses on word-level similarity and is used to evaluate the content coverage.

Parameters:
- prediction – prediction column name
- target – target column name
Returns: rouge2 metric

dbnl.eval.metrics.rougeL(prediction: str, target: str, score_type: RougeScoreType = RougeScoreType.FMEASURE) → Metric

Returns the rougeL metric between the prediction and target columns.

ROUGE-L is a recall-oriented metric based on the Longest Common Subsequence (LCS) between the reference and generated summaries. It measures how well the generated summary captures the longest sequences of words that appear in the same order in the reference summary. This metric accounts for sentence-level structure and coherence.

Parameters:
- prediction – prediction column name
- target – target column name
Returns: rougeL metric

dbnl.eval.metrics.rougeLsum(prediction: str, target: str, score_type: RougeScoreType = RougeScoreType.FMEASURE) → Metric

Returns the rougeLsum metric between the prediction and target columns.

ROUGE-LSum is a variant of ROUGE-L that applies the Longest Common Subsequence (LCS) at the sentence level for summarization tasks. It evaluates how well the generated summary captures the overall sentence structure and important elements of the reference summary by computing the LCS for each sentence in the document.

Parameters:
- prediction – prediction column name
- target – target column name
Returns: rougeLsum metric

dbnl.eval.metrics.rouge_metrics(prediction: str, target: str) → list[Metric]

Returns all rouge metrics between the prediction and target columns.

Parameters:
- prediction – prediction column name
- target – target column name
Returns: list of rouge metrics

dbnl.eval.metrics.sentence_count(text_col_name: str) → Metric

Returns the sentence count metric for the text_col_name column.

Parameters: text_col_name – text column name
Returns: sentence_count metric

dbnl.eval.metrics.summarization_metrics(prediction: str, target: str | None = None, eval_embedding_client: EmbeddingClient | None = None) → list[Metric]

Returns a set of metrics relevant for a summarization task.

Parameters:
- prediction – prediction column name (i.e. generated summary)
- target – target column name (i.e. expected summary)
Returns: list of metrics

dbnl.eval.metrics.token_count(text_col_name: str) → Metric

Returns the token count metric for the text_col_name column.

A token is a sequence of characters that represents a single unit of meaning, such as a word or punctuation mark. The token count metric calculates the total number of tokens in the text. Different languages may have different tokenization rules. This function is implemented using the nltk library.

Parameters: text_col_name – text column name
Returns: token_count metric

dbnl.eval.metrics.word_count(text_col_name: str) → Metric

Returns the word count metric for the text_col_name column.

Parameters: text_col_name – text column name
Returns: word_count metric

dbnl.eval.metrics.quality_llm_text_similarity(prediction: str target: str, eval_llm_client: LLMClient) → Metric

Computes the similarity of the prediction and target text by evaluating using a language model.

This metric is generated by an LLM using a specific specific prompt named llm_accuracy available in dbnl.eval.metrics.prompts.

Parameters:
- prediction – prediction column name
- target - target (expected value) column name
- eval_llm_client – Eval LLM client
Returns: text similarity metric

eval.metrics

Classes and methods in dbnl.eval.metrics.

class dbnl.eval.metrics.Metric

column_schema() → ColumnSchema

Returns the column schema for the metric to be used in a run config.

Returns: _description_
Return type: ColumnSchema

description() → str | None

Returns the description of the metric.

Returns: Description of the metric.

abstract evaluate(df: pd.DataFrame) → pd.Series[Any]

Evaluates the metric over the provided dataframe.

Parameters: df – Input data from which to compute metric.
Returns: Metric values.

abstract expression() → str

Returns the expression representing the metric (e.g. rouge1(prediction, target)).

Returns: Metric expression.

greater_is_better() → bool | None

If true, larger values are assumed to be directionally better than smaller once. If false, smaller values are assumged to be directionally better than larger one. If None, assumes nothing.

Returns: True if greater is better, False if smaller is better, otherwise None.

abstract metric() → str

Returns: Metric name (e.g. rouge1).

abstract name() → str

Returns the fully qualified name of the metric (e.g. rouge1__prediction__target).

Returns: Metric name.

abstract type() → Type

Returns the type of the metric (e.g. float)

Returns: Metric type.

class dbnl.eval.metrics.RougeScoreType(value, 
                                       names=None, *, 
                                       module=None, 
                                       qualname=None, 
                                       type=None, 
                                       start=1, 
                                       boundary=None)

dbnl.eval.metrics.answer_quality_llm_accuracy(input: str, context: str, prediction: str, eval_llm_client: LLMClient) → Metric

Computes the accuracy of the answer by evaluating the accuracy score of the answer using a language model.

This metric is generated by an LLM using a specific prompt named llm_accuracy available in dbnl.eval.metrics.prompts.

Parameters:
- input – input column name
- context – context column name
- prediction – prediction column name
- eval_llm_client – eval_llm_client
Returns: accuracy metric

dbnl.eval.metrics.answer_quality_llm_answer_correctness(input: str, prediction: str, target: str, eval_llm_client: LLMClient) → Metric

Returns the answer correctness metric.

This metric is generated by an LLM using a specific prompt named llm_answer_correctness available in dbnl.eval.metrics.prompts.

Parameters:
- input – input column name
- prediction – prediction column name
- target – target column name
- eval_llm_client – eval LLM client
Returns: answer correctness metric

dbnl.eval.metrics.answer_quality_llm_answer_similarity(input: str, prediction: str, target: str, eval_llm_client: LLMClient) → Metric

Returns answer similarity metric.

This metric is generated by an LLM using a specific prompt named llm_answer_similarity available in dbnl.eval.metrics.prompts.

Parameters:
- input – input column name
- prediction – prediction column name
- target – target column name
- eval_llm_client – eval_llm_client
Returns: answer similarity metric

dbnl.eval.metrics.answer_quality_llm_coherence(prediction: str, eval_llm_client: LLMClient) → Metric

Computes the coherence of the answer by evaluating the coherence score of the answer using a language model.

This metric is generated by an LLM using a specific prompt named llm_coherence available in dbnl.eval.metrics.prompts.

Parameters:
- prediction – prediction column name
- eval_llm_client – eval_llm_client
Returns: coherence metric

dbnl.eval.metrics.answer_quality_llm_commital(prediction: str, eval_llm_client: LLMClient) → Metric

Computes the commital of the answer by evaluating the commital score of the answer using a language model.

This metric is generated by an LLM using a specific prompt named llm_commital available in dbnl.eval.metrics.prompts.

Parameters:
- prediction – prediction column name
- eval_llm_client – eval_llm_client
Returns: commital metric

dbnl.eval.metrics.answer_quality_llm_completeness(input: str, prediction: str, eval_llm_client: LLMClient) → Metrics

Computes the completeness of the answer by evaluating the completeness score of the answer using a language model.

This metric is generated by an LLM using a specific prompt named llm_completeness available in dbnl.eval.metrics.prompts.

Parameters:
- input – input column name
- prediction – prediction column
- eval_llm_client – eval_llm_client
Returns: completeness metric

dbnl.eval.metrics.answer_quality_llm_contextual_relevance(input: str, context: str, eval_llm_client: LLMClient) → Metric

Computes the contextual relevance of the answer by evaluating the contextual relevance score of the answer using a language model.

This metric is generated by an LLM using a specific prompt named llm_contextual_relevance available in dbnl.eval.metrics.prompts.

Parameters:
- input – input column name
- context – context column name
- eval_llm_client – eval_llm_client
Returns: contextual relevance metric

dbnl.eval.metrics.answer_quality_llm_faithfulness(input: str, context: str, prediction: str, eval_llm_client: LLMClient) → Metric

Returns the faithfulness metric.

This metric is generated by an LLM using a specific prompt named llm_faithfulness available in dbnl.eval.metrics.prompts.

Parameters:
- input – input column name
- context – context column name
- prediction – prediction column name
- eval_llm_client – eval_llm_client
Returns: faithfulness metric

dbnl.eval.metrics.answer_quality_llm_grammar_accuracy(prediction: str, eval_llm_client: LLMClient) → Metric

Computes the grammar accuracy of the answer by evaluating the grammar accuracy score of the answer using a language model.

This metric is generated by an LLM using a specific prompt named llm_grammar_accuracy available in dbnl.eval.metrics.prompts.

Parameters:
- prediction – prediction column name
- eval_llm_client – eval_llm_client
Returns: grammar accuracy metric

dbnl.eval.metrics.answer_quality_llm_metrics(input: str | None, prediction: str, context: str | None, target: str | None, eval_llm_client: LLMClient) → list[Metric]

Returns a set of metrics which evaluate the quality of the generated answer. This does not include metrics that require a ground truth.

Parameters:
- input – input column name (i.e. question)
- prediction – prediction column name (i.e. generated answer)
- context – context column name (i.e. document or set of documents retrieved)
- eval_llm_client – eval_llm_client
Returns: list of metrics

dbnl.eval.metrics.answer_quality_llm_originality(prediction: str, eval_llm_client: LLMClient) → Metric

Computes the originality of the answer by evaluating the originality score of the answer using a language model.

This metric is generated by an LLM using a specific prompt named llm_originality available in dbnl.eval.metrics.prompts.

Parameters:
- prediction – prediction column name
- eval_llm_client – eval_llm_client
Returns: originality metric

dbnl.eval.metrics.answer_quality_llm_relevance(input: str, context: str, prediction: str, eval_llm_client: LLMClient) → Metric

Returns relevance metric with context.

This metric is generated by an LLM using a specific prompt named llm_relevance available in dbnl.eval.metrics.prompts.

Parameters:
- input – input column name
- context – context column name
- prediction – prediction column name
- eval_llm_client – eval_llm_client
Returns: answer relevance metric with context

dbnl.eval.metrics.answer_viability_llm_metrics(prediction: str, eval_llm_client: LLMClient) → list[Metric]

Returns a list of metrics relevant for a question and answer task.

Parameters:
- prediction – prediction column name (i.e. generated answer)
- eval_llm_client – eval_llm_client
Returns: list of metrics

dbnl.eval.metrics.answer_viability_llm_reading_complexity(prediction: str, eval_llm_client: LLMClient) → Metric

Computes the reading complexity of the answer by evaluating the reading complexity score of the answer using a language model.

This metric is generated by an LLM using a specific prompt named llm_reading_complexity available in dbnl.eval.metrics.prompts.

Parameters:
- prediction – prediction column name
- eval_llm_client – eval_llm_client
Returns: reading complexity metric

dbnl.eval.metrics.answer_viability_llm_sentiment_assessment(prediction: str, eval_llm_client: LLMClient) → Metric

Computes the sentiment of the answer by evaluating the sentiment assessment score of the answer using a language model.

This metric is generated by an LLM using a specific prompt named llm_sentiment_assessment available in dbnl.eval.metrics.prompts.

Parameters:
- prediction – prediction column name
- eval_llm_client – eval_llm_client
Returns: sentiment assessment metric

dbnl.eval.metrics.answer_viability_llm_text_fluency(prediction: str, eval_llm_client: LLMClient) → Metric

Computes the text fluency of the answer by evaluating the perplexity of the answer using a language model.

This metric is generated by an LLM using a specific prompt named llm_text_fluency available in dbnl.eval.metrics.prompts.

Parameters:
- prediction – prediction column name
- eval_llm_client – eval_llm_client
Returns: text fluency metric

dbnl.eval.metrics.answer_viability_llm_text_toxicity(prediction: str, eval_llm_client: LLMClient) → Metric

Computes the toxicity of the answer by evaluating the toxicity score of the answer using a language model.

This metric is generated by an LLM using a specific prompt named llm_text_toxicity available in dbnl.eval.metrics.prompts.

Parameters:
- prediction – prediction column name
- eval_llm_client – eval_llm_client
Returns: toxicity metric

dbnl.eval.metrics.automated_readability_index(text_col_name: str) → Metric

Returns the Automated Readability Index metric for the text_col_name column.

Parameters: text_col_name – text column name
Returns: automated_readability_index metric:

dbnl.eval.metrics.bleu(prediction: str, target: str) → Metric

Returns the bleu metric between the prediction and target columns.

Parameters:
- prediction – prediction column name
- target – target column name
Returns: bleu metric

dbnl.eval.metrics.character_count(text_col_name: str) → Metric

Returns the character count metric for the text_col_name column.

Parameters: text_col_name – text column name
Returns: character_count metric

dbnl.eval.metrics.context_hit(ground_truth_document_id: str, retrieved_document_ids: str) → Metric

Returns the context hit metric.

Parameters:
- ground_truth_document_id – ground_truth_document_id column name
- retrieved_document_ids – retrieved_document_ids column name
Returns: context hit metric

dbnl.eval.metrics.count_metrics(text_col_name: str) → list[Metric]

Returns a set of metrics relevant for a question and answer task.

Parameters: text_col_name – text column name
Returns: list of metrics

dbnl.eval.metrics.flesch_kincaid_grade(text_col_name: str) → Metric

Returns the Flesch-Kincaid Grade metric for the text_col_name column.

Parameters: text_col_name – text column name
Returns: flesch_kincaid_grade metric

dbnl.eval.metrics.ground_truth_non_llm_answer_metrics(prediction: str, target: str) → list[Metric]

Returns a set of metrics relevant for a question and answer task.

Parameters:
- prediction – prediction column name (i.e. generated answer)
- target – target column name (i.e. expected answer)
Returns: list of metrics

dbnl.eval.metrics.ground_truth_non_llm_retrieval_metrics(ground_truth_document_id: str, retrieved_document_ids: str) → list[Metric]

Returns a set of metrics relevant for a question and answer task.

Parameters:
- ground_truth_document_id – ground_truth_document_id column name
- retrieved_document_ids – retrieved_document_ids column name
Returns: list of metrics

dbnl.eval.metrics.inner_product_retrieval(ground_truth_document_text: str, top_retrieved_document_text: str, eval_embedding_client: EmbeddingClient) → Metric

Returns the inner product metric between the ground_truth_document_text and top_retrieved_document_text columns.

Parameters:
- ground_truth_document_text – ground_truth_document_text column name
- top_retrieved_document_text – top_retrieved_document_text column name
- embedding_client – embedding client
Returns: inner product metric

dbnl.eval.metrics.inner_product_target_prediction(prediction: str, target: str, eval_embedding_client: EmbeddingClient) → Metric

Returns the inner product metric between the prediction and target columns.

Parameters:
- prediction – prediction column name
- target – target column name
- embedding_client – embedding client
Returns: inner product metric

dbnl.eval.metrics.levenshtein(prediction: str, target: str) → Metric

Returns the levenshtein metric between the prediction and target columns.

Parameters:
- prediction – prediction column name
- target – target column name
Returns: levenshtein metric

dbnl.eval.metrics.mrr(ground_truth_document_id: str, retrieved_document_ids: str) → Metric

Returns the mean reciprocal rank (MRR) metric.

Parameters:
- ground_truth_document_id – ground_truth_document_id column name
- retrieved_document_ids – retrieved_document_ids column name
Returns: mrr metric

dbnl.eval.metrics.non_llm_non_ground_truth_metrics(prediction: str) → list[Metric]

Returns a set of metrics relevant for a question and answer task.

Parameters: prediction – prediction column name (i.e. generated answer)
Returns: list of metrics

dbnl.eval.metrics.question_and_answer_metrics(prediction: str, target: str | None = None, input: str | None = None, context: str | None = None, ground_truth_document_id: str | None = None, retrieved_document_ids: str | None = None, ground_truth_document_text: str | None = None, top_retrieved_document_text: str | None = None, eval_llm_client: LLMClient | None = None, eval_embedding_client: EmbeddingClient | None = None) → list[Metric]

Returns a set of metrics relevant for a question and answer task.

Parameters:
- prediction – prediction column name (i.e. generated answer)
- target – target column name (i.e. expected answer)
- input – input column name (i.e. question)
- context – context column name (i.e. document or set of documents retrieved)
- ground_truth_document_id – ground_truth_document_id containing the information in the target
- retrieved_document_ids – retrieved_document_ids containing the full context
- ground_truth_document_text – text containing the information in the target (ideal is for this to be the top retrieved document)
- top_retrieved_document_text – text of the top retrieved document
- eval_llm_client – eval_llm_client
- eval_embedding_client – eval_embedding_client
Returns: list of metrics

dbnl.eval.metrics.question_and_answer_metrics_extended(prediction: str, target: str | None = None, input: str | None = None, context: str | None = None, ground_truth_document_id: str | None = None, retrieved_document_ids: str | None = None, ground_truth_document_text: str | None = None, top_retrieved_document_text: str | None = None, eval_llm_client: LLMClient | None = None, eval_embedding_client: EmbeddingClient | None = None) → list[Metric]

Returns a set of all metrics relevant for a question and answer task.

Parameters:
- prediction – prediction column name (i.e. generated answer)
- target – target column name (i.e. expected answer)
- input – input column name (i.e. question)
- context – context column name (i.e. document or set of documents retrieved)
- ground_truth_document_id – ground_truth_document_id containing the information in the target
- retrieved_document_ids – retrieved_document_ids containing the full context
- ground_truth_document_text – text containing the information in the target (ideal is for this to be the top retrieved document)
- top_retrieved_document_text – text of the top retrieved document
- eval_llm_client – eval_llm_client
- eval_embedding_client – eval_embedding_client
Returns: list of metrics

dbnl.eval.metrics.text_metrics(prediction: str, target: str | None = None, eval_llm_client: LLMClient | None = None, eval_embedding_client: EmbeddingClient | None = None) → list[Metric]

Returns a set metrics relevant for generic text applications

Parameters:
- prediction – prediction column name (i.e. generated answer)
- target – target column name (i.e. expected answer)
- eval_llm_client – eval_llm_client
- eval_embedding_client – eval_embedding_client
Returns: list of metrics

dbnl.eval.metrics.rouge1(prediction: str, target: str, score_type: RougeScoreType = RougeScoreType.FMEASURE) → Metric

Returns the rouge1 metric between the prediction and target columns.

Parameters:
- prediction – prediction column name
- target – target column name
Returns: rouge1 metric

dbnl.eval.metrics.rouge2(prediction: str, target: str, score_type: RougeScoreType = RougeScoreType.FMEASURE) → Metric

Returns the rouge2 metric between the prediction and target columns.

Parameters:
- prediction – prediction column name
- target – target column name
Returns: rouge2 metric

dbnl.eval.metrics.rougeL(prediction: str, target: str, score_type: RougeScoreType = RougeScoreType.FMEASURE) → Metric

Returns the rougeL metric between the prediction and target columns.

Parameters:
- prediction – prediction column name
- target – target column name
Returns: rougeL metric

dbnl.eval.metrics.rougeLsum(prediction: str, target: str, score_type: RougeScoreType = RougeScoreType.FMEASURE) → Metric

Returns the rougeLsum metric between the prediction and target columns.

Parameters:
- prediction – prediction column name
- target – target column name
Returns: rougeLsum metric

dbnl.eval.metrics.rouge_metrics(prediction: str, target: str) → list[Metric]

Returns all rouge metrics between the prediction and target columns.

Parameters:
- prediction – prediction column name
- target – target column name
Returns: list of rouge metrics

dbnl.eval.metrics.sentence_count(text_col_name: str) → Metric

Returns the sentence count metric for the text_col_name column.

Parameters: text_col_name – text column name
Returns: sentence_count metric

dbnl.eval.metrics.summarization_metrics(prediction: str, target: str | None = None, eval_embedding_client: EmbeddingClient | None = None) → list[Metric]

Returns a set of metrics relevant for a summarization task.

Parameters:
- prediction – prediction column name (i.e. generated summary)
- target – target column name (i.e. expected summary)
Returns: list of metrics

dbnl.eval.metrics.token_count(text_col_name: str) → Metric

Returns the token count metric for the text_col_name column.

Parameters: text_col_name – text column name
Returns: token_count metric

dbnl.eval.metrics.word_count(text_col_name: str) → Metric

Returns the word count metric for the text_col_name column.

Parameters: text_col_name – text column name
Returns: word_count metric

dbnl.eval.metrics.quality_llm_text_similarity(prediction: str target: str, eval_llm_client: LLMClient) → Metric

Computes the similarity of the prediction and target text by evaluating using a language model.

This metric is generated by an LLM using a specific specific prompt named llm_accuracy available in dbnl.eval.metrics.prompts.

Parameters:
- prediction – prediction column name
- target - target (expected value) column name
- eval_llm_client – Eval LLM client
Returns: text similarity metric