dbnl.eval.metrics
class dbnl.eval.metrics.Metric
column_schema() → RunSchemaColumnSchemaDict
Returns the column schema for the metric to be used in a run config.
Returns: Column schema for the metric.
component() → str | None
description() → str | None
Returns the description of the metric.
Returns: Description of the metric.
abstract evaluate(df: pd.DataFrame) → pd.Series[Any]
Evaluates the metric over the provided dataframe.
Parameters:
df – Input data from which to compute the metric.
Returns: Metric values.
abstract expression() → str
Returns the expression representing the metric (e.g. rouge1(prediction, target)).
Returns: Metric expression.
greater_is_better() → bool | None
If True, larger values are assumed to be directionally better than smaller ones. If False, smaller values are assumed to be directionally better than larger ones. If None, nothing is assumed.
Returns: True if greater is better, False if smaller is better, otherwise None.
abstract inputs() → list[str]
Returns the input column names required to compute the metric.
Returns: Input column names.
abstract metric() → str
Returns the metric name (e.g. rouge1).
Returns: Metric name.
abstract name() → str
Returns the fully qualified name of the metric (e.g. rouge1__prediction__target).
Returns: Metric name.
run_schema_column() → RunSchemaColumnSchema
Returns the column schema for the metric to be used in a run config.
Returns: Column schema for the metric.
abstract type() → Literal['boolean', 'int', 'long', 'float', 'double', 'string', 'category']
Returns the type of the metric (e.g. float).
Returns: Metric type.
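The two examples given for expression() and name() suggest a simple naming convention built from the metric name and its input columns. The sketch below reproduces only those two documented examples; the helper functions are hypothetical, are not part of dbnl.eval.metrics, and the convention is an assumption inferred from the examples rather than a documented rule.

```python
def metric_expression(metric: str, inputs: list[str]) -> str:
    """Build an expression such as 'rouge1(prediction, target)' (assumed convention)."""
    return f"{metric}({', '.join(inputs)})"


def metric_name(metric: str, inputs: list[str]) -> str:
    """Build a fully qualified name such as 'rouge1__prediction__target' (assumed convention)."""
    return "__".join([metric, *inputs])


print(metric_expression("rouge1", ["prediction", "target"]))
print(metric_name("rouge1", ["prediction", "target"]))
```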
class dbnl.eval.metrics.RougeScoreType(value)
An enumeration.
FMEASURE = 'fmeasure'
PRECISION = 'precision'
RECALL = 'recall'
answer_quality_llm_accuracy
Computes the accuracy score of the answer using a language model.
This metric is generated by an LLM using a specific prompt named llm_accuracy, available in dbnl.eval.metrics.prompts.
Parameters:
input – input column name
context – context column name
prediction – prediction column name
eval_llm_client – LLM client used for evaluation
Returns: accuracy metric
answer_quality_llm_answer_correctness
Returns the answer correctness metric.
This metric is generated by an LLM using a specific prompt named llm_answer_correctness, available in dbnl.eval.metrics.prompts.
Parameters:
input – input column name
prediction – prediction column name
target – target column name
eval_llm_client – LLM client used for evaluation
Returns: answer correctness metric
answer_quality_llm_answer_similarity
Returns the answer similarity metric.
This metric is generated by an LLM using a specific prompt named llm_answer_similarity, available in dbnl.eval.metrics.prompts.
Parameters:
input – input column name
prediction – prediction column name
target – target column name
eval_llm_client – LLM client used for evaluation
Returns: answer similarity metric
answer_quality_llm_coherence
Computes the coherence score of the answer using a language model.
This metric is generated by an LLM using a specific prompt named llm_coherence, available in dbnl.eval.metrics.prompts.
Parameters:
prediction – prediction column name
eval_llm_client – LLM client used for evaluation
Returns: coherence metric
answer_quality_llm_commital
Computes the committal score of the answer using a language model.
This metric is generated by an LLM using a specific prompt named llm_commital, available in dbnl.eval.metrics.prompts.
Parameters:
prediction – prediction column name
eval_llm_client – LLM client used for evaluation
Returns: committal metric
answer_quality_llm_completeness
Computes the completeness score of the answer using a language model.
This metric is generated by an LLM using a specific prompt named llm_completeness, available in dbnl.eval.metrics.prompts.
Parameters:
input – input column name
prediction – prediction column name
eval_llm_client – LLM client used for evaluation
Returns: completeness metric
answer_quality_llm_contextual_relevance
Computes the contextual relevance score of the answer using a language model.
This metric is generated by an LLM using a specific prompt named llm_contextual_relevance, available in dbnl.eval.metrics.prompts.
Parameters:
input – input column name
context – context column name
eval_llm_client – LLM client used for evaluation
Returns: contextual relevance metric
answer_quality_llm_faithfulness
Returns the faithfulness metric.
This metric is generated by an LLM using a specific prompt named llm_faithfulness, available in dbnl.eval.metrics.prompts.
Parameters:
input – input column name
context – context column name
prediction – prediction column name
eval_llm_client – LLM client used for evaluation
Returns: faithfulness metric
answer_quality_llm_grammar_accuracy
Computes the grammar accuracy score of the answer using a language model.
This metric is generated by an LLM using a specific prompt named llm_grammar_accuracy, available in dbnl.eval.metrics.prompts.
Parameters:
prediction – prediction column name
eval_llm_client – LLM client used for evaluation
Returns: grammar accuracy metric
answer_quality_llm_metrics
Returns a set of metrics which evaluate the quality of the generated answer. This does not include metrics that require a ground truth.
Parameters:
input – input column name (i.e. question)
prediction – prediction column name (i.e. generated answer)
context – context column name (i.e. document or set of documents retrieved)
eval_llm_client – LLM client used for evaluation
Returns: list of metrics
answer_quality_llm_originality
Computes the originality score of the answer using a language model.
This metric is generated by an LLM using a specific prompt named llm_originality, available in dbnl.eval.metrics.prompts.
Parameters:
prediction – prediction column name
eval_llm_client – LLM client used for evaluation
Returns: originality metric
answer_quality_llm_relevance
Returns the relevance metric with context.
This metric is generated by an LLM using a specific prompt named llm_relevance, available in dbnl.eval.metrics.prompts.
Parameters:
input – input column name
context – context column name
prediction – prediction column name
eval_llm_client – LLM client used for evaluation
Returns: answer relevance metric with context
answer_viability_llm_metrics
Returns a list of metrics relevant for a question and answer task.
Parameters:
prediction – prediction column name (i.e. generated answer)
eval_llm_client – LLM client used for evaluation
Returns: list of metrics
answer_viability_llm_reading_complexity
Computes the reading complexity score of the answer using a language model.
This metric is generated by an LLM using a specific prompt named llm_reading_complexity, available in dbnl.eval.metrics.prompts.
Parameters:
prediction – prediction column name
eval_llm_client – LLM client used for evaluation
Returns: reading complexity metric
answer_viability_llm_sentiment_assessment
Computes the sentiment assessment score of the answer using a language model.
This metric is generated by an LLM using a specific prompt named llm_sentiment_assessment, available in dbnl.eval.metrics.prompts.
Parameters:
prediction – prediction column name
eval_llm_client – LLM client used for evaluation
Returns: sentiment assessment metric
answer_viability_llm_text_fluency
Computes the text fluency of the answer by evaluating the perplexity of the answer using a language model.
This metric is generated by an LLM using a specific prompt named llm_text_fluency, available in dbnl.eval.metrics.prompts.
Parameters:
prediction – prediction column name
eval_llm_client – LLM client used for evaluation
Returns: text fluency metric
answer_viability_llm_text_toxicity
Computes the toxicity score of the answer using a language model.
This metric is generated by an LLM using a specific prompt named llm_text_toxicity, available in dbnl.eval.metrics.prompts.
Parameters:
prediction – prediction column name
eval_llm_client – LLM client used for evaluation
Returns: toxicity metric
automated_readability_index
Returns the Automated Readability Index metric for the text_col_name column.
Calculates the Automated Readability Index (ARI) for a given text. ARI is a readability metric that estimates the U.S. school grade level necessary to understand the text, based on the number of characters per word and words per sentence.
Parameters:
text_col_name – text column name
Returns: automated_readability_index metric
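To make the ARI description concrete, here is a minimal sketch of the standard published formula, 4.71 × (characters / words) + 0.5 × (words / sentences) − 21.43. The whitespace word splitting, the punctuation-based sentence splitting, and the choice to count only letters and digits as characters are assumptions of this sketch; the library's own tokenization is not documented here and may differ.

```python
import re


def automated_readability_index(text: str) -> float:
    """Standard ARI formula with simple, assumed tokenization rules."""
    words = text.split()
    # Assumption: sentences end at runs of '.', '!' or '?'.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not words or not sentences:
        return 0.0
    # Assumption: only letters and digits count as characters.
    characters = sum(len(re.findall(r"[A-Za-z0-9]", word)) for word in words)
    return 4.71 * characters / len(words) + 0.5 * len(words) / len(sentences) - 21.43


score = automated_readability_index("The cat sat. The dog ran.")
print(round(score, 2))  # -5.8
```

Very short, simple text can produce a negative value, as in this example; the score only approximates a grade level.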
bleu
Returns the BLEU metric between the prediction and target columns.
The BLEU score evaluates a generated sentence against a reference sentence. It is a number between 0 and 1, where 1 means that the generated sentence is identical to the reference sentence.
Parameters:
prediction – prediction column name
target – target column name
Returns: bleu metric
character_count
Returns the character count metric for the text_col_name column.
Parameters:
text_col_name – text column name
Returns: character_count metric
context_hit
Returns the context hit metric.
This boolean-valued metric is used to evaluate whether the ground truth document is present in the list of retrieved documents. The context hit metric is true (1) if the ground truth document is present in the list of retrieved documents, and false (0) otherwise.
Parameters:
ground_truth_document_id – ground_truth_document_id column name
retrieved_document_ids – retrieved_document_ids column name
Returns: context hit metric
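The per-row logic described above amounts to a membership test. The following is an illustrative sketch only: the function name and argument types are assumptions, and the library itself operates on dataframe columns rather than on individual values.

```python
from typing import Sequence


def context_hit(ground_truth_document_id: str, retrieved_document_ids: Sequence[str]) -> bool:
    """True when the ground truth document appears among the retrieved documents."""
    return ground_truth_document_id in retrieved_document_ids


print(context_hit("doc-2", ["doc-1", "doc-2", "doc-3"]))  # True
print(context_hit("doc-9", ["doc-1", "doc-2", "doc-3"]))  # False
```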
count_metrics
Returns a set of metrics relevant for a question and answer task.
Parameters:
text_col_name – text column name
Returns: list of metrics
flesch_kincaid_grade
Returns the Flesch-Kincaid Grade metric for the text_col_name column.
Calculates the Flesch-Kincaid Grade Level for a given text. The Flesch-Kincaid Grade Level is a readability metric that estimates the U.S. school grade level required to understand the text. It is based on the average number of syllables per word and words per sentence.
Parameters:
text_col_name – text column name
Returns: flesch_kincaid_grade metric
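As a rough illustration, the standard published Flesch-Kincaid Grade Level formula is 0.39 × (words / sentences) + 11.8 × (syllables / words) − 15.59. The sketch below takes the three counts as inputs so that it does not have to guess how the library counts syllables or splits sentences; the function signature is an assumption and is not the library's API.

```python
def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    """Standard Flesch-Kincaid Grade Level formula from precomputed counts."""
    if words == 0 or sentences == 0:
        return 0.0
    return 0.39 * words / sentences + 11.8 * syllables / words - 15.59


# 100 words in 5 sentences with 150 syllables in total.
print(round(flesch_kincaid_grade(words=100, sentences=5, syllables=150), 2))  # 9.91
```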
ground_truth_non_llm_answer_metrics
Returns a set of metrics relevant for a question and answer task.
Parameters:
prediction – prediction column name (i.e. generated answer)
target – target column name (i.e. expected answer)
Returns: list of metrics
ground_truth_non_llm_retrieval_metrics
Returns a set of metrics relevant for a question and answer task.
Parameters:
ground_truth_document_id – ground_truth_document_id column name
retrieved_document_ids – retrieved_document_ids column name
Returns: list of metrics
inner_product_retrieval
Returns the inner product metric between the ground_truth_document_text and top_retrieved_document_text columns.
This metric is used to evaluate the similarity between the ground truth document and the top retrieved document using the inner product of their embeddings. The embedding client is used to retrieve the embeddings for the ground truth document and the top retrieved document. An embedding is a high-dimensional vector representation of a string of text.
Parameters:
ground_truth_document_text – ground_truth_document_text column name
top_retrieved_document_text – top_retrieved_document_text column name
embedding_client – embedding client
Returns: inner product metric
inner_product_target_prediction
Returns the inner product metric between the prediction and target columns.
This metric is used to evaluate the similarity between the prediction and target columns using the inner product of their embeddings. The embedding client is used to retrieve the embeddings for the prediction and target columns. An embedding is a high-dimensional vector representation of a string of text.
Parameters:
prediction – prediction column name
target – target column name
embedding_client – embedding client
Returns: inner product metric
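Both inner product metrics above reduce to the same arithmetic once the embeddings are available. The sketch below shows only that final step on two hand-written vectors; obtaining real embeddings requires an embedding client, which is outside the scope of this example, and the function name is an assumption.

```python
from typing import Sequence


def inner_product(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner (dot) product of two embedding vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    return sum(x * y for x, y in zip(a, b))


print(inner_product([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))  # 32.0
```

For embeddings normalized to unit length, the inner product equals cosine similarity, so larger values indicate more similar texts.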
levenshtein
Returns the levenshtein metric between the prediction and target columns.
The Levenshtein distance is a metric for evaluating the similarity between two strings. It is a non-negative integer, where 0 means that the two strings are identical, and a higher value is the number of single-character edits (insertions, deletions, or substitutions) required to transform one string into the other.
Parameters:
prediction – prediction column name
target – target column name
Returns: levenshtein metric
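For reference, the distance itself can be computed with the classic dynamic-programming recurrence. This is a minimal standalone sketch, not necessarily how the library implements the metric.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and substitutions turning a into b."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))  # 3
```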
mrr
Returns the mean reciprocal rank (MRR) metric.
This metric is used to evaluate the quality of a ranked list of documents. The MRR score is a number between 0 and 1, where 1 means that the ground truth document is ranked first in the list. The MRR score is calculated by taking the reciprocal of the rank of the first relevant document in the list.
Parameters:
ground_truth_document_id – ground_truth_document_id column name
retrieved_document_ids – retrieved_document_ids column name
Returns: mrr metric
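The calculation described above can be sketched as follows. The function names are assumptions, and scoring a missing document as 0 is a common convention assumed here rather than documented behavior of the library.

```python
from typing import Iterable, Sequence, Tuple


def reciprocal_rank(ground_truth_document_id: str, retrieved_document_ids: Sequence[str]) -> float:
    """1 / rank of the ground truth document, or 0.0 if it was not retrieved (assumed)."""
    for rank, document_id in enumerate(retrieved_document_ids, start=1):
        if document_id == ground_truth_document_id:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(rows: Iterable[Tuple[str, Sequence[str]]]) -> float:
    """Average reciprocal rank over (ground truth id, retrieved ids) pairs."""
    scores = [reciprocal_rank(truth, retrieved) for truth, retrieved in rows]
    return sum(scores) / len(scores) if scores else 0.0


print(reciprocal_rank("d2", ["d1", "d2", "d3"]))  # 0.5
```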
non_llm_non_ground_truth_metrics
Returns a set of metrics relevant for a question and answer task.
Parameters:
prediction – prediction column name (i.e. generated answer)
Returns: list of metrics
quality_llm_text_similarity
Computes the similarity of the prediction and target text using a language model.
This metric is generated by an LLM using a specific prompt named llm_text_similarity, available in dbnl.eval.metrics.prompts.
Parameters:
prediction – prediction column name
eval_llm_client – LLM client used for evaluation
Returns: similarity metric
question_and_answer_metrics
Returns a set of metrics relevant for a question and answer task.
Parameters:
prediction – prediction column name (i.e. generated answer)
target – target column name (i.e. expected answer)
input – input column name (i.e. question)
context – context column name (i.e. document or set of documents retrieved)
ground_truth_document_id – ground_truth_document_id containing the information in the target
retrieved_document_ids – retrieved_document_ids containing the full context
ground_truth_document_text – text containing the information in the target (ideal is for this to be the top retrieved document)
top_retrieved_document_text – text of the top retrieved document
eval_llm_client – LLM client used for evaluation
eval_embedding_client – embedding client used for evaluation
Returns: list of metrics
question_and_answer_metrics_extended
Returns a set of all metrics relevant for a question and answer task.
Parameters:
prediction – prediction column name (i.e. generated answer)
target – target column name (i.e. expected answer)
input – input column name (i.e. question)
context – context column name (i.e. document or set of documents retrieved)
ground_truth_document_id – ground_truth_document_id containing the information in the target
retrieved_document_ids – retrieved_document_ids containing the full context
ground_truth_document_text – text containing the information in the target (ideal is for this to be the top retrieved document)
top_retrieved_document_text – text of the top retrieved document
eval_llm_client – LLM client used for evaluation
eval_embedding_client – embedding client used for evaluation
Returns: list of metrics
rouge1
Returns the rouge1 metric between the prediction and target columns.
ROUGE-1 is a recall-oriented metric that calculates the overlap of unigrams (individual words) between the predicted/generated summary and the reference summary. It measures how many single words from the reference summary appear in the predicted summary. ROUGE-1 focuses on basic word-level similarity and is used to evaluate the content coverage.
Parameters:
prediction – prediction column name
target – target column name
Returns: rouge1 metric
rouge2
Returns the rouge2 metric between the prediction and target columns.
ROUGE-2 is a recall-oriented metric that calculates the overlap of bigrams (pairs of consecutive words) between the predicted/generated summary and the reference summary. It measures how many word pairs from the reference summary appear in the predicted summary. ROUGE-2 focuses on similarity of consecutive word pairs and is used to evaluate content coverage.
Parameters:
prediction – prediction column name
target – target column name
Returns: rouge2 metric
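The n-gram overlap behind ROUGE-1 and ROUGE-2 can be sketched with a multiset intersection. The three returned values correspond to the precision, recall, and fmeasure score types listed in RougeScoreType. Lowercasing and whitespace tokenization are simplifying assumptions of this sketch; the library's actual tokenization and any stemming may differ.

```python
from collections import Counter


def rouge_n(prediction: str, target: str, n: int = 1) -> dict:
    """Clipped n-gram overlap between a prediction and a reference text."""
    def ngrams(text: str) -> Counter:
        tokens = text.lower().split()  # assumed tokenization
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    predicted, reference = ngrams(prediction), ngrams(target)
    # Counter intersection keeps the minimum count of each shared n-gram.
    overlap = sum((predicted & reference).values())
    precision = overlap / sum(predicted.values()) if predicted else 0.0
    recall = overlap / sum(reference.values()) if reference else 0.0
    total = precision + recall
    fmeasure = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "fmeasure": fmeasure}


scores = rouge_n("the cat sat on the mat", "the cat lay on the mat", n=2)
print(scores["recall"])  # 0.6
```

In this example three of the five reference bigrams also occur in the prediction, giving a ROUGE-2 recall of 0.6; with n=1, five of the six reference words are matched.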
rougeL
Returns the rougeL metric between the prediction and target columns.
ROUGE-L is a recall-oriented metric based on the Longest Common Subsequence (LCS) between the reference and generated summaries. It measures how well the generated summary captures the longest sequences of words that appear in the same order in the reference summary. This metric accounts for sentence-level structure and coherence.
Parameters:
prediction – prediction column name
target – target column name
Returns: rougeL metric
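The Longest Common Subsequence at the heart of ROUGE-L can be computed with a small dynamic program. This sketch works on whitespace-separated tokens, which is an assumption; it illustrates the LCS idea rather than the library's exact scoring.

```python
from typing import Sequence


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence of two token sequences."""
    # previous[j] holds the LCS length of the processed prefix of a and b[:j].
    previous = [0] * (len(b) + 1)
    for token_a in a:
        current = [0]
        for j, token_b in enumerate(b, start=1):
            if token_a == token_b:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


prediction = "the cat sat on the mat".split()
target = "the cat lay on the mat".split()
lcs = lcs_length(prediction, target)
print(lcs)  # 5
```

Dividing the LCS length by the number of reference tokens gives a ROUGE-L style recall (5 / 6 here), and dividing by the number of predicted tokens gives the corresponding precision.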
rougeLsum
Returns the rougeLsum metric between the prediction and target columns.
ROUGE-LSum is a variant of ROUGE-L that applies the Longest Common Subsequence (LCS) at the sentence level for summarization tasks. It evaluates how well the generated summary captures the overall sentence structure and important elements of the reference summary by computing the LCS for each sentence in the document.
Parameters:
prediction – prediction column name
target – target column name
Returns: rougeLsum metric
rouge_metrics
Returns all rouge metrics between the prediction and target columns.
Parameters:
prediction – prediction column name
target – target column name
Returns: list of rouge metrics
sentence_count
Returns the sentence count metric for the text_col_name column.
Parameters:
text_col_name – text column name
Returns: sentence_count metric
summarization_metrics
Returns a set of metrics relevant for a summarization task.
Parameters:
prediction – prediction column name (i.e. generated summary)
target – target column name (i.e. expected summary)
Returns: list of metrics
text_metrics
Returns a set of metrics relevant for a generic text application.
Parameters:
prediction – prediction column name (i.e. generated text)
target – target column name (i.e. expected text)
Returns: list of metrics
text_monitor_metrics
token_count
Returns the token count metric for the text_col_name column.
A token is a sequence of characters that represents a single unit of meaning, such as a word or punctuation mark. The token count metric calculates the total number of tokens in the text. Different languages may have different tokenization rules. This function is implemented using the spaCy library.
Parameters:
text_col_name – text column name
Returns: token_count metric
word_count
Returns the word count metric for the text_col_name column.
Parameters:
text_col_name – text column name
Returns: word_count metric