Application Metric Sets

The metric set helpers return an adaptive list of metrics relevant to the application type. See the dbnl.eval.metrics reference for details on all the metric functions available in the eval SDK.

text_metrics()

Basic metrics for generic text comparison and monitoring

  • token_count

  • word_count

  • flesch_kincaid_grade

  • automated_readability_index

  • bleu

  • levenshtein

  • rouge1

  • rouge2

  • rougeL

  • rougeLsum

  • llm_text_toxicity_v0

  • llm_sentiment_assessment_v0

  • llm_reading_complexity_v0

  • llm_grammar_accuracy_v0

  • inner_product

  • llm_text_similarity_v0

question_and_answer_metrics()

Basic metrics for RAG / question answering

  • llm_accuracy_v0

  • llm_completeness_v0

  • answer_similarity_v0

  • faithfulness_v0

  • mrr

  • context_hit

The metric set helpers are adaptive in that:

  1. The metrics returned encode which columns of the dataframe are inputs to the metric computation. For example, rougeL_prediction__ground_truth is the rougeL metric computed with the column named prediction and the column named ground_truth as inputs (see the sketch after this list)

  2. The metrics returned also accept optional column info and LLM-as-judge or embedding model clients. If any of this optional information is not provided, the metric set excludes the metrics that depend on it (see the usage sketch after the text_metrics signature below)
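
For example, a minimal sketch of point 1, assuming text_metrics can be imported from dbnl.eval.metrics (the exact import path may differ in your SDK version):

from dbnl.eval.metrics import text_metrics  # assumed import path

metrics = text_metrics(prediction="prediction", target="ground_truth")
# The returned metrics encode their input columns in their names, e.g. the
# rougeL metric over these two columns is named rougeL_prediction__ground_truth.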

def text_metrics(
    prediction: str,
    target: Optional[str] = None,
    eval_llm_client: Optional[LLMClient] = None,
    eval_embedding_client: Optional[EmbeddingClient] = None,
) -> list[Metric]:
    """
    Returns a set of metrics relevant for a generic text application

    :param prediction: prediction column name (i.e. generated text)
    :param target: target column name (i.e. expected text)
    :param eval_llm_client: LLM client used for the LLM-as-judge metrics
    :param eval_embedding_client: embedding client used for the embedding-based metrics
    :return: list of metrics
    """

See the How-To section for concrete examples of adaptive text_metrics() usage

See the RAG example for question_and_answer_metrics() usage
