Quick Start

To use dbnl.eval, you will need to install the 'eval' package extra as described in these instructions.

  1. Create a client to power LLM-as-judge text metrics [optional]

  2. Generate a list of metrics suitable for comparing text_A to reference text_B

  3. Use dbnl.eval to compute the list of metrics.

  4. Publish the augmented dataframe and new metric quantities to DBNL

import dbnl
import os
import pandas as pd
from openai import OpenAI
from dbnl.eval.llm import OpenAILLMClient
from dbnl.eval import evaluate

# 1. create client to power LLM-as-judge metrics
base_oai_client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
oai_client = OpenAILLMClient.from_existing_client(base_oai_client, llm_model="gpt-3.5-turbo-0125")

eval_df = pd.DataFrame(
    [
        { "prediction":"France has no capital",
          "ground_truth": "The capital of France is Paris",},
        { "prediction":"The capital of France is Toronto",
          "ground_truth": "The capital of France is Paris",},
        { "prediction":"Paris is the capital",
          "ground_truth": "The capital of France is Paris",},
    ] * 4
)

# 2. get text metrics that use target (ground_truth) and LLM-as-judge metrics
text_metrics = dbnl.eval.metrics.text_metrics(
    prediction="prediction", target="ground_truth", eval_llm_client=oai_client
)
# 3. run text metrics that use target (ground_truth) and LLM-as-judge metrics
aug_eval_df = evaluate(eval_df, text_metrics)

# 4. publish to DBNL
dbnl.login(api_token=os.environ["DBNL_API_TOKEN"])
project = dbnl.get_or_create_project(name="DEAL_testing")
cols = dbnl.experimental.get_column_schemas_from_dataframe(aug_eval_df)
run_config = dbnl.create_run_config(project=project, columns=cols)
run = dbnl.create_run(project=project, run_config=run_config)
dbnl.report_results(run=run, data=aug_eval_df)
dbnl.close_run(run=run)

You can inspect a subset of the aug_eval_df rows and, for example, one of the columns created by the metrics in the text_metrics list: llm_text_similarity_v0.
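A minimal pandas sketch for pulling up this view, reusing the aug_eval_df produced by the snippet above (the column selection is just for illustration):

# select the input columns plus the generated metric column and show the first three rows
cols_to_show = [
    "prediction",
    "ground_truth",
    "llm_text_similarity_v0__prediction__ground_truth",
]
print(aug_eval_df[cols_to_show].head(3))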

| idx | prediction | ground_truth | llm_text_similarity_v0__prediction__ground_truth |
| --- | --- | --- | --- |
| 0 | France has no capital | The capital of France is Paris | 1 |
| 1 | The capital of France is Toronto | The capital of France is Paris | 1 |
| 2 | Paris is the capital | The capital of France is Paris | 5 |

The values of llm_text_similarity_v0 qualitatively match our expectations of the semantic similarity between the prediction and ground_truth columns.

def evaluate(df: pd.DataFrame, metrics: Sequence[Metric], inplace: bool = False) -> pd.DataFrame:
    """
    Evaluates a set of metrics on a dataframe, returning an augmented dataframe.

    :param df: input dataframe
    :param metrics: metrics to compute
    :param inplace: whether to modify the input dataframe in place
    :return: input dataframe augmented with metrics
    """

The column names of the metrics in the returned dataframe include the metric name and the columns that were used in that metric's computation.

For example, the metric named llm_text_similarity_v0 becomes llm_text_similarity_v0__prediction__ground_truth because it takes as input both the column named prediction and the column named ground_truth.
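You could use this naming pattern to collect the generated metric columns programmatically. A sketch, assuming the default (non-inplace) call above so that eval_df still holds only the original columns:

# metric columns follow the pattern <metric_name>__<input_column>__<input_column>
metric_cols = [c for c in aug_eval_df.columns if c not in eval_df.columns]
print(metric_cols)
# expect entries such as 'llm_text_similarity_v0__prediction__ground_truth'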

The call to evaluate() takes a dataframe and a metric list as input and returns a dataframe with extra columns. Each new column holds the value of a metric computation for that row.
