1 of 56

v0.22.x

Get Started

Overview

Distributional's adaptive testing platform

Distributional is an adaptive testing platform purpose-built for AI applications. It enables you to test AI application data at scale to define, understand, and improve your definition of AI behavior to ensure consistency and stability over time.

Adaptive Testing Workflow

Define Desired Behavior Automatically create a behavioral fingerprint from the app’s runtime logs and any existing development metrics, and generate associated tests to detect changes in that behavior over time.

Integrating Distributional with Other Tools

Getting Access to Distributional

Getting Access

Install the Python SDK

Installing the Python SDK and Accessing Distributional UI

Installing Distributional

1. Latest Stable Release

To install the latest stable release of the dbnl package:

2. Specific Release

To install a specific version (e.g., version 0.22.0):

3. Installing with the `eval` Extra

The dbnl.eval extra includes additional features and requires an external spaCy model.

3.1. Install the Required spaCy Model

To install the required en_core_web_sm pretrained English-language NLP model model for spaCy:

3.2. Install `dbnl` with the `eval` Extra

To install dbnl with evaluation extras:

If you need a specific version with evaluation extras (e.g., version 0.22.0):

4. Accessing the Distributional UI and API token

We recommend setting your API token as an environment variable, see below.

5. Environment Variables

DBNL has three reserved environment variables that it reads in before execution.

Set up for various deployment types

Version Matching Requirements

DBNL provides different versions of the API and SDK. Ensuring compatibility is critical for proper functionality. SDK and API server versions must match major and minor version numbers.

To check your SDK version:

To check your API server version:

Logging into the web app
Clicking the hamburger menu (☰) on the top-left corner
Viewing the version number listed in the footer

Quickstart

Get started with dbnl

This guide walks you through using the Distributional SDK to create your first project, submit two runs and create a test session to compare the behavior of your AI application over time.

Install the dbnl SDK.

First, if you haven't done so already, install the dbnl Python SDK.

pip install dbnl

Export the dbnl environment variables

If you haven't done so already, create a personal access token by going to ☰ > Personal Access Tokens.

Run this command to export the path to the API for your deployment and your personal access token as environment variables.

export DBNL_API_URL="YOUR_DBNL_API_URL"
export DBNL_API_TOKEN="YOUR_PERSONAL_ACCESS_TOKEN"

Create and test two Runs

In a Python script or notebook, add the following code to create two Runs and test them in a Test Session using the first run as the baseline and the second run as the experiment.

import random
from datetime import datetime

import dbnl
import pandas as pd

# Login to dbnl.
dbnl.login()

# Create a new project
now = datetime.now().isoformat()
project = dbnl.get_or_create_project(name=f"quickstart-{now}")

# Submit a first run (baseline)
run1 = dbnl.report_run_with_results(
    project=project,
    display_name="run1",
    column_data=pd.DataFrame([
        {
            "question": f"Is {i} an even or odd number?",
            "answer": random.choice(["even", "odd"]),
        }
        for i in range(20)
    ]).astype({"question": "string", "answer": "category"}),
)

# Submit a second run (experiment)
run2 = dbnl.report_run_with_results(
    project=project,
    display_name="run2",
    column_data=pd.DataFrame([
        {
            "question": f"Is {i} an even or odd number?",
            "answer": random.choice(["even", "odd"]),
        }
        for i in range(20)
    ]).astype({"question": "string", "answer": "category"}),
)

# Create a test session.
dbnl.set_run_as_baseline(run=run1)
dbnl.create_test_session(experiment_run=run2)

View your Test Session results

Congratulations! You ran your first Test Session. You can see the results of the Test Session by navigating to your project in the dbnl app and selecting your test session from the test session table.

By default, a similarity index test is added that tests whether your application has changed between the baseline and experiment run.

Next Steps

Learning about Distributional

Distributional Concepts

Understanding key concepts and their role relative to your app

Adaptive testing for AI applications requires more information than standard deterministic testing. This is because:

AI applications are multi-component systems where changes in one part can affect others in unexpected ways. For instance, a change in your vector database could affect your LLM's responses, or updates to a feature pipeline could impact your machine learning model's predictions.
AI applications are non-stationary, meaning their behavior changes over time even if you don't change the code. This happens because the world they interact with changes - new data comes in, language patterns evolve, and third-party models get updated. A test that passes today might fail tomorrow, not because of a bug, but because the underlying conditions have shifted.
AI applications are non-deterministic. Even with the exact same input, they might produce different outputs each time. Think of asking an LLM the same question twice - you might get two different, but equally valid, responses. This makes it impossible to write traditional tests that expect exact matches.

To account for this, each time you want to measure the behavior of the AI application, you will need to:

Record outcomes at all of the app’s components, and
Push a distribution of inputs through the app to study behavior across the full spectrum of possible app usage.

The inputs, outputs, and outcomes associated with a single app usage are grouped in a Result, with each value in a result described as a Column. The group of results that are used to measure app behavior is called a Run. To determine if an app is behaving as expected, you create a Test, which involves statistical analysis on one or more runs. When you apply your tests to the runs that you want to study, you create a Test Session, which is a permanent record of the behavior of an app at a given time.

Why We Test Data Distributions

Adaptive testing requires a very different approach than traditional software testing. The goal of adaptive testing is to enable teams to define a steady baseline state for any AI application, and through testing, confirm that it maintains steady state, and where it deviates, figure out what needs to evolve or be fixed to reach steady state once again. This process needs to be discoverable, logged, organized, consistent, integrated and scalable.

Testing AI applications needs to be fundamentally reimagined to include statistical tests on distributions of quantities to detect meaningful shifts that warrant deeper investigation.

Distributions > Summary Statistics: Instead of only looking at summary statistics (e.g. mean, median, P90), we need to analyze distributions of metrics, over time. This accounts for the inherent variability in AI systems while maintaining statistical rigor.

Why is this useful? Imagine you have an application that contains an LLM and you want to make sure that the latency of the LLM remains low and consistent across different types of queries. With a traditional monitoring tool, you might be able to easily monitor P90 and P50 values for latency. P50 represents the latency value below which 50% of the requests fall and will give you a sense of the typical (median) response time that users can expect from the system. However, the P50 value for a normal distribution and bimodal distribution can be the same value, even though the shape of the distribution is meaningfully different. This can hide significant (usage-based or system-based) changes in the application that affect the distribution of the latency scores. If you don’t examine the distribution, these changes go unseen.

Consider a scenario where the distribution of LLM latency started with a normal distribution, but due to changes in a third-party data API that your app uses to inform the response of the LLM, the latency distribution becomes bimodal, though with the same median (and P90 values) as before. What could cause this? Here’s a practical example of how something like this could happen. The engineering team of the data API organization made an optimization to their API which allows them to return faster responses for a specific subset of high value queries, and routes the remainder of the API calls to a different server which has a slower response rate.

The effect that this has on your application is that now half of your users are experiencing an improvement in latency, and now a large number of users are experiencing “too much” latency and there’s an inconsistent performance experience among users. Solutions to this particular example include modifying the prompt, switching the data provider to a different source, format the information that you send to the API differently or a number of other engineering solutions. If you are not concerned about the shift and can accept the new steady state of the application, you can also choose to not make changes and declare a new acceptable baseline for the latency P50 value.

The Flow of Data

Your data + DBNL testing == insights about your app's behavior

Distributional uses data generated by your AI-powered app to study its behavior and alert you to valuable insights or worrisome trends. The diagram below gives a quick summary of this process:

Each app usage involves input(s), the resulting output(s), and context about that usage
- Example: Input is a question from a user; Output is your app’s answer to that question; Context is the time/day that the question was asked.
As the app is used, you record and store the usage in a data store for later review
- Example: At 2am every morning, an Airflow job parses all of the previous day’s app usages and sends that info to a data store.
When data is moved to your data store, it is also submitted to DBNL for testing.
- Example: The 2am Airflow job is amended to include data augmentation by DBNL Eval and uploading of the resulting Run to trigger automatic app testing.

A Run usually contains many (e.g., dozens or hundreds) rows of inputs + outputs + context, where each row was generated by an app usage. Our insights are statistically derived from the distributions estimated by these rows.

Using Distributional

Projects

What's in a Project?

Creating a Project

From Scratch

You can create a Project either via the UI or the SDK:

Simply click the "Create Project" button from the Project list view.

import dbnl
dbnl.login()

project = dbnl.create_project(
    name="My Project",
    description="This is a very important project."
)

Copying a Project

You can quickly copy an existing Project to get up and running with a new one. This will copy the following items into your new Project:

Test specifications
Test tags
Notification rules

There are a couple of ways to copy a Project.

Exporting and Importing

Any Project can be exported to a JSON file; that JSON file can then be adjusted to your liking and imported as a new Project. This is doable both via the UI and the SDK:

Exporting

To export a Project, simply click the download icon on the Project page, in the header.

This will download the Project's JSON to your computer. There is an example JSON in the expandable section below.

Importing

Once you have a Project JSON, you can edit it as you'd like, and then import it by clicking the "Create Project" button on the Project list and then clicking the "Import from File" tab.

Fill out the name and description, click "Create Project", and you're all set!

Exporting and importing a Project is done easily via the SDK functions export_project_as_json and import_project_from_json.

import dbnl
dbnl.login()

# Export
project_1 = dbnl.get_or_create_project(name="Existing Project")
export_json = dbnl.export_project_as_json(project=proj1)
# Adjust the project values as you'd like. You will need to change the name.
# An example of the JSON structure is in the collapsible section below
export_json["project"]["name"] = "New Project"

# Import
project_2 = dbnl.import_project_from_json(params=export_json)

Sample Project JSON

{
    "project": {
        "name": "My Project",
        "description": "This is my project."
    },
    "notification_rules": [
        {
            "conditions": [
                {
                    "assertion_name": "less_than",
                    "assertion_params": { "other": 0.85 },
                    "query_name": "test_status_percentage_query",
                    "query_params": {
                        "exclude_tag_ids": [],
                        "include_tag_ids": [],
                        "require_tag_ids": [],
                        "statuses": ["PASSED"]
                    }
                }
            ],
            "name": "Alert if passed tests are less than 85%",
            "notification_integration_names": ["Notification channel"],
            "status": "ENABLED",
            "trigger": "test_session.failed"
        }
    ],
    "tags": [
        {
            "name": "my-tag",
            "description" :"This is my tag."
        }
    ],
    "test_specs": [
        {
            "assertion": { "name": "less_than", "params": { "other": 0.5 } },
            "description": "Testing the difference in the example statistic",
            "name": "Gr.0: Non Parametric Difference: Example_Statistic",
            "statistic_inputs": [
                {
                    "select_query_template": {
                        "filter": null,
                        "select": "{EXPERIMENT}.Example_Statistic"
                    }
                },
                {
                    "select_query_template": {
                        "filter": null,
                        "select": "{BASELINE}.Example_Statistic"
                    }
                }
            ],
            "statistic_name": "my_stat",
            "statistic_params": {},
            "tag_names": ["my-tag"]
        }
    ]
}

Copying

You can also just directly copy a given Project. Again, this can be done via the UI or the SDK:

There are two ways to copy a Project from the UI:

In the Project list, after you click "Create Project", you can navigate to the "Copy Existing" tab and choose a Project from the dropdown.

From a Project Page

While viewing a Project, you can click the copy icon in the header to copy it to a new Project.

Copying a Project is done easily via the SDK function copy_project.

import dbnl
dbnl.login()


project_1 = dbnl.get_or_create_project(name="Existing Project")
project_2 = dbnl.copy_project(project=project_1, name="New Project")

Runs

The Run is the core object for recording an application's behavior; when you upload a dataset from usage of your app, it takes the shape of a Run. As such, you can think of a Run as the results from a batch of uses or from a standard example set of uses of your application. When exploring or testing your app's behavior, you will look at the Run in dbnl either in isolation or in comparison to another Run.

What's in a Run?

A Run contains the following:

a table of results where each row holds the data related to a single app usage (e.g. a single input and output along related metadata),
a set of Run-level values, also known as scalars,
structural information about the components of the app and how they relate, and
user-defined metadata for remembering the context of a run.

Your Project will contain many Runs. As you report Runs into your Project, dbnl will build a picture of how your application is behaving, and you will utilize tests to verify that its behavior is appropriate and consistent. Some common usage patterns would be reporting a Run daily for regular checkpoints or reporting a Run each time you deploy a change to your application.

The structure of a Run is defined by its schema. This informs dbnl about what information will be stored in each result (the columns), what Run-level data will be reported (the scalars), and how the application is organized (the components).

Baseline and Experiment Runs

Reporting Runs

The full process of reporting a Run ultimately breaks down into three steps:

Creating the Run, which includes defining its structure and any relevant metadata
Reporting the results of the Run, which include columnar data and scalars
Closing the Run to mark it as complete once reporting is finished

Creating a Run

The important parts of creating a run are providing identifying information — in the form of a name and metadata — and defining the structure of the data you'll be reporting to it. As mentioned in the previous section, this structure is called the Run Schema.

Run Schema

A Run schema defines four aspects of the Run's structure:

Columns (the data each row in your results will contain)
Scalars (any Run-level data you want to report)
Index (which column or columns uniquely identify rows in your results)
Components (functional groups to organize the reported results in the form of a graph)

Columns

Columns are the only required part of a schema and are core to reporting Runs, as they define the shape your results will take. You report your column schema as a list of objects, which contain the following fields:

name: The name of the column
description: A descriptive blurb about what the column is
component: Which part of your application the column belongs to (see Components below)

Example Columns JSON

[
    { 
        "name": "error_type",
        "type": "category",
        "component": "classifier"
    },
    {
        "name": "email",
        "type": "string",
        "description": "raw email text content from source",
        "component": "input"
    },
    { 
        "name": "spam-pred",
        "type": "boolean",
        "component": "classifier"
    },
    {
        "name": "email_id",
        "type": "string",
        "description": "unique id for each email"
    }
]

Scalars

Example Scalars JSON

[
    {
        "name": "model_F1",
        "type": "float",
        "description": "F1 Score",
        "component": "classifier"
    },
    { 
        "name": "model_recall",
        "type": "float",
        "description": "Model Recall",
        "component": "classifier"
    }
]

Index

Using the index field within the schema, you have the ability to designate Unique Identifiers – specific columns which uniquely identify matching results between Runs. Adding this information facilitates more direct comparisons when testing your application's behavior and makes it easier to explore your data.

Example Index JSON

["email_id"]

Components

Example Components JSON

// Each key defines a component, and the corresponding list defines the
// components downstream from it in your DAG
{
    "input": ["classifier"]
    "classifier": [],
}

Note that if you do not provide a schema when you report a run, dbnl will infer one from the structure of the results you've uploaded. You can additionally still provide an index parameter directly to the report_run_with_results function.

Reporting Run Results

Once you've defined the structure of your run, you can upload data to dbnl to report the results of that run. As mentioned above, there are two kinds of results from your run:

The row-level column results (these each represent the data of a single "usage" of your application)
The Run-level scalar results (these represent data that apply to all usages in your Run as a whole)

dbnl expects you to upload your results data in the form of a pandas DataFrame. Note that scalars can be uploaded as a single-row DataFrame or as a dictionary of values.

Example Results

import pandas as pd

column_results = pd.DataFrame({
    "error_type": ["none", "none", "none", "none"],
    "email": [
        "Hello, I am interested in your product. Please send me more information.",
        "Congratulations! You've won a lottery. Click here to claim your prize.",
        "Hi, can we schedule a meeting for next week?",
        "Don't miss out on this limited time offer! Buy now and save 50%."
    ],
    "spam-pred": [False, True, False, True],
    "email_id": ["1", "2", "3", "4"]
})

scalar_results = pd.DataFrame({
    "model_F1": [0.8],
    "model_recall": [0.74]
})
# Above is equalent to:
scalar_results = {
    "model_F1": 0.8,
    "model_recall": 0.74 
}

Closing a Run

Putting it All Together

Now that you understand each step, you can easily integrate all of this into your codebase with a few simple function calls via our SDK:

import dbnl
import pandas as pd
dbnl.login()


proj = dbnl.get_or_create_project(name="My Project")
run_schema = dbnl.create_run_schema(
    columns=[
        {"name": "error_type", "type": "category", "component": "classifier"},
        {"name": "email", "type": "string", "description": "raw email text content from source", "component": "input"},
        {"name": "spam-pred", "type": "boolean", "component": "classifier"},
        {"name": "email_id", "type": "string", "description": "unique id for each email"},
    ],
    scalars=[
        {
            "name": "model_F1",
            "type": "float",
            "description": "F1 Score",
            "component": "classifier"
        },
        { 
            "name": "model_recall",
            "type": "float",
            "description": "Model Recall"
        }
    ],
    index=["email_id"],
    components_dag={
        "input": ["classifier"]
        "classifier": [],
    }
)
# Creates the run, reports results, and closes the run.
run = dbnl.report_run_with_results(
    project=proj,
    display_name="Run 1 of Email Classifier"
    run_schema=run_schema,
    column_data=pd.DataFrame({
        "error_type": ["none", "none", "none", "none"],
        "email": [
            "Hello, I am interested in your product. Please send me more information.",
            "Congratulations! You've won a lottery. Click here to claim your prize.",
            "Hi, can we schedule a meeting for next week?",
            "Don't miss out on this limited time offer! Buy now and save 50%."
        ],
        "spam-pred": [False, True, False, True],
        "email_id": ["1", "2", "3", "4"]
    }),
    scalar_data={
        "model_F1": 0.8,
        "model_recall": 0.74 
    }
)

Setting a Baseline Run

Dynamic Baseline (Run Queries)

Setting a Default Baseline Run

You can set a default Baseline Run to be used in all Test Sessions either via the UI or the SDK. Additionally, you can create a Run Query to make your Baseline Run dynamic for each Test Session.

Metrics

What are Metrics?

Metrics are measurable properties that help quantify specific characteristics of your data. Metrics can be user-defined, by providing a numeric column computed from your source data alongside your application data.

Alternatively, the Distributional SDK offers a comprehensive set of metrics for evaluating various aspects of text and LLM outputs. Using Distributional's methods for computing metrics will enable better data-exploration and application stability monitoring capabilities.

Using Metrics

The SDK provides convenient functions for computing metrics from your data and reporting the results to Distributional:

import dbnl
import dbnl.eval
import pandas as pd

# login to dbnl
dbnl.login()
project = dbnl.create_project(name="Metrics Project")

df = pd.DataFrame(
    {
        "id": [1, 2, 3],
        "question": [
            "What is the meaning of life?",
            "What is the airspeed velocity of an unladen swallow?",
            "What is the capital of Assyria?",
        ],
        "answer": [
            "To be happy and fulfilled.",
            "It's a question of aerodynamics.",
            "Nineveh was the capital of Assyria.",
        ],
        "expected_answer": [
            "42",
            "It's a question of aerodynamics.",
            "Nineveh was the capital of Assyria.",
        ],
    }
)

# Create individual metrics
metrics = [
    dbnl.eval.metrics.token_count("question"),
    dbnl.eval.metrics.word_count("question"),
    dbnl.eval.metrics.rouge1("answer", "expected_answer"),
]

# Compute metrics and report results to Distributional
run = dbnl.eval.report_run_with_results(
    project=project, column_data=df, metrics=metrics
)

Convenience Functions

The SDK includes helper functions for creating common groups of related metrics based on consistent inputs.

import dbnl
import dbnl.eval
import pandas as pd

# login to dbnl
dbnl.login()
project = dbnl.create_project(name="Metrics Project")

df = pd.DataFrame(
    {
        "id": [1, 2, 3],
        "question": [
            "What is the meaning of life?",
            "What is the airspeed velocity of an unladen swallow?",
            "What is the capital of Assyria?",
        ],
        "answer": [
            "To be happy and fulfilled.",
            "It's a question of aerodynamics.",
            "Nineveh was the capital of Assyria.",
        ],
        "expected_answer": [
            "42",
            "It's a question of aerodynamics.",
            "Nineveh was the capital of Assyria.",
        ],
    }
)

# Get standard text evaluation metrics
text_eval_metrics = dbnl.eval.metrics.text_metrics(
    prediction="answer", target="expected_answer"
)

# Get comprehensive QA evaluation metrics
qa_metrics = dbnl.eval.metrics.question_and_answer_metrics(
    prediction="answer",
    target="expected_answer",
    input="question",
)

# Compute metrics and report results to Distributional
run = dbnl.eval.report_run_with_results(
    project=project, column_data=df, metrics=(text_eval_metrics + qa_metrics)
)

Tests

Tests are the key tool within dbnl for asserting the behavior and consistency of Runs. Possible goals during testing can include:

Asserting that your application, holistically or for a chosen column, behaves consistently compared to a baseline.
Asserting that a chosen column meets its minimum desired behavior (e.g., inference throughput);
Asserting that a chosen column has a distribution that roughly matches a baseline reference;

What's in a Test?

At a high level, a Test is a statistic and an assertion. Generally, the statistic aggregates the data in a column or columns, and the assertion tests some truth about that aggregation. This assertion may check the values from a single Run, or it may check how the values in a Run have changed compared to a baseline. Some basic examples:

Assert the 95th percentile of app_latency_ms is less than or equal to 180

Test Spec JSON

{
    "name": "p95_app_latency_ms",
    "description": "Test the 95th percentile of latency in miliseconds",
    "statistic_name": "percentile",
    "statistic_params": {"percentage": 0.95},
    "assertion": {
        "name": "less_than_or_equal_to",
        "params": {
            "other": 180.0,
        },
    },
    "statistic_inputs": [
        {
            "select_query_template": {
                "select": "{EXPERIMENT}.app_latency_ms"
            }
        },
    ],
}

Assert the absolute difference of median of positive_sentiment_score against the baseline is close to 0

Test Spec JSON

{
    "name": "median_sentiment_similar",
    "description": "Test the absolute difference of median on sentiment",
    "statistic_name": "abs_diff_median",
    "statistic_params": {},
    "assertion": {
        "name": "close_to",
        "params": {
            "other": 0.0,
            "tolerance": 0.01,
        },
    },
    "statistic_inputs": [
        {
            "select_query_template": {
                "select": "{EXPERIMENT}.positive_sentiment_score"
            }
        },
        {
            "select_query_template": {
                "select": "{BASELINE}.positive_sentiment_score"
            }
        },
    ],
}

In the next sections, we will explore the objects required for testing alongside the methods for creating tests, running tests, reviewing/analyzing tests, and some best practices.

Creating Tests

As you become more familiar with the behavior of your application, you may want to build on the default App Similarity Index test with tests that you define yourself. Let's walk through that process.

Designing a Test

Context-Driven Test Creation

As you browse the dbnl UI, you will see "+" icons or "+ Add Test" buttons appear. These provide context-aware shortcuts for easily creating relevant tests.

At each of these locations, a test creation drawer will open on the right side of the page with several of the fields pre-populated based on the context of the button, alongside a history of the statistic, if relevant. Here are some of the best places to look for dbnl-assisted test creation:

Key Insights

Column or Metric Details

When inspecting the details of a column or metric from a Test Session, there are several "Add Test" buttons provided to allow you to quickly create a test on a relevant statistic. The Statistic History graph can help guide you on choosing a threshold.

Summary Statistics Table

When viewing a Run, each entry in the summary statistics table can be used to seed creation of a test for that chosen statistic.

And More!

These shortcuts appear in several other places in the UI as well when you are inspecting your Runs and Test Sessions; keep an eye out for the "+"!

Templated Tests

Test templates are macros for basic test patterns recommended by Distributional. It allows the user to quickly create tests from a builder in the UI. Distributional provides five classes of test templates:

From the Test Configuration tab on your Project, click the dropdown next to "Add Test".

Select from one of the five options. A Test Creation drawer will appear and the user can edit the statistic, column, and assertion that they desire. Note that each Test Template has a limited set of statistics that it supports.

Creating Tests Manually (Advanced)

If you have a good idea of what you want to test or just want to explore, you can create tests manually from either the UI or via the Python SDK.

Let's say you are building an Q&A chatbot, and you have a column for the length of your bot's responses, word_count. Perhaps you want to ensure that your bot never outputs more than 100 words; in that case, you'd choose:

The statistic max,
The assertion less than or equal to ,
and the threshold 100.

But what if you're not opinionated about the specific length? You just want to ensure that your app is behaving consistently as it runs and doesn't suddenly start being unusually wordy or terse. dbnl makes it easy to test that as well; you might go with:

The statistic absolute difference of mean,
The assertion less than,
and the threshold 20.

Now you're ready to go and create that test, either via the UI or the SDK:

From your Project, click the "Test Configuration" tab.

Next to the "My Tests" header, you can click "Add Test" to open the test creation page, which will enable you to define your test through the dropdown menu on the left side of the window.

Available Statistics and Assertions

Statistics

Statistic

Description

absolute difference of max

absolute difference of mean

absolute difference of median

absolute difference of min

absolute difference of percentile

Requires percentage as a parameter.

absolute difference of standard deviation

absolute difference of sum

Category Rank Discrepancy

Computes the absolute difference in the proportion of the specified category between the experiment and baseline runs. The category is specified by its rank in the baseline run.

Requires rankas a parameter: can be one of [most_common, second_most_common, not_top_two].

Chi-squared stat, scaled

Kolmogorov-Smirnov stat, scaled

max

mean

median

min

mode

Null Count

Computes the number of None values in a column.

Null Percentage

Computes the fraction of None values in a column.

percentile

Requires percentage as a parameter.

scalar

signed difference of max

signed difference of mean

signed difference of median

signed difference of min

signed difference of percentile

Requires percentage as a parameter.

signed difference of standard deviation

signed difference of sum

standard deviation

sum

Assertions

Assertion

between

between or equal to

close to

equal to

greater than

greater than or equal to

less than

less than or equal to

not equal to

outside

outside or equal to

Running Tests

You can run any tests you've created (or just the default App Similarity Index test) to investigate the behavior of your application.

Running Your Tests

When you run a Test Session, you are running your tests against a given Experiment Run.

Choose a Baseline Run

Create a Test Session

Tests are run within the context of a Test Session, which is effectively just a collection of tests run against an Experiment Run with a Baseline Run. You can create a Test Session, which will immediately run the tests, via the UI or the SDK:

Regardless of how you choose to create your Test Session, you can specify tags to choose a subset of tests to run in that given session. The following options for tags are available:

Include Tags: Only tests with any of these tags will be run
Exclude Tags: Only tests with none of these tags will be run
Required Tags : Only tests with every one of these tags will be run

Reviewing Tests

The Test Sessions section in your Project is a record of all the Test Sessions you've created. You can view a line chart of the pass rate of your Test Sessions over time or view a table with each row representing a Test Session You can click on a point in the chart or a row in the table to navigate to the corresponding Test Session's detail page to dig into what happened within that session.

Test Session Details

When you first open a Test Session's page, you will land on the Summary tab. This tab provides you with summary information about the session such as the App Similarity Index, which tests have failed, and key insights about the session. There are also tabs to see the Similarity Report (more information below) or to view all the test results within the session.

Similarity Indexes

By default, dbnl creates a App Similarity Index test in your project. This tests that the Similarity Index for your application is over 80.

Key Insights

On the Summary tab, you'll notice a list of key insights that dbnl has discovered about your Test Session. The key insights will tell you at a glance which columns or metrics have had the most significant change in your Experiment Run when compared to the baseline. If you are particularly interested in the column or metric going forward, you can quickly add a test for its Similarity Index.

Expanding one of these will allow you to view some additional information such as a history of the Similarity Index for the related column or metric; if you are viewing a metric, it will also tell you the lineage of which columns the metric is derived from.

Similarity Report

The Similarity Report gives you an overview of all the columns your Experiment Run, providing the relevant Similarity Indexes, the ability to quickly create tests from them, and the option to deep-dive into a column. Expanding one of the rows for a column for show you all the metrics calculated for that column, with their own respective Similarity Indexes and details.

If you click on the "See Details" link on any of these rows (or from the Key Insights view), you'll be taken to a view that lets you explore the respective column or metric in detail.

From this view, you can easily compare the changes in the column/metric with graphs and summary statistics. Expanding of the comparison statistics will give you even more information to dig into! Click "Add Test" to quickly create a test on the related statistic.

What Is a Similarity Index?

Similarity Index is a single number between 0 and 100 that quantifies how much your application’s behavior has changed between two runs – a Baseline and an Experiment run. It is Distributional’s core signal for measuring application drift, automatically calculated and available in every Test Session.

A lower score indicates a greater behavioral change in your AI application. Each Similarity Index has accompanying Key Insights with a description to help users understand and act on the behavioral drift that Distributional has detected.

Where You’ll See It in the UI

Test Session Summary Page — App-level Similarity Index, results of failed tests, and Key Insights
Similarity Report Tab — Breakdown of Similarity Indexes by column and metric
Column Details View — Histograms and statistical comparison for specific metrics
Tests View — History of Similarity Index-based test pass/fail over time

Why It Matters

When model behavior changes, you need:

A clear signal that drift occurred
An explanation of what changed
A workflow to debug, test, and act

Similarity Index + Key Insights provides all three.

Example:

An app’s Similarity Index drops from 93 → 46
Key Insight: “Answer similarity has decreased sharply”
Metric: levenshtein__generated_answer__expected_answer

Result: Investigate histograms, set test thresholds, adjust model

Hierarchical Structure

Similarity Index operates at three levels:

Application Level — Aggregates all lower-level scores
Column Level — Individual column-level drift
Metric Level — Fine-grained metric change (e.g., readability, latency, BLEU score)

Each level rolls up into the one above it. You can sort by Similarity Index to find the most impacted parts of your app.

Test Sessions and Thresholds

By default, a new DBNL project comes with an Application-level Similarity Index test:

Threshold: ≥ 80
Failure: Indicates meaningful application behavior change

In the UI:

Passed tests are shown in green
Failed tests are shown in red with diagnostic details

All past test runs can be reviewed in the test history.

Key Insights

Key Insights are human-readable interpretations of Similarity Index changes. They answer:

“What changed, and does it matter?”

Each Key Insight includes:

A plain-language summary: “Distribution substantially drifted to the right”
The associated column/metric
The Similarity Index for that metric
Option to add a test on the spot

Example:

Distribution substantially drifted to the right.

→ Metric: levenshtein__generated_answer__expected_answer

→ Similarity Index: 46

→ Add Test

Insights are prioritized and ordered by impact, helping you triage quickly.

Deep Dive: Column Similarity Details

Clicking into a Key Insight opens a detailed view:

Histogram overlays for experiment vs. baseline
Summary statistics (mean, median, percentile, std dev)
Absolute difference of statistics between runs
Links to add similarity or statistical tests on specific metrics

This helps pinpoint whether drift was due to longer answers, slower responses, or changes in generation fidelity.

Frequently Asked Questions

What’s considered “low” similarity?

Below 80 = significant drift (default failure threshold)

Below 60 = usually signals substantial regression or change

Can I configure the thresholds?

Yes — Similarity Index thresholds can be adjusted, and custom tests can be created at any level (app, column, metric).

Do I need to set anything up to use Similarity Index?

No. For all numeric columns that overlap between Baseline and Experiment runs, and non-numeric columns with defined metrics, this is automatically run.

What columns does Similarity Index apply to?

Only numeric columns and derived metrics (e.g., response time, BLEU, readability). String values are not supported yet.

Example Workflow

Run a test session
Similarity Index < 80 → test fails
Review top-level Key Insights
Click into a metric (e.g., levenshtein__generated_answer__expected_answer)
View distribution shift and statistical breakdown
Add targeted test thresholds to monitor ongoing behavior
Adjust model, prompt, or infrastructure as needed

Users and Permissions

Discover how dbnl manages user permissions through a layered system of organization and namespace roles—like org admin, org reader, namespace admin, writer, and reader.

Users

A user is an individual who can log into a dbnl organization.

Permissions

Permissions are settings that control access to operations on resources within a dbnl organization. Permissions are made up of two components.

Resource: Defines which resource is being controlled by this permission (e.g. projects, users).
Verb: Defines which operations are being controlled by this permission (e.g. read, write).

For example, the projects.read permission controls access to the read operations on the projects resource. It is required to be able to list and view projects.

Roles

A role consists in a set of permissions. Assigning a role to a user gives the user all the permissions associated with the role.

Roles can be assigned at the organization or namespace level. Assigning roles at the namespace level allows for giving users granular access to projects and their related data.

Org Roles

An org role is a role that can be assigned to a user within an organization. Org role permissions apply to resources across all namespaces.

There are two default org roles defined in every organization.

Org admin

The org admin role has read and write permissions for all org level resources making it possible to perform organization management operations such as creating namespaces and assigning users roles.

By default, the first user in an org is assigned the org admin role.

Org reader

The org reader role has read-only permissions to org level resources making it possible to navigate the organization by listing users and namespaces.

By default, all users are assigned the org reader role.

Assigning a User an Org Role

To assign a user an org role, go to ☰ > Settings > Admin > Users, scroll to the relevant user and select the an org role from the dropdown in the Org Role column.

Assigning a user an org role requires having the org admin role.

Namespace Roles

A namespace role is a role that can be assigned to a user within a namespace. Namespace role permissions only apply to resources defined within the namespace in which the role is assigned.

There are three default namespace roles defined in every organization.

Namespace admin

The namespace admin role has read and write permissions for all namespace level resources within a namespace making it possible to perform namespace management operations such as assigning users roles within a namespace.

By default, the creator of a namespace is assigned the namespace admin role in that namespace.

Namespace writer

The namespace admin role has read and write permissions for all namespace level resources within a namespace except for those resources and operations related to namespace management such as namespace role assignments.

By default, all users are assigned the namespace writer role in the default namespace.

(Experimental) Namespace reader

The namespace reader role has read-only permissions for all namespace level resources within a namespace.

This is an experimental role that is available through the API, but is not currently fully supported in the UI.

Assigning a User a Namespace Role

To assign a user a namespace role within a namespace, go to ☰ > Settings > Admin > Namespaces, scroll and click on the relevant namespace and then click + Add User.

Assigning a user a namespace role requires having the org admin role or the namespace admin role in that namespace.

Tokens

Tokens are used for programmatic access to the dbnl platform.

Personal Access Tokens

A personal access token is a token that can be used for programmatic access to the dbnl platform through the SDK.

Tokens are not revocable at this time. Please remember to keep your tokens safe.

Permissions

Token permissions are resolved at use time, not creation time. As such, changing the user permissions after creating a personal access token will change the permissions of the personal access token.

Create a Personal Access Token

To create a new personal access token, go to ☰ > Personal Access Tokens and click Create Token.

Platform

Sandbox

Instructions for managing a dbnl Sandbox deployment.

The dbnl sandbox deployment bundles all of the dbnl services and dependencies into a single self-contained Docker container. This container replicates a full scale dbnl deployment by creating a Kubernetes cluster in the container and using Helm to deploy the dbnl platform and its dependencies (postgresql, redis and minio).

The sandbox deployment is not suitable for production environments.

Requirements

The sandbox container needs access to the following two registries to pull the containers for the dbnl platform and its dependencies.
- us-docker.pkg.dev
- docker.io
The sandbox container needs sufficient memory and disk space to schedule the k3d cluster and the containers for the dbnl platform and its dependencies.

Registry Credentials

Usage

Although the sandbox image can be deployed manually using Docker, we recommend using the dbnl CLI to manage the sandbox container. For more details on the sandbox CLI options, run:

$ dbnl sandbox --help

Start the Sandbox

To start the dbnl Sandbox, run:

$ dbnl sandbox start -p ${REGISTRY_PASSWORD}

This will start the sandbox in a Docker container named dbnl-sandbox. It will also create a Docker volume of the same name to persist data beyond the lifetime of the sandbox container.

Stop the Sandbox

To stop the dbnl sandbox, run:

$ dbnl sandbox stop

This will stop and remove the sandbox container. It does not remove the Docker volume and the next time the sandbox is started, it will remount the existing volume, persisting the data beyond the lifetime of the Sandbox container.

Get Sandbox Status

To get the status of the dbnl sandbox, run:

$ dbnl sandbox status

Get Sandbox Logs

To tail the dbnl sandbox logs, run:

$ dbnl sandbox logs

Execute Command in Sandbox

To execute a command in the dbnl sandbox, run:

$ dbnl sandbox exec [COMMAND]

This will execute COMMAND within the dbnl sandbox container. This is a useful tool for debugging the state of the containers running within the sandbox containers. For example:

To get a list of all Kubernetes resources, run:

$ dbnl sandbox exec kubectl get all

To get the logs for a particular pod, run:

$ dbnl sandbox exec kubectl logs [POD]

Delete Sandbox Data

This is an irreversible action. All the sandbox data will be lost forever.

To delete the sandbox data, run:

$ dbnl sandbox delete

Authentication

The sandbox deployment uses username and password authentication with a single user. The user credentials are:

Username: admin
Password: password

Storage

The sandbox persists data in a Docker volume named dbnl-sandbox. This volume is persisted even if the sandbox is stopped, making it possible to later resume the sandbox without losing data.

Remote Sandbox

If deploying and hosting the sandbox on a remote host, such as on EC2 or Compute Engine, the sandbox --base-url option needs to be set on start.

For example, if hosting the sandbox on http://example.com:8080, the sandbox needs to be started with:

$ dbnl sandbox start --base-url http://example.com:8080

Currently, the sandbox does not support being hosted from a subpath (e.g. http://example.com:8080/dbnl) or being served from a different port. If those are required, we recommend using a reverse proxy.

Architecture

An overview of the architecture for the dbnl platform

The dbnl platform architecture consists of a set of services packaged as Docker images and a set of standard infrastructure components.

Infrastructure

The dbnl platform requires the following infrastructure:

A Kubernetes cluster to host the dbnl platform services.
A PostgreSQL database to store metadata.
An object store bucket to store raw data.
A Redis database to serve as a messaging queue.
A load balancer to route traffic to the API or UI service.

Services

The dbnl platform consists in three core services:

The API service (api-srv) serves the dbnl API and orchestrates work across the dbnl platform.
The worker service (worker-srv) processes async jobs scheduled by the API service.
The UI service (ui-srv) serves the dbnl UI assets.

Deployment

Instructions for self-hosted deployment options

There are two main options to deploy the dbnl platform as a self-hosted deployment:

Helm chart: The dbnl platform can be deployed using a Helm chart to existing infrastructure provisioned by the customer.
Terraform module: The dbnl platform can be deployed using a Terraform module on infrastructure provisioned by the module alongside the platform. This is options is supported on AWS and GCP.

Which option to choose depends on your situation. The Helm chart provides maximum flexibility, allowing users to provision their infrastructure using their own processes, while the Terraform module provides maximum simplicity, reducing the installation to single Terraform command.

Terraform Module

Terraform module installation instructions

The Terraform module option provides maximum simplicity. It provisions all the required infrastructure and permissions in your cloud provider of choice before deploying the dbnl platform Helm chart, removing the need to provision any infrastructure or permission separately.

Prerequisites

The following prerequisite steps are required before starting the Terraform module installation.

Configuration

To configure the Terraform module, you will need:

A domain name to host the dbnl platform (e.g. dbnl.example.com).

An RSA key pair can be generated with:

Requirements

On the environment from which you are planning to install the module, you will need to:

Infrastructure

At a minimum, the user performing the installation needs to be able to provision the following infrastructure:

Soon.

Installation

Steps

The steps to install the Terraform module using the Terraform CLI are as follows:

Create a dbnl folder and change to it.

Create a modules folder and copy the terraform module to it.

Create a variables.tf file.

Create a main.tf file.

Create a dbnl.tfvars file.

Initialize the Terraform module.

Apply the Terraform module.

Soon.

Options

For more details on all the installation options, see the Terraform module README file and examples folder.

Networking

List of networking requirements

Ingress

Requirements

The dbnl platform needs to be hosted on a domain or subdomain (e.g. dbnl-example.com or dbnl.example.com). It cannot be hosted on a subpath.

HTTPS/SSL

It is recommended that the dbnl platform be served over HTTPS. Support for SSL termination at the load balancer is included.

Egress

Requirements

Currently, the dbnl platform cannot run in an air-gapped environment and requires a few URLs to be accessible via egress.

Artifacts Registry

Required to fetch the dbnl paltform artifacts such as the Helm chart and Docker images.

https://us-docker.pkg.dev/dbnlai/

Object Store

Required for services to access the object store.

https://{BUCKET}.s3.amazonaws.com/ (if using S3)
https://storage.googleapis.com/{BUCKET} (if using GCS)

OIDC

Required to validate OIDC tokens.

https://login.microsoftonline.com/{APP_ID}/v2.0/ (if using Microsoft EntraID)
https://{ACCOUNT}.okta.com/ (if using Okta)

Integrations

Required to use some integrations.

https://events.pagerduty.com/v2/enqueue (if using PagerDuty)
https://hooks.slack.com/services/ (if using Slack)

OIDC Authentication

OIDC configuration options

Configuration

OIDC can be configured using the following options in the dbnl Helm chart or Terraform module:

audience
clientId
issuer
scopes

Instructions on how to get those options for each provider can be found below.

Reference

Query Language

An overview of the dbnl Query Language

The dbnl Query Language is a SQL-like language that allows for querying data in runs for the purpose of drawing visualizations, defining metrics or evaluating tests.

Expressions

An expression is a combination of literals, values, operators, and functions. Expressions can evaluate to scalar or columnar values depending on their types and inputs. There are three types of expressions that can be composed into arbitrarily complex expressions.

Literal Expressions

Literal expressions are constant-valued expressions.

Column and Scalar Expressions

Column and scalar expressions are references to columns or scalar values in a run. They use dot-notation to reference a column or scalar within a run.

For example, a column named score in a run with id run_1234 can be referenced with the expression:

Function Expressions

Function expressions are functions evaluated over zero or more other expressions. They make it possible to compose simple expressions into arbitrarily complex expressions.

For example, the word_count function can be used to compute the word count of the text column in a run with id run_1234 with the expression:

Operators

Operators are aliases for function expressions that enhance readability and ease of use. Operator precedence is the same as that of most SQL dialect.

Arithmetic operators

Arithmetic operators provide support for basic arithmetic operations.

Comparison operators

Comparison operators provide support for common comparison operations.

Logical operators

Logical operators provide support for boolean comparisons.

Null Semantics

The dbnl Query Language follows the null semantics of most SQL dialect. With a few exception, when a null value is used as an input to a function or operator, the result is null.

One exception to this is boolean functions and operators where ternary logic is used similar to most SQL dialects.

Functions

abs

Returns the absolute value of the input.

Syntax

abs(expr)

add

Adds the two inputs.

Syntax

add(expr1, expr2)

and

Logical and operation of two or more boolean columns.

Syntax

and(expr1, expr2)

automated_readability_index

Returns the ARI (Automated Readability Index) which outputs a number that approximates the grade level needed to comprehend the text. For example if the ARI is 6.5, then the grade level to comprehend the text is 6th to 7th grade.

Syntax

automated_readability_index(expr)

bleu

Computes the BLEU score between two columns.

Syntax

bleu(expr1, expr2)

character_count

Returns the number of characters in a text column.

Syntax

character_count(expr)

Aliases

num_chars

divide

Divides the two inputs.

Syntax

divide(expr1, expr2)

equal_to

Computes the element-wise equal to comparison of two columns.

Syntax

equal_to(expr1, expr2)

Aliases

eq

filter

Filters a column using another column as a mask.

Syntax

filter(expr1, expr2)

flesch_kincaid_grade

Returns the Flesch-Kincaid Grade of the given text. This is a grade formula in that a score of 9.3 means that a ninth grader would be able to read the document.

Syntax

flesch_kincaid_grade(expr)

greater_than

Computes the element-wise greater than comparison of two columns. input1 > input2

Syntax

greater_than(expr1, expr2)

Aliases

gt

greater_than_or_equal_to

Computes the element-wise greater than or equal to comparison of two columns. input1 >= input2

Syntax

greater_than_or_equal_to(expr1, expr2)

Aliases

gte

is_valid_json

Returns true if the input string is valid json.

Syntax

is_valid_json(expr)

less_than

Computes the element-wise less than comparison of two columns. input1 < input2

Syntax

less_than(expr1, expr2)

Aliases

lt

less_than_or_equal_to

Computes the element-wise less than or equal to comparison of two columns. input1 <= input2

Syntax

less_than_or_equal_to(expr1, expr2)

Aliases

lte

levenshtein

Returns Damerau-Levenshtein distance between two strings.

Syntax

levenshtein(expr1, expr2)

list_has_duplicate

Returns True if the list has duplicated items.

Syntax

list_has_duplicate(expr)

list_len

Returns the length of lists in a list column.

Syntax

list_len(expr)

list_most_common

Most common item in list.

Syntax

list_most_common(expr)

multiply

Multiplies the two inputs.

Syntax

multiply(expr1, expr2)

negate

Returns the negation of the input.

Syntax

negate(expr)

not

Logical not operation of a boolean column.

Syntax

not(expr)

not_equal_to

Computes the element-wise not equal to comparison of two columns.

Syntax

not_equal_to(expr1, expr2)

Aliases

neq

or

Logical or operation of two or more boolean columns.

Syntax

or(expr1, expr2)

rouge1

Returns the rouge1 score between two columns.

Syntax

rouge1(expr1, expr2)

rouge2

Returns the rouge2 score between two columns.

Syntax

rouge2(expr1, expr2)

rougeL

Returns the rougeL score between two columns.

Syntax

rougeL(expr1, expr2)

rougeLsum

Returns the rougeLsum score between two columns.

Syntax

rougeLsum(expr1, expr2)

sentence_count

Returns the number of sentences in a text column.

Syntax

sentence_count(expr)

Aliases

num_sentences

subtract

Subtracts the two inputs.

Syntax

subtract(expr1, expr2)

token_count

Returns the number of tokens in a text column.

Syntax

token_count(expr)

word_count

Returns the number of words in a text column.

Syntax

word_count(expr)

Aliases

num_words

Python SDK

The primary mechanism for submitting data to Distributional is through our Python SDK. This section servers as a reference for the various functionalities available for interacting with dbnl via the SDK.

Example Usage

Versions

dbnl.eval.metrics

class dbnl.eval.metrics.Metric

column_schema() → RunSchemaColumnSchemaDict

Returns the column schema for the metric to be used in a run config.

Returns: _description_

component() → str | None

description() → str | None

Returns the description of the metric.

Returns: Description of the metric.

abstract evaluate(df: pd.DataFrame) → pd.Series[Any]

Evaluates the metric over the provided dataframe.

Parameters:df – Input data from which to compute metric.
Returns: Metric values.

abstract expression() → str

Returns the expression representing the metric (e.g. rouge1(prediction, target)).

Returns: Metric expression.

greater_is_better() → bool | None

If true, larger values are assumed to be directionally better than smaller once. If false, smaller values are assumged to be directionally better than larger one. If None, assumes nothing.

Returns: True if greater is better, False if smaller is better, otherwise None.

abstract inputs() → list[str]

Returns the input column names required to compute the metric. :return: Input column names.

abstract metric() → str

Returns the metric name (e.g. rouge1). :return: Metric name.

abstract name() → str

Returns the fully qualified name of the metric (e.g. rouge1__prediction__target).

Returns: Metric name.

run_schema_column() → RunSchemaColumnSchema

Returns the column schema for the metric to be used in a run config.

Returns: _description_

abstract type() → Literal['boolean', 'int', 'long', 'float', 'double', 'string', 'category']

Returns the type of the metric (e.g. float)

Returns: Metric type.

class dbnl.eval.metrics.RougeScoreType(value)

An enumeration.

FMEASURE = 'fmeasure'

PRECISION = 'precision'

RECALL = 'recall'

answer_quality_llm_accuracy

Computes the accuracy of the answer by evaluating the accuracy score of the answer using a language model.

This metric is generated by an LLM using a specific specific prompt named llm_accuracy available in dbnl.eval.metrics.prompts.

Parameters:
- input – input column name
- context – context column name
- prediction – prediction column name
- eval_llm_client – eval_llm_client
Returns: accuracy metric

answer_quality_llm_answer_correctness

Returns answer correctness metric.

This metric is generated by an LLM using a specific specific prompt named llm_answer_correctness available in dbnl.eval.metrics.prompts.

Parameters:
- input – input column name
- prediction – prediction column name
- target – target column name
- eval_llm_client – eval_llm_client
Returns: answer correctness metric

answer_quality_llm_answer_similarity

Returns answer similarity metric.

This metric is generated by an LLM using a specific specific prompt named llm_answer_similarity available in dbnl.eval.metrics.prompts.

Parameters:
- input – input column name
- prediction – prediction column name
- target – target column name
- eval_llm_client – eval_llm_client
Returns: answer similarity metric

answer_quality_llm_coherence

Computes the coherence of the answer by evaluating the coherence score of the answer using a language model.

This metric is generated by an LLM using a specific specific prompt named llm_coherence available in dbnl.eval.metrics.prompts.

Parameters:
- prediction – prediction column name
- eval_llm_client – eval_llm_client
Returns: coherence metric

answer_quality_llm_commital

Computes the commital of the answer by evaluating the commital score of the answer using a language model.

This metric is generated by an LLM using a specific specific prompt named llm_commital available in dbnl.eval.metrics.prompts.

Parameters:
- prediction – prediction column name
- eval_llm_client – eval_llm_client
Returns: commital metric

answer_quality_llm_completeness

Computes the completeness of the answer by evaluating the completeness score of the answer using a language model.

This metric is generated by an LLM using a specific specific prompt named llm_completeness available in dbnl.eval.metrics.prompts.

Parameters:
- input – input column name
- prediction – prediction column
- eval_llm_client – eval_llm_client
Returns: completeness metric

answer_quality_llm_contextual_relevance

Computes the contextual relevance of the answer by evaluating the contextual relevance score of the answer using a language model.

This metric is generated by an LLM using a specific specific prompt named llm_contextual_relevance available in dbnl.eval.metrics.prompts.

Parameters:
- input – input column name
- context – context column name
- eval_llm_client – eval_llm_client
Returns: contextual relevance metric

answer_quality_llm_faithfulness

Returns faithfulness metric.

This metric is generated by an LLM using a specific specific prompt named llm_faithfulness available in dbnl.eval.metrics.prompts.

Parameters:
- input – input column name
- context – context column name
- prediction – prediction column name
- eval_llm_client – eval_llm_client
Returns: faithfulness metric

answer_quality_llm_grammar_accuracy

Computes the grammar accuracy of the answer by evaluating the grammar accuracy score of the answer using a language model.

This metric is generated by an LLM using a specific specific prompt named llm_grammar_accuracy available in dbnl.eval.metrics.prompts.

Parameters:
- prediction – prediction column name
- eval_llm_client – eval_llm_client
Returns: grammar accuracy metric

answer_quality_llm_metrics

Returns a set of metrics which evaluate the quality of the generated answer. This does not include metrics that require a ground truth.

Parameters:
- input – input column name (i.e. question)
- prediction – prediction column name (i.e. generated answer)
- context – context column name (i.e. document or set of documents retrieved)
- eval_llm_client – eval_llm_client
Returns: list of metrics

answer_quality_llm_originality

Computes the originality of the answer by evaluating the originality score of the answer using a language model.

This metric is generated by an LLM using a specific specific prompt named llm_originality available in dbnl.eval.metrics.prompts.

Parameters:
- prediction – prediction column name
- eval_llm_client – eval_llm_client
Returns: originality metric

answer_quality_llm_relevance

Returns relevance metric with context.

This metric is generated by an LLM using a specific specific prompt named llm_relevance available in dbnl.eval.metrics.prompts.

Parameters:
- input – input column name
- context – context column name
- prediction – prediction column name
- eval_llm_client – eval_llm_client
Returns: answer relevance metric with context

answer_viability_llm_metrics

Returns a list of metrics relevant for a question and answer task.

Parameters:
- prediction – prediction column name (i.e. generated answer)
- eval_llm_client – eval_llm_client
Returns: list of metrics

answer_viability_llm_reading_complexity

Computes the reading complexity of the answer by evaluating the reading complexity score of the answer using a language model.

This metric is generated by an LLM using a specific specific prompt named llm_reading_complexity available in dbnl.eval.metrics.prompts.

Parameters:
- prediction – prediction column name
- eval_llm_client – eval_llm_client
Returns: reading complexity metric

answer_viability_llm_sentiment_assessment

Computes the sentiment of the answer by evaluating the sentiment assessment score of the answer using a language model.

This metric is generated by an LLM using a specific specific prompt named llm_sentiment_assessment available in dbnl.eval.metrics.prompts.

Parameters:
- prediction – prediction column name
- eval_llm_client – eval_llm_client
Returns: sentiment assessment metric

answer_viability_llm_text_fluency

Computes the text fluency of the answer by evaluating the perplexity of the answer using a language model.

This metric is generated by an LLM using a specific specific prompt named llm_text_fluency available in dbnl.eval.metrics.prompts.

Parameters:
- prediction – prediction column name
- eval_llm_client – eval_llm_client
Returns: text fluency metric

answer_viability_llm_text_toxicity

Computes the toxicity of the answer by evaluating the toxicity score of the answer using a language model.

This metric is generated by an LLM using a specific specific prompt named llm_text_toxicity available in dbnl.eval.metrics.prompts.

Parameters:
- prediction – prediction column name
- eval_llm_client – eval_llm_client
Returns: toxicity metric

automated_readability_index

Returns the Automated Readability Index metric for the text_col_name column.

Calculates the Automated Readability Index (ARI) for a given text. ARI is a readability metric that estimates the U.S. school grade level necessary to understand the text, based on the number of characters per word and words per sentence.

Parameters:text_col_name – text column name
Returns: automated_readability_index metric

bleu

Returns the bleu metric between the prediction and target columns.

The BLEU score is a metric for evaluating a generated sentence to a reference sentence. The BLEU score is a number between 0 and 1, where 1 means that the generated sentence is identical to the reference sentence.

Parameters:
- prediction – prediction column name
- target – target column name
Returns: bleu metric

character_count

Returns the character count metric for the text_col_name column.

Parameters:text_col_name – text column name
Returns: character_count metric

context_hit

Returns the context hit metric.

This boolean-valued metric is used to evaluate whether the ground truth document is present in the list of retrieved documents. The context hit metric is 1 if the ground truth document is present in the list of retrieved documents, and 0 otherwise.

Parameters:
- ground_truth_document_id – ground_truth_document_id column name
- retrieved_document_ids – retrieved_document_ids column name
Returns: context hit metric

count_metrics

Returns a set of metrics relevant for a question and answer task.

Parameters:text_col_name – text column name
Returns: list of metrics

flesch_kincaid_grade

Returns the Flesch-Kincaid Grade metric for the text_col_name column.

Calculates the Flesch-Kincaid Grade Level for a given text. The Flesch-Kincaid Grade Level is a readability metric that estimates the U.S. school grade level required to understand the text. It is based on the average number of syllables per word and words per sentence.

Parameters:text_col_name – text column name
Returns: flesch_kincaid_grade metric

ground_truth_non_llm_answer_metrics

Returns a set of metrics relevant for a question and answer task.

Parameters:
- prediction – prediction column name (i.e. generated answer)
- target – target column name (i.e. expected answer)
Returns: list of metrics

ground_truth_non_llm_retrieval_metrics

Returns a set of metrics relevant for a question and answer task.

Parameters:
- ground_truth_document_id – ground_truth_document_id column name
- retrieved_document_ids – retrieved_document_ids column name
Returns: list of metrics

inner_product_retrieval

Returns the inner product metric between the ground_truth_document_text and top_retrieved_document_text columns.

This metric is used to evaluate the similarity between the ground truth document and the top retrieved document using the inner product of their embeddings. The embedding client is used to retrieve the embeddings for the ground truth document and the top retrieved document. An embedding is a high-dimensional vector representation of a string of text.

Parameters:
- ground_truth_document_text – ground_truth_document_text column name
- top_retrieved_document_text – top_retrieved_document_text column name
- embedding_client – embedding client
Returns: inner product metric

inner_product_target_prediction

Returns the inner product metric between the prediction and target columns.

This metric is used to evaluate the similarity between the prediction and target columns using the inner product of their embeddings. The embedding client is used to retrieve the embeddings for the prediction and target columns. An embedding is a high-dimensional vector representation of a string of text.

Parameters:
- prediction – prediction column name
- target – target column name
- embedding_client – embedding client
Returns: inner product metric

levenshtein

Returns the levenshtein metric between the prediction and target columns.

The Levenshtein distance is a metric for evaluating the similarity between two strings. The Levenshtein distance is an integer value, where 0 means that the two strings are identical, and a higher value returns the number of edits required to transform one string into the other.

Parameters:
- prediction – prediction column name
- target – target column name
Returns: levenshtein metric

mrr

Returns the mean reciprocal rank (MRR) metric.

This metric is used to evaluate the quality of a ranked list of documents. The MRR score is a number between 0 and 1, where 1 means that the ground truth document is ranked first in the list. The MRR score is calculated by taking the reciprocal of the rank of the first relevant document in the list.

Parameters:
- ground_truth_document_id – ground_truth_document_id column name
- retrieved_document_ids – retrieved_document_ids column name
Returns: mrr metric

non_llm_non_ground_truth_metrics

Returns a set of metrics relevant for a question and answer task.

Parameters:prediction – prediction column name (i.e. generated answer)
Returns: list of metrics

quality_llm_text_similarity

Computes the similarty of the prediction and target text by evaluating using a language model.

This metric is generated by an LLM using a specific specific prompt named llm_text_similarity available in dbnl.eval.metrics.prompts.

Parameters:
- prediction – prediction column name
- eval_llm_client – eval_llm_client
Returns: similarity metric

question_and_answer_metrics

Returns a set of metrics relevant for a question and answer task.

Parameters:
- prediction – prediction column name (i.e. generated answer)
- target – target column name (i.e. expected answer)
- input – input column name (i.e. question)
- context – context column name (i.e. document or set of documents retrieved)
- ground_truth_document_id – ground_truth_document_id containing the information in the target
- retrieved_document_ids – retrieved_document_ids containing the full context
- ground_truth_document_text – text containing the information in the target (ideal is for this to be the top retrieved document)
- top_retrieved_document_text – text of the top retrieved document
- eval_llm_client – eval_llm_client
- eval_embedding_client – eval_embedding_client
Returns: list of metrics

question_and_answer_metrics_extended

Returns a set of all metrics relevant for a question and answer task.

Parameters:
- prediction – prediction column name (i.e. generated answer)
- target – target column name (i.e. expected answer)
- input – input column name (i.e. question)
- context – context column name (i.e. document or set of documents retrieved)
- ground_truth_document_id – ground_truth_document_id containing the information in the target
- retrieved_document_ids – retrieved_document_ids containing the full context
- ground_truth_document_text – text containing the information in the target (ideal is for this to be the top retrieved document)
- top_retrieved_document_text – text of the top retrieved document
- eval_llm_client – eval_llm_client
- eval_embedding_client – eval_embedding_client
Returns: list of metrics

rouge1

Returns the rouge1 metric between the prediction and target columns.

ROUGE-1 is a recall-oriented metric that calculates the overlap of unigrams (individual words) between the predicted/generated summary and the reference summary. It measures how many single words from the reference summary appear in the predicted summary. ROUGE-1 focuses on basic word-level similarity and is used to evaluate the content coverage.

Parameters:
- prediction – prediction column name
- target – target column name
Returns: rouge1 metric

rouge2

Returns the rouge2 metric between the prediction and target columns.

ROUGE-2 is a recall-oriented metric that calculates the overlap of bigrams (pairs of words) between the predicted/generated summary and the reference summary. It measures how many pairs of words from the reference summary appear in the predicted summary. ROUGE-2 focuses on word-level similarity and is used to evaluate the content coverage.

Parameters:
- prediction – prediction column name
- target – target column name
Returns: rouge2 metric

rougeL

Returns the rougeL metric between the prediction and target columns.

ROUGE-L is a recall-oriented metric based on the Longest Common Subsequence (LCS) between the reference and generated summaries. It measures how well the generated summary captures the longest sequences of words that appear in the same order in the reference summary. This metric accounts for sentence-level structure and coherence.

Parameters:
- prediction – prediction column name
- target – target column name
Returns: rougeL metric

rougeLsum

Returns the rougeLsum metric between the prediction and target columns.

ROUGE-LSum is a variant of ROUGE-L that applies the Longest Common Subsequence (LCS) at the sentence level for summarization tasks. It evaluates how well the generated summary captures the overall sentence structure and important elements of the reference summary by computing the LCS for each sentence in the document.

Parameters:
- prediction – prediction column name
- target – target column name
Returns: rougeLsum metric

rouge_metrics

Returns all rouge metrics between the prediction and target columns.

Parameters:
- prediction – prediction column name
- target – target column name
Returns: list of rouge metrics

sentence_count

Returns the sentence count metric for the text_col_name column.

Parameters:text_col_name – text column name
Returns: sentence_count metric

summarization_metrics

Returns a set of metrics relevant for a summarization task.

Parameters:
- prediction – prediction column name (i.e. generated summary)
- target – target column name (i.e. expected summary)
Returns: list of metrics

text_metrics

Returns a set of metrics relevant for a generic text application

Parameters:
- prediction – prediction column name (i.e. generated text)
- target – target column name (i.e. expected text)
Returns: list of metrics

text_monitor_metrics

token_count

Returns the token count metric for the text_col_name column.

A token is a sequence of characters that represents a single unit of meaning, such as a word or punctuation mark. The token count metric calculates the total number of tokens in the text. Different languages may have different tokenization rules. This function is implemented using the spaCy library.

Parameters:text_col_name – text column name
Returns: token_count metric

word_count

Returns the word count metric for the text_col_name column.

Parameters:text_col_name – text column name
Returns: word_count metric

dbnl

close_run

Mark the specified dbnl Run status as closed. A closed run is finalized and considered complete. Once a Run is marked as closed, it can no longer be used for reporting Results.

Note that the Run will not be closed immediately. It will transition into a closing state and will be closed in the background. If wait_for_close is set to True, the function will block for up to 3 minutes until the Run is closed.

Parameters:
- run – The Run to be closed
- wait_for_close – If True, the function will block for up to 3 minutes until the Run is closed, defaults to True
Raises:
- DBNLNotLoggedInError – dbnl SDK is not logged in
- DBNLInputValidationError – Input does not conform to expected format
- DBNLError – Run did not close after waiting for 3 minutes

IMPORTANT

A run must be closed for uploaded results to be shown on the UI.

copy_project

Copy a Project; a convenience method wrapping exporting and importing a project with a new name and description

Parameters:
- project – The project to copy
- name – A name for the new Project
- description – An optional description for the new Project. Description is limited to 255 characters.
Raises:
- DBNLNotLoggedInError – dbnl SDK is not logged in
- DBNLInputValidationError – Input does not conform to expected format
- DBNLConflictingProjectError – Project with the same name already exists
Returns: The newly created Project

Examples:

create_metric

Create a new DBNL Metric

Parameters:
- project – DBNL Project to create the Metric for
- name – Name for the Metric
- expression_template – Expression template string e.g. token_count({RUN}.question)
- description – Optional description of what computation the metric is performing
- greater_is_better – Flag indicating whether greater values are semantically ‘better’ than lesser values
Raises:
- DBNLNotLoggedInError – dbnl SDK is not logged in
- DBNLInputValidationError – Input does not conform to expected format
Returns: Created Metric

create_project

Create a new Project

Parameters:
- name – Name for the Project
- description – Description for the DBNL Project, defaults to None. Description is limited to 255 characters.
Raises:
- DBNLNotLoggedInError – dbnl SDK is not logged in
- DBNLAPIValidationError – DBNL API failed to validate the request
- DBNLConflictingProjectError – Project with the same name already exists
Returns: Project

Examples:

create_run

Create a new Run

Parameters:
- project – The Project this Run is associated with.
- run_schema – The schema for data that will be associated with this run. DBNL will validate data you upload against this schema.
- display_name – An optional display name for the Run, defaults to None. display_name does not have to be unique.
- metadata – Additional key-value pairs you want to track, defaults to None.
- run_config – (Deprecated) Do not use. Use run_schema instead.
Raises:
- DBNLNotLoggedInError – dbnl SDK is not logged in
- DBNLInputValidationError – Input does not conform to expected format
Returns: Newly created Run

create_run_config

(Deprecated) Please see create_run_schema instead.

Parameters:
- project – DBNL Project this RunConfig is associated to
- columns – List of column schema specs for the uploaded data, required keys name and type, optional key component, description and greater_is_better. type can be int, float, category, boolean, or string. component is a string that indicates the source of the data. e.g. “component” : “sentiment-classifier” or “component” : “fraud-predictor”. Specified components must be present in the components_dag dictionary. greater_is_better is a boolean that indicates if larger values are better than smaller ones. False indicates smaller values are better. None indicates no preference. An example RunConfig columns: columns=[{“name”: “pred_proba”, “type”: “float”, “component”: “fraud-predictor”}, {“name”: “decision”, “type”: “boolean”, “component”: “threshold-decision”}, {“name”: “error_type”, “type”: “category”}]
- scalars –
  List of scalar schema specs for the uploaded data, required keys name and type, optional key component, description and greater_is_better. : type can be int, float, category, boolean, or string. component is a string that indicates the source of the data. e.g. “component” : “sentiment-classifier” or “component” : “fraud-predictor”. Specified components must be present in the components_dag dictionary.
  greater_is_better is a boolean that indicates if larger values are better than smaller ones. False indicates smaller values are better. None indicates no preference.
  An example RunConfig scalars: scalars=[{“name”: “accuracy”, “type”: “float”, “component”: “fraud-predictor”}, {“name”: “error_type”, “type”: “category”}]
- description – Description for the DBNL RunConfig, defaults to None. Description is limited to 255 characters.
- display_name – Display name for the RunConfig, defaults to None. display_name does not have to be unique.
- row_id – List of column names that are the unique identifier, defaults to None.
- components_dag – Optional dictionary representing the DAG of components, defaults to None. eg : {“fraud-predictor”: [‘threshold-decision”], “threshold-decision”: []},
Raises:
- DBNLNotLoggedInError – dbnl SDK is not logged in
- DBNLInputValidationError – Input does not conform to expected format
Returns: RunConfig with the desired columns schema

create_run_config_from_results

(Deprecated) Please see create_run_schema_from_results instead.

Parameters:
- project – DBNL Project to create the RunConfig for
- column_data – DataFrame with the results for the columns
- scalar_data – Dictionary or DataFrame with the results for the scalars, defaults to None
- description – Description for the RunConfig, defaults to None
- display_name – Display name for the RunConfig, defaults to None
- row_id – List of column names that are the unique identifier, defaults to None
Raises:
- DBNLNotLoggedInError – dbnl SDK is not logged in
- DBNLInputValidationError – Input does not conform to expected format
Returns: RunConfig with the desired schema for columns and scalars, if provided

create_run_query

Create a new RunQuery for a project to use as a baseline Run. Currently supports key=”offset_from_now” with value as a positive integer, representing the number of runs to go back for the baseline. For example, query={“offset_from_now”: 1} will use the latest run as the baseline, so that each run compares against the previous run.

Parameters:
- project – The Project to create the RunQuery for
- name – A name for the RunQuery
- query – A dict describing how to find a Run dynamically. Currently, only supports “offset_from_now”: int as a key-value pair.
Raises:
- DBNLNotLoggedInError – dbnl SDK is not logged in
- DBNLInputValidationError – Input does not conform to expected format
Returns: A new dbnl RunQuery, typically used for finding a Dynamic Baseline for a Test Session

Examples:

create_run_schema

Create a new RunSchema

Parameters:
- columns – List of column schema specs for the uploaded data, required keys name and type, optional keys component, description and greater_is_better.
- scalars – List of scalar schema specs for the uploaded data, required keys name and type, optional keys component, description and greater_is_better.
- index – Optional list of column names that are the unique identifier.
- components_dag – Optional dictionary representing the DAG of components.
Returns: The RunSchema

Supported Types

int
float
boolean
string
category
list

Components

The optional component key is for specifying the source of the data column in relationship to the AI/ML app subcomponents. Components are used in visualizing the components DAG.

The components_dag dictionary specifies the topological layout of the AI/ML app. For each key-value pair, the key represents the source component, and the value is a list of the leaf components. The following code snippet describes the DAG shown above.

Examples:

Basic

With `scalars`, `index`, and `components_dag`

create_run_schema_from_results

Create a new RunSchema from the column results, as well as scalar results if provided

Parameters:
- column_data – A pandas DataFrame with all the column results for which we want to generate a RunSchema.
- scalar_data – A dict or pandas DataFrame with all the scalar results for which we want to generate a RunSchema.
- index – An optional list of the column names that can be used as unique identifiers.
Raises:DBNLInputValidationError – Input does not conform to expected format
Returns: The RunSchema based on the provided results

Examples:

create_test_session

Create a new TestSession with the given Run as the Experiment Run, and the given Run or RunQuery as the baseline if provided

Parameters:
- experiment_run – The Run to create the TestSession for
- baseline – The Run or RunQuery to use as the Baseline Run, defaults to None. If None, the Baseline set for the Project is used.
- include_tags – Optional list of Test Tag names to include in the Test Session.
- exclude_tags – Optional list of Test Tag names to exclude in the Test Session.
- require_tags – Optional list of Test Tag names to require in the Test Session.
Raises:
- DBNLNotLoggedInError – dbnl SDK is not logged in
- DBNLInputValidationError – Input does not conform to expected format
Returns: The newly created TestSession

Calling this will start evaluating Tests associated with a Run. Typically, the Run you just completed will be the “Experiment” and you’ll compare it to some earlier “Baseline Run”.

IMPORTANT

Referenced Runs must already be closed before a Test Session can begin.

Managing Tags

Suppose we have the following Tests with the associated Tags in our Project

Test1 with tags [“A”, “B”]
Test2 with tags [“A”]
Test3 with tags [“B”]

include_tags=[“A”, “B”] will trigger Tests 1, 2, and 3. require_tags=[“A”, “B”] will only trigger Test 1. exclude_tags=[“A”] will only trigger Test 3. include_tags=[“A”] and exclude_tags=[“B”] will only trigger Test 2.

Examples:

delete_metric

Delete a DBNL Metric by ID

Parameters:metric_id – ID of the metric to delete
Raises:
- DBNLNotLoggedInError – dbnl SDK is not logged in
- DBNLAPIValidationError – DBNL API failed to validate the request
Returns: None

export_project_as_json

Export a Project alongside its Test Specs, Tags, and Notification Rules as a JSON object

Parameters:project – The Project to export as JSON.
Raises:DBNLNotLoggedInError – dbnl SDK is not logged in
Returns: JSON object representing the Project

Sample Project JSON

Examples:

get_column_results

Get column results for a Run

Parameters:run – The Run from which to retrieve the results.
Raises:
- DBNLNotLoggedInError – dbnl SDK is not logged in
- DBNLInputValidationError – Input does not conform to expected format
- DBNLDownloadResultsError – Failed to download results (e.g. Run is not closed)
Returns: A pandas DataFrame of the column results for the Run.

IMPORTANT

You can only retrieve results for a Run that has been closed.

Examples:

get_latest_run

Get the latest Run for a project

Raises:
- DBNLNotLoggedInError – dbnl SDK is not logged in
- DBNLResourceNotFoundError – Run not found
Parameters:project – The Project to get the latest Run for
Returns: The latest Run

get_latest_run_config

(Deprecated) Please see get_latest_run and access the schema attribute instead.

Raises:
- DBNLNotLoggedInError – dbnl SDK is not logged in
- DBNLResourceNotFoundError – RunConfig not found
Parameters:project – DBNL Project to get the latest RunConfig for
Returns: Latest RunConfig

get_metric_by_id

Get a DBNL Metric by ID

Parameters:metric_id – ID of the metric to get
Raises:
- DBNLNotLoggedInError – dbnl SDK is not logged in
- DBNLAPIValidationError – DBNL API failed to validate the request
Returns: The requested metric

get_my_namespaces

Get all the namespaces that the user has access to

Raises:DBNLNotLoggedInError – dbnl SDK is not logged in
Returns: List of namespaces

get_or_create_project

Get the Project with the specified name or create a new one if it does not exist

Parameters:
- name – Name for the Project
- description – Description for the DBNL Project, defaults to None
Raises:
- DBNLNotLoggedInError – dbnl SDK is not logged in
- DBNLAPIValidationError – DBNL API failed to validate the request
Returns: Newly created or matching existing Project

Examples:

get_project

Retrieve a Project by name.

Parameters:name – The name for the existing Project.
Raises:
- DBNLNotLoggedInError – dbnl SDK is not logged in
- DBNLProjectNotFoundError – Project with the given name does not exist.
Returns: Project

Examples:

get_results

Get all results for a Run

Parameters:run – The Run from which to retrieve the results.
Raises:
- DBNLNotLoggedInError – dbnl SDK is not logged in
- DBNLInputValidationError – Input does not conform to expected format
- DBNLDownloadResultsError – Failed to download results (e.g. Run is not closed)
Returns: A named tuple comprised of columns and scalars fields. These are the pandas DataFrames of the uploaded data for the Run.

IMPORTANT

You can only retrieve results for a Run that has been closed.

Examples:

get_run

Retrieve a Run with the given ID

Parameters:run_id – The ID of the dbnl Run. Run ID starts with the prefix run_. Run ID can be found at the Run detail page.
Raises:
- DBNLNotLoggedInError – dbnl SDK is not logged in
- DBNLInputValidationError – Input does not conform to expected format
- DBNLRunNotFoundError – A Run with the given ID does not exist.
Returns: The Run with the given run_id.

Examples:

get_run_config

(Deprecated) Please access Run.schema instead.

Parameters:run_config_id – The ID of the DBNL RunConfig to retrieve
Raises:
- DBNLNotLoggedInError – dbnl SDK is not logged in
- DBNLInputValidationError – Input does not conform to expected format
Returns: RunConfig with the given run_config_id

get_run_config_from_latest_run

(Deprecated) Please see get_latest_run and access the schema attribute instead.

Raises:
- DBNLNotLoggedInError – dbnl SDK is not logged in
- DBNLResourceNotFoundError – RunConfig not found
Parameters:project – DBNL Project to get the latest RunConfig for
Returns: RunConfig from the latest Run

get_run_query

Retrieve a DBNL RunQuery with the given name, unique to a project

Parameters:
- project – The Project from which to retrieve the RunQuery.
- name – The name of the RunQuery to retrieve.
Raises:
- DBNLNotLoggedInError – dbnl SDK is not logged in
- DBNLRessourceNotFoundError – RunQuery not found
Returns: RunQuery with the given name.

Examples:

get_scalar_results

Get scalar results for a Run

Parameters:run – The Run from which to retrieve the scalar results.
Raises:
- DBNLNotLoggedInError – dbnl SDK is not logged in
- DBNLInputValidationError – Input does not conform to expected format
- DBNLDownloadResultsError – Failed to download results (e.g. Run is not closed)
Returns: A pandas DataFrame of the scalar results for the Run.

IMPORTANT

You can only retrieve results for a Run that has been closed.

Examples:

import_project_from_json

Create a new Project from a JSON object

Parameters:params – JSON object representing the Project, generally based on a Project exported via export_project_as_json(). See export_project_as_json() for the expected format.
Raises:
- DBNLNotLoggedInError – dbnl SDK is not logged in
- DBNLAPIValidationError – DBNL API failed to validate the request
- DBNLConflictingProjectError – Project with the same name already exists
Returns: Project created from the JSON object

Examples:

Setup dbnl SDK to make authenticated requests. After login is run successfully, the dbnl client will be able to issue secure and authenticated requests against hosted endpoints of the dbnl service.

Parameters:
- api_token – DBNL API token for authentication; token can be found at /tokens page of the DBNL app. If None is provided, the environment variable DBNL_API_TOKEN will be used by default.
- namespace_id – DBNL namespace ID to use for the session; available namespaces can be found with get_my_namespaces().
- api_url – The base url of the Distributional API. For SaaS users, set this variable to api.dbnl.com. For other users, please contact your sys admin. If None is provided, the environment variable DBNL_API_URL will be used by default.
- app_url – An optional base url of the Distributional app. If this variable is not set, the app url is inferred from the DBNL_API_URL variable. For on-prem users, please contact your sys admin if you cannot reach the Distributional UI.

report_column_results

Report all column results to dbnl

Parameters:
- run – The Run that the results will be reported to
- data – A pandas DataFrame with all the results to report to dbnl. The columns of the DataFrame must match the columns of the Run’s schema.
Raises:
- DBNLNotLoggedInError – dbnl SDK is not logged in
- DBNLInputValidationError – Input does not conform to expected format

IMPORTANT

All data should be reported to dbnl at once. Calling dbnl.report_column_results more than once will overwrite the previously uploaded data.

WARNING

Once a Run is closed, you can no longer call report_column_results to send data to dbnl.

Examples:

report_results

Report all results to dbnl

Parameters:
- run – The Run that the results will be reported to
- column_data – A pandas DataFrame with all the results to report to dbnl. The columns of the DataFrame must match the columns of the Run’s schema.
- scalar_data – A dictionary or single-row pandas DataFrame with the scalar results to report to dbnl, defaults to None.
Raises:
- DBNLNotLoggedInError – dbnl SDK is not logged in
- DBNLInputValidationError – Input does not conform to expected format

IMPORTANT

All data should be reported to dbnl at once. Calling dbnl.report_results more than once will overwrite the previously uploaded data.

WARNING

Once a Run is closed, you can no longer call report_results to send data to dbnl.

Examples:

report_run_with_results

Create a new Run, report results to it, and close it.

Parameters:
- project – The Project to create the Run in.
- column_data – A pandas DataFrame with the results for the columns.
- scalar_data – An optional dictionary or DataFrame with the results for the scalars, if any.
- display_name – An optional display name for the Run.
- index – An optional list of column names to use as the unique identifier for rows in the column data.
- run_schema – An optional RunSchema to use for the Run. Will be inferred from the data if not provided.
- metadata – Any additional key:value pairs you want to track.
- wait_for_close – If True, the function will block for up to 3 minutes until the Run is closed, defaults to True.
- row_id – (Deprecated) Do not use. Use index instead.
- run_config_id – (Deprecated) Do not use. Use run_schema instead.
Raises:
- DBNLNotLoggedInError – dbnl SDK is not logged in
- DBNLInputValidationError – Input does not conform to expected format
Returns: The closed Run with the uploaded data.

IMPORTANT

If no schema is provided, the schema will be inferred from the data. If provided, the schema will be used to validate the data.

Examples:

Implicit Schema

Explicit Schema

report_run_with_results_and_start_test_session

Create a new Run, report results to it, and close it. Wait for close to finish and start a TestSession with the given inputs.

Parameters:
- project – The Project to create the Run in.
- column_data – A pandas DataFrame with the results for the columns.
- scalar_data – An optional dictionary or DataFrame with the results for the scalars, if any.
- display_name – An optional display name for the Run.
- index – An optional list of column names to use as the unique identifier for rows in the column data.
- run_schema – An optional RunSchema to use for the Run. Will be inferred from the data if not provided.
- metadata – Any additional key:value pairs you want to track.
- wait_for_close – If True, the function will block for up to 3 minutes until the Run is closed, defaults to True.
- baseline – DBNL Run or RunQuery to use as the baseline run, defaults to None. If None, the baseline defined in the TestConfig is used.
- include_tags – Optional list of Test Tag names to include in the Test Session.
- exclude_tags – Optional list of Test Tag names to exclude in the Test Session.
- require_tags – Optional list of Test Tag names to require in the Test Session.
- row_id – (Deprecated) Do not use. Use index instead.
- run_config_id – (Deprecated) Do not use. Use run_schema instead.
Raises:
- DBNLNotLoggedInError – dbnl SDK is not logged in
- DBNLInputValidationError – Input does not conform to expected format
Returns: The closed Run with the uploaded data.

IMPORTANT

If no schema is provided, the schema will be inferred from the data. If provided, the schema will be used to validate the data.

Examples:

report_scalar_results

Report scalar results to dbnl

Parameters:
- run – The Run that the scalars will be reported to
- data – A dictionary or single-row pandas DataFrame with the scalar results to report to dbnl.
Raises:
- DBNLNotLoggedInError – dbnl SDK is not logged in
- DBNLInputValidationError – Input does not conform to expected format

IMPORTANT

All data should be reported to dbnl at once. Calling dbnl.report_scalar_results more than once will overwrite the previously uploaded data.

WARNING

Once a Run is closed, you can no longer call report_scalar_results to send data to dbnl.

Examples:

set_run_as_baseline

Set the given Run as the Baseline Run in the Project’s Test Config

Parameters:run – The Run to set as the Baseline Run.
Raises:DBNLResourceNotFoundError – If the test configurations are not found for the project.

set_run_query_as_baseline

Set a given RunQuery as the Baseline Run in a Project’s Test Config

Parameters:run_query – The RunQuery to set as the Baseline RunQuery.
Raises:DBNLResourceNotFoundError – If the test configurations are not found for the project.

wait_for_run_close

Wait for a Run to close. Polls every polling_interval_s seconds until it is closed.

Parameters:
- run – Run to wait for
- timeout_s – Total wait time (in seconds) for Run to close, defaults to 180.0
- polling_interval_s – Time between polls (in seconds), defaults to 3.0
Raises:
- DBNLNotLoggedInError – dbnl SDK is not logged in
- DBNLError – Run did not close after waiting for the timeout_s seconds