Getting Access to Distributional

Want access to the Distributional platform? Please reach out to our team. We'll guide you through the process and ensure you have everything you need to get started.

While we offer SaaS and a sandbox deployment for testing purposes, neither is suitable for a production environment. We recommend our self-hosted deployment option if you plan on deploying the dbnl platform directly in your cloud or on-premise environment.

Quickstart

Get started with dbnl

This guide walks you through using the Distributional SDK to create your first project, submit two runs and create a test session to compare the behavior of your AI application over time.

1. Install the dbnl SDK

First, if you haven't done so already, install the dbnl Python SDK.

pip install dbnl
2. Export the dbnl environment variables

If you haven't done so already, create a personal access token by going to ☰ > Personal Access Tokens.

Run this command to export the path to the API for your deployment and your personal access token as environment variables.

export DBNL_API_URL="YOUR_DBNL_API_URL"
export DBNL_API_TOKEN="YOUR_PERSONAL_ACCESS_TOKEN"
3. Create and test two Runs

In a Python script or notebook, add the following code to create two Runs and test them in a Test Session using the first run as the baseline and the second run as the experiment.

import random
from datetime import datetime

import dbnl
import pandas as pd

# Login to dbnl.
dbnl.login()

# Create a new project
now = datetime.now().isoformat()
project = dbnl.get_or_create_project(name=f"quickstart-{now}")

# Submit a first run (baseline)
run1 = dbnl.report_run_with_results(
    project=project,
    display_name="run1",
    column_data=pd.DataFrame([
        {
            "question": f"Is {i} an even or odd number?",
            "answer": random.choice(["even", "odd"]),
        }
        for i in range(20)
    ]).astype({"question": "string", "answer": "category"}),
)

# Submit a second run (experiment)
run2 = dbnl.report_run_with_results(
    project=project,
    display_name="run2",
    column_data=pd.DataFrame([
        {
            "question": f"Is {i} an even or odd number?",
            "answer": random.choice(["even", "odd"]),
        }
        for i in range(20)
    ]).astype({"question": "string", "answer": "category"}),
)

# Create a test session.
dbnl.set_run_as_baseline(run=run1)
dbnl.create_test_session(experiment_run=run2)
4. View your Test Session results

Congratulations! You ran your first Test Session. You can see the results of the Test Session by navigating to your project in the dbnl app and selecting your test session from the test session table.

By default, a similarity index test is added that tests whether your application has changed between the baseline and experiment run.

Next Steps

  • Create a Project for your own AI application.

  • Upload your own data as Runs to your Project.

  • Define Metrics to augment your Runs with novel quantities.

  • Add more Tests to ensure your application behaves as expected.

  • Learn more about the Similarity Index.

  • Use Notifications to be alerted when tests fail.

For access to the Distributional platform, please reach out to our team.

Overview

Distributional's adaptive testing platform

Distributional is an adaptive testing platform purpose-built for AI applications. It enables you to test AI application data at scale to define, understand, and improve your definition of AI behavior to ensure consistency and stability over time.

Adaptive Testing Workflow

Define Desired Behavior: Automatically create a behavioral fingerprint from the app's runtime logs and any existing development metrics, and generate associated tests to detect changes in that behavior over time.

Understand Changes in Behavior: Get alerted when there are changes to app behavior, understand what is changing, and pinpoint at any level of depth what is causing the change to quickly take appropriate action.

Improve Based on Changes: Easily add, remove, or recalibrate tests over time so you always have a dynamic representation of desired state that you can use to test new models, roll out new upgrades, or accelerate new app development.

Integrating Distributional with Other Tools

Distributional's platform is designed to integrate easily with your existing infrastructure, including data stores, orchestrators, alerting tools, and AI platforms. If you are already using a model evaluation framework as part of app development, those can be used as an input to further define behavior in Distributional.

Ready to start using Distributional? Head straight to our Quickstart to get set up on the platform and start testing your AI application.

Why We Test Data Distributions

Adaptive testing requires a very different approach than traditional software testing. The goal of adaptive testing is to enable teams to define a steady baseline state for any AI application, and through testing, confirm that it maintains steady state, and where it deviates, figure out what needs to evolve or be fixed to reach steady state once again. This process needs to be discoverable, logged, organized, consistent, integrated and scalable.

Testing AI applications needs to be fundamentally reimagined to include statistical tests on distributions of quantities to detect meaningful shifts that warrant deeper investigation.

  • Distributions > Summary Statistics: Instead of only looking at summary statistics (e.g. mean, median, P90), we need to analyze distributions of metrics, over time. This accounts for the inherent variability in AI systems while maintaining statistical rigor.

Why is this useful? Imagine you have an application that contains an LLM and you want to make sure that the latency of the LLM remains low and consistent across different types of queries. With a traditional monitoring tool, you might be able to easily monitor P90 and P50 values for latency. P50 represents the latency value below which 50% of the requests fall and will give you a sense of the typical (median) response time that users can expect from the system. However, the P50 value for a normal distribution and bimodal distribution can be the same value, even though the shape of the distribution is meaningfully different. This can hide significant (usage-based or system-based) changes in the application that affect the distribution of the latency scores. If you don’t examine the distribution, these changes go unseen.

Consider a scenario where the distribution of LLM latency started with a normal distribution, but due to changes in a third-party data API that your app uses to inform the response of the LLM, the latency distribution becomes bimodal, though with the same median (and P90 values) as before. What could cause this? Here’s a practical example of how something like this could happen. The engineering team of the data API organization made an optimization to their API which allows them to return faster responses for a specific subset of high value queries, and routes the remainder of the API calls to a different server which has a slower response rate.

The effect on your application is that half of your users now experience improved latency, while a large number of users experience "too much" latency, resulting in an inconsistent performance experience across users. Solutions to this particular example include modifying the prompt, switching to a different data provider, formatting the information that you send to the API differently, or a number of other engineering changes. If you are not concerned about the shift and can accept the new steady state of the application, you can also choose not to make changes and declare a new acceptable baseline for the latency P50 value.
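To make this concrete, here is a small illustrative sketch (not part of the Distributional SDK; the numbers are synthetic and the use of numpy is an assumption for illustration only) showing two latency samples with nearly identical medians but very different shapes:

import numpy as np

rng = np.random.default_rng(0)

# Baseline: unimodal latency centered around 200 ms
baseline = rng.normal(loc=200, scale=20, size=10_000)

# After the upstream API change: half of the calls are fast, half are slow,
# yet the median barely moves
experiment = np.concatenate([
    rng.normal(loc=120, scale=15, size=5_000),
    rng.normal(loc=280, scale=15, size=5_000),
])

# The P50 values are nearly identical...
print(np.percentile(baseline, 50), np.percentile(experiment, 50))

# ...but the spread of the two distributions is very different, which only a
# distributional comparison (e.g. histograms or a nonparametric statistic) reveals
print(np.std(baseline), np.std(experiment))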

Install the Python SDK

Installing the Python SDK and Accessing the Distributional UI

The dbnl SDK supports Python versions 3.9-3.12. You can install the latest release of the SDK, install a specific release, or install the eval extra with the following commands on Linux or macOS.

Installing Distributional

1. Latest Stable Release

To install the latest stable release of the dbnl package:

pip install dbnl

2. Specific Release

To install a specific version (e.g., version 0.22.0):

pip install "dbnl==0.22.0"

3. Installing with the eval Extra

The dbnl.eval extra includes additional features and requires an external spaCy model.

3.1. Install the Required spaCy Model

To install the required en_core_web_sm pretrained English-language NLP model for spaCy:

pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl

3.2. Install dbnl with the eval Extra

To install dbnl with evaluation extras:

pip install "dbnl[eval]"

If you need a specific version with evaluation extras (e.g., version 0.22.0):

pip install "dbnl[eval]==0.22.0"

4. Accessing the Distributional UI and API Token

You should have already received an invite email from the Distributional team to create your account. If that is not the case, please reach out to your Distributional contact. You can access and/or generate your token at https://app.dbnl.com/tokens (which will prompt you to log in if you are not already).

We recommend setting your API token as an environment variable; see below.

5. Environment Variables

DBNL has three reserved environment variables that it reads in before execution.

DBNL_API_TOKEN: The API token used to authenticate your dbnl account.

DBNL_API_URL: The base URL of the Distributional API. For SaaS users, set this variable to api.dbnl.com. For other users, please contact your sys admin.

DBNL_APP_URL: An optional base URL of the Distributional app. If this variable is not set, the app URL is inferred from the DBNL_API_URL variable. For on-prem users, please contact your sys admin if you cannot reach the Distributional UI.

Set up for various deployment types

DBNL has three available deployment types: SaaS, Sandbox, and Full On-Premise.

# Run the following commands in your terminal. Make sure to wrap the API token in quotes.
export DBNL_API_TOKEN="copy_paste_dbnl_api_token"

# SaaS
export DBNL_API_URL="api.dbnl.com"
export DBNL_APP_URL="app.dbnl.com"

# Sandbox
export DBNL_API_URL="localhost:8080/api"
export DBNL_APP_URL="localhost:8080"

# Full On-Premise
export DBNL_API_URL="<CUSTOMER_SPECIFIC_API_URL>"
export DBNL_APP_URL="<CUSTOMER_SPECIFIC_APP_URL>"

Version Matching Requirements

DBNL provides different versions of the API and SDK, and ensuring compatibility is critical for proper functionality: SDK and API server versions must match on major and minor version numbers.

To check your SDK version:

import dbnl
print(dbnl.__version__)

To check your API server version:

  • Log into the web app

  • Click the hamburger menu (☰) in the top-left corner

  • View the version number listed in the footer

For access to the Distributional platform, please reach out to our team.


Metrics

What are Metrics?

Metrics are measurable properties that help quantify specific characteristics of your data. Metrics can be user-defined, by providing a numeric column computed from your source data alongside your application data.

Alternatively, the Distributional SDK offers a comprehensive set of metrics for evaluating various aspects of text and LLM outputs. Using Distributional's methods for computing metrics will enable better data-exploration and application stability monitoring capabilities.

Using Metrics

The SDK provides convenient functions for computing metrics from your data and reporting the results to Distributional:

import dbnl
import dbnl.eval
import pandas as pd

# login to dbnl
dbnl.login()
project = dbnl.create_project(name="Metrics Project")

df = pd.DataFrame(
    {
        "id": [1, 2, 3],
        "question": [
            "What is the meaning of life?",
            "What is the airspeed velocity of an unladen swallow?",
            "What is the capital of Assyria?",
        ],
        "answer": [
            "To be happy and fulfilled.",
            "It's a question of aerodynamics.",
            "Nineveh was the capital of Assyria.",
        ],
        "expected_answer": [
            "42",
            "It's a question of aerodynamics.",
            "Nineveh was the capital of Assyria.",
        ],
    }
)

# Create individual metrics
metrics = [
    dbnl.eval.metrics.token_count("question"),
    dbnl.eval.metrics.word_count("question"),
    dbnl.eval.metrics.rouge1("answer", "expected_answer"),
]

# Compute metrics and report results to Distributional
run = dbnl.eval.report_run_with_results(
    project=project, column_data=df, metrics=metrics
)

Convenience Functions

The SDK includes helper functions for creating common groups of related metrics based on consistent inputs.

import dbnl
import dbnl.eval
import pandas as pd

# login to dbnl
dbnl.login()
project = dbnl.create_project(name="Metrics Project")

df = pd.DataFrame(
    {
        "id": [1, 2, 3],
        "question": [
            "What is the meaning of life?",
            "What is the airspeed velocity of an unladen swallow?",
            "What is the capital of Assyria?",
        ],
        "answer": [
            "To be happy and fulfilled.",
            "It's a question of aerodynamics.",
            "Nineveh was the capital of Assyria.",
        ],
        "expected_answer": [
            "42",
            "It's a question of aerodynamics.",
            "Nineveh was the capital of Assyria.",
        ],
    }
)

# Get standard text evaluation metrics
text_eval_metrics = dbnl.eval.metrics.text_metrics(
    prediction="answer", target="expected_answer"
)

# Get comprehensive QA evaluation metrics
qa_metrics = dbnl.eval.metrics.question_and_answer_metrics(
    prediction="answer",
    target="expected_answer",
    input="question",
)

# Compute metrics and report results to Distributional
run = dbnl.eval.report_run_with_results(
    project=project, column_data=df, metrics=(text_eval_metrics + qa_metrics)
)

The SDK documentation contains more details on Metrics, including example usage, as well as a complete list and description of the available metrics and convenience functions.

Runs

The Run is the core object for recording an application's behavior; when you upload a dataset from usage of your app, it takes the shape of a Run. As such, you can think of a Run as the results from a batch of uses or from a standard example set of uses of your application. When exploring or testing your app's behavior, you will look at the Run in dbnl either in isolation or in comparison to another Run.

What's in a Run?

A Run contains the following:

  • a table of results where each row holds the data related to a single app usage (e.g. a single input and output along with related metadata),

  • a set of Run-level values, also known as scalars,

  • structural information about the components of the app and how they relate, and

  • user-defined metadata for remembering the context of a run.

Your Project will contain many Runs. As you report Runs into your Project, dbnl will build a picture of how your application is behaving, and you will utilize tests to verify that its behavior is appropriate and consistent. Some common usage patterns would be reporting a Run daily for regular checkpoints or reporting a Run each time you deploy a change to your application.

The structure of a Run is defined by its schema. This informs dbnl about what information will be stored in each result (the columns), what Run-level data will be reported (the scalars), and how the application is organized (the components).

Runs live within a selected Project, which serves as an organizing tool for the Runs created for a single app. Runs in DBNL are created from data produced during the normal operation of your app, such as prompts (inputs) and responses (outputs).

Generally, you will use our Python SDK to report Runs. The data associated with each Run is passed to dbnl as pandas DataFrames.

A component is a mechanism for grouping columns based on their role within the app. You can also define an index in your schema to tell dbnl a unique identifier for the rows in your Run results. For more information, see the Run Schema section.

Baseline and Experiment Runs

Throughout our application and documentation, you'll often encounter the terms "baseline" and "experiment". These concepts are specifically related to running tests in dbnl. The Baseline Run defines the comparison point when running a test; the Experiment Run is the Run which is being tested against that comparison point. For more information, see the sections on Setting a Baseline Run and Running Tests.

What Is a Similarity Index?

Similarity Index is a single number between 0 and 100 that quantifies how much your application’s behavior has changed between two runs – a Baseline and an Experiment run. It is Distributional’s core signal for measuring application drift, automatically calculated and available in every Test Session.

A lower score indicates a greater behavioral change in your AI application. Each Similarity Index has accompanying Key Insights with a description to help users understand and act on the behavioral drift that Distributional has detected.

Where You’ll See It in the UI

  • Test Session Summary Page — App-level Similarity Index, results of failed tests, and Key Insights

  • Similarity Report Tab — Breakdown of Similarity Indexes by column and metric

  • Column Details View — Histograms and statistical comparison for specific metrics

  • Tests View — History of Similarity Index-based test pass/fail over time

Why It Matters

When model behavior changes, you need:

  1. A clear signal that drift occurred

  2. An explanation of what changed

  3. A workflow to debug, test, and act

Similarity Index + Key Insights provides all three.

Example:

  • An app’s Similarity Index drops from 93 → 46

  • Key Insight: “Answer similarity has decreased sharply”

  • Metric: levenshtein__generated_answer__expected_answer

Result: Investigate histograms, set test thresholds, adjust model

Hierarchical Structure

Similarity Index operates at three levels:

  • Application Level — Aggregates all lower-level scores

  • Column Level — Individual column-level drift

  • Metric Level — Fine-grained metric change (e.g., readability, latency, BLEU score)

Each level rolls up into the one above it. You can sort by Similarity Index to find the most impacted parts of your app.

Test Sessions and Thresholds

By default, a new DBNL project comes with an Application-level Similarity Index test:

  • Threshold: ≥ 80

  • Failure: Indicates meaningful application behavior change

In the UI:

  • Passed tests are shown in green

  • Failed tests are shown in red with diagnostic details

All past test runs can be reviewed in the test history.

Key Insights

Key Insights are human-readable interpretations of Similarity Index changes. They answer:

“What changed, and does it matter?”

Each Key Insight includes:

  • A plain-language summary: “Distribution substantially drifted to the right”

  • The associated column/metric

  • The Similarity Index for that metric

  • Option to add a test on the spot

Example:

Distribution substantially drifted to the right.

→ Metric: levenshtein__generated_answer__expected_answer

→ Similarity Index: 46

→ Add Test

Insights are prioritized and ordered by impact, helping you triage quickly.

Deep Dive: Column Similarity Details

Clicking into a Key Insight opens a detailed view:

  • Histogram overlays for experiment vs. baseline

  • Summary statistics (mean, median, percentile, std dev)

  • Absolute difference of statistics between runs

  • Links to add similarity or statistical tests on specific metrics

This helps pinpoint whether drift was due to longer answers, slower responses, or changes in generation fidelity.

Frequently Asked Questions

What’s considered “low” similarity?
  • Below 80 = significant drift (default failure threshold)

  • Below 60 = usually signals substantial regression or change

Can I configure the thresholds?

Yes — Similarity Index thresholds can be adjusted, and custom tests can be created at any level (app, column, metric).

Do I need to set anything up to use Similarity Index?

No. Similarity Index runs automatically for all numeric columns that overlap between the Baseline and Experiment Runs, as well as for non-numeric columns with defined metrics.

What columns does Similarity Index apply to?

Only numeric columns and derived metrics (e.g., response time, BLEU, readability). String values are not supported yet.

Example Workflow

  1. Run a test session

  2. Similarity Index < 80 → test fails

  3. Review top-level Key Insights

  4. Click into a metric (e.g., levenshtein__generated_answer__expected_answer)

  5. View distribution shift and statistical breakdown

  6. Add targeted test thresholds to monitor ongoing behavior

  7. Adjust model, prompt, or infrastructure as needed
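The first few steps of this workflow can be sketched with the SDK calls shown in the Quickstart; the project name and run IDs below are placeholders for illustration:

import dbnl

dbnl.login()

project = dbnl.get_or_create_project(name="my-ai-app")

# Use an earlier Run as the comparison point and test a newer Run against it
baseline_run = dbnl.get_run(run_id="run_abc123")
experiment_run = dbnl.get_run(run_id="run_def456")

dbnl.set_run_as_baseline(run=baseline_run)
dbnl.create_test_session(experiment_run=experiment_run)

# Review the resulting Test Session, its Similarity Index, and its Key Insights in the dbnl UI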

The Flow of Data

Your data + DBNL testing == insights about your app's behavior

Distributional uses data generated by your AI-powered app to study its behavior and alert you to valuable insights or worrisome trends. Here is a quick summary of this process:

  • Each app usage involves input(s), the resulting output(s), and context about that usage

    • Example: Input is a question from a user; Output is your app’s answer to that question; Context is the time/day that the question was asked.

  • As the app is used, you record and store the usage in a data store for later review

    • Example: At 2am every morning, an Airflow job parses all of the previous day’s app usages and sends that info to a data store.

  • When data is moved to your data store, it is also submitted to DBNL for testing.

    • Example: The 2am Airflow job is amended to include data augmentation by DBNL Eval and uploading of the resulting Run to trigger automatic app testing.

A Run usually contains many (e.g., dozens or hundreds) rows of inputs + outputs + context, where each row was generated by an app usage. Our insights are statistically derived from the distributions estimated by these rows.

You can read more about the DBNL-specific terms earlier in the documentation. Simply stated, a Run contains all of the data which DBNL will use to test the behavior of your app; insights about your app's behavior will be derived from this data.

DBNL Eval is our library that provides access to common, well-tested GenAI evaluation metrics. You can use DBNL Eval to augment data in your app, such as the inputs and outputs. You are also able to bring your own eval metrics and use them in conjunction with DBNL Eval or standalone. Doing so produces a broader range of tests that can be run, and it allows the platform to produce more powerful insights.
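As a sketch of what such a scheduled job might do with the SDK, reusing the calls shown in the Metrics section (the column names, metric choices, and project name are illustrative):

import dbnl
import dbnl.eval
import pandas as pd

dbnl.login()
project = dbnl.get_or_create_project(name="my-ai-app")

# Yesterday's app usage, pulled from your data store
df = pd.DataFrame({
    "question": ["What is the capital of Assyria?"],
    "answer": ["Nineveh was the capital of Assyria."],
    "expected_answer": ["Nineveh was the capital of Assyria."],
})

# Augment the raw inputs and outputs with evaluation metrics and report the Run,
# which triggers any automatic testing configured for the Project
metrics = [
    dbnl.eval.metrics.token_count("question"),
    dbnl.eval.metrics.rouge1("answer", "expected_answer"),
]
run = dbnl.eval.report_run_with_results(project=project, column_data=df, metrics=metrics)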

Users and Permissions

Discover how dbnl manages user permissions through a layered system of organization and namespace roles—like org admin, org reader, namespace admin, writer, and reader.

Users

A user is an individual who can log into a dbnl organization.

Permissions

Permissions are settings that control access to operations on resources within a dbnl organization. Permissions are made up of two components.

  • Resource: Defines which resource is being controlled by this permission (e.g. projects, users).

  • Verb: Defines which operations are being controlled by this permission (e.g. read, write).

For example, the projects.read permission controls access to the read operations on the projects resource. It is required to be able to list and view projects.

Roles

A role consists of a set of permissions. Assigning a role to a user gives the user all the permissions associated with the role.

Roles can be assigned at the organization or namespace level. Assigning roles at the namespace level allows for giving users granular access to projects and their related data.

Org Roles

An org role is a role that can be assigned to a user within an organization. Org role permissions apply to resources across all namespaces.

There are two default org roles defined in every organization.

Org admin

The org admin role has read and write permissions for all org level resources making it possible to perform organization management operations such as creating namespaces and assigning users roles.

By default, the first user in an org is assigned the org admin role.

Org reader

The org reader role has read-only permissions to org level resources making it possible to navigate the organization by listing users and namespaces.

By default, all users are assigned the org reader role.

Assigning a User an Org Role

To assign a user an org role, go to ☰ > Settings > Admin > Users, scroll to the relevant user and select an org role from the dropdown in the Org Role column.

Assigning a user an org role requires having the org admin role.

Namespace Roles

A namespace role is a role that can be assigned to a user within a namespace. Namespace role permissions only apply to resources defined within the namespace in which the role is assigned.

There are three default namespace roles defined in every organization.

Namespace admin

The namespace admin role has read and write permissions for all namespace level resources within a namespace making it possible to perform namespace management operations such as assigning users roles within a namespace.

By default, the creator of a namespace is assigned the namespace admin role in that namespace.

Namespace writer

The namespace writer role has read and write permissions for all namespace level resources within a namespace except for those resources and operations related to namespace management, such as namespace role assignments.

By default, all users are assigned the namespace writer role in the default namespace.

(Experimental) Namespace reader

The namespace reader role has read-only permissions for all namespace level resources within a namespace.

This is an experimental role that is available through the API, but is not currently fully supported in the UI.

Assigning a User a Namespace Role

To assign a user a namespace role within a namespace, go to ☰ > Settings > Admin > Namespaces, scroll and click on the relevant namespace and then click + Add User.

Assigning a user a namespace role requires having the org admin role or the namespace admin role in that namespace.


Deployment

Instructions for self-hosted deployment options

There are two main options to deploy the dbnl platform as a self-hosted deployment:

  • Helm chart: The dbnl platform can be deployed using a Helm chart to existing infrastructure provisioned by the customer.

  • Terraform module: The dbnl platform can be deployed using a Terraform module on infrastructure provisioned by the module alongside the platform. This option is supported on AWS and GCP.

Which option to choose depends on your situation. The Helm chart provides maximum flexibility, allowing users to provision their infrastructure using their own processes, while the Terraform module provides maximum simplicity, reducing the installation to a single Terraform command.


Distributional Concepts

Understanding key concepts and their role relative to your app

Adaptive testing for AI applications requires more information than standard deterministic testing. This is because:

  • AI applications are multi-component systems where changes in one part can affect others in unexpected ways. For instance, a change in your vector database could affect your LLM's responses, or updates to a feature pipeline could impact your machine learning model's predictions.

  • AI applications are non-stationary, meaning their behavior changes over time even if you don't change the code. This happens because the world they interact with changes - new data comes in, language patterns evolve, and third-party models get updated. A test that passes today might fail tomorrow, not because of a bug, but because the underlying conditions have shifted.

  • AI applications are non-deterministic. Even with the exact same input, they might produce different outputs each time. Think of asking an LLM the same question twice - you might get two different, but equally valid, responses. This makes it impossible to write traditional tests that expect exact matches.

To account for this, each time you want to measure the behavior of the AI application, you will need to:

  1. Record outcomes at all of the app’s components, and

  2. Push a distribution of inputs through the app to study behavior across the full spectrum of possible app usage.

The inputs, outputs, and outcomes associated with a single app usage are grouped in a Result, with each value in a result described as a Column. The group of results that are used to measure app behavior is called a Run. To determine if an app is behaving as expected, you create a Test, which involves statistical analysis on one or more runs. When you apply your tests to the runs that you want to study, you create a Test Session, which is a permanent record of the behavior of an app at a given time.
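As a brief sketch, these concepts map onto the SDK calls used throughout this documentation roughly as follows (the data and names are illustrative):

import dbnl
import pandas as pd

dbnl.login()
project = dbnl.get_or_create_project(name="concepts-demo")

# Each row of the DataFrame is a Result; each key is a Column
results = pd.DataFrame({
    "question": ["Is 2 even or odd?", "Is 3 even or odd?"],
    "answer": ["even", "odd"],
})

# The group of Results reported together forms a Run
run = dbnl.report_run_with_results(project=project, display_name="nightly batch", column_data=results)

# Tests compare one or more Runs; applying them produces a Test Session.
# For example, set this Run as the baseline and test a later Run against it:
dbnl.set_run_as_baseline(run=run)
# dbnl.create_test_session(experiment_run=some_later_run)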

Tokens

Tokens are used for programmatic access to the dbnl platform.

Personal Access Tokens

A personal access token is a token that can be used for programmatic access to the dbnl platform through the SDK.

Personal access tokens are implemented using JSON Web Tokens and are not persisted. Tokens cannot be recovered if lost, and a new token will need to be created.

Tokens are not revocable at this time. Please remember to keep your tokens safe.

Permissions

A personal access token has the same permissions as the user that created it. See Users and Permissions for more details about permissions.

Token permissions are resolved at use time, not creation time. As such, changing the user permissions after creating a personal access token will change the permissions of the personal access token.

Create a Personal Access Token

To create a new personal access token, go to ☰ > Personal Access Tokens and click Create Token.
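Once created, the token is typically supplied through the reserved environment variables and picked up by dbnl.login(); a minimal sketch (the values are placeholders):

import os

import dbnl

# Normally these are exported in your shell; set here only for illustration
os.environ["DBNL_API_URL"] = "api.dbnl.com"
os.environ["DBNL_API_TOKEN"] = "YOUR_PERSONAL_ACCESS_TOKEN"

# login() authenticates against the API using the token above
dbnl.login()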

Setting a Baseline Run

You can set a default Baseline Run to be used in all Test Sessions either via the UI or the SDK. Additionally, you can create a Run Query to make your Baseline Run dynamic for each Test Session.

The "Baseline Run" is a core concept in dbnl that, appropriately, refers to the Run used as a baseline when executing a Test Session. Conversely, the Run being tested is called the "Experiment Run". Any tests that compare statistics will test the values in the experiment relative to the baseline.

You can choose a Baseline Run at the time of Test Session creation. If you do not provide one, dbnl will use your Project's default Baseline Run. See Running Tests for more information.

Setting a Default Baseline Run

From your Project, click the "Test Configuration" tab. Choose a Run or Run Query from the Baseline Run dropdown.

You can also set a Run as baseline via the SDK using the set_run_as_baseline or set_run_query_as_baseline functions:

import dbnl
dbnl.login()

# Get a reference to a Run either by creating one or fetching by ID
run = dbnl.get_run(run_id="run_abc123") # or dbnl.report_run_with_results
dbnl.set_run_as_baseline(run=run)

# You can also use a Run Query for a dynamic baseline
project = dbnl.get_or_create_project(name="My Project")
run_query = dbnl.create_run_query(
  project=project,
  name="Look back 3 runs",
  query={
    "offset_from_now": 3,
  },
)
dbnl.set_run_query_as_baseline(run_query=run_query)

Dynamic Baseline (Run Queries)

Depending on your use case, you may want to make your Baseline Run dynamic. You can use a Run Query for this. Currently, dbnl supports setting a Run Query that looks back a number of previous runs. For example, in a production testing use case, you may want to use the previous Run as the baseline for each Test Session, so you'd create a Run Query that looks back 1 run. See the UI example in the Setting a Default Baseline Run section for information on how to create a Run Query. You can also create a Run Query via the SDK.

Projects

What's in a Project?

Projects are the main organizational tool in dbnl. Each Project lives within a Namespace in your organization and is accessible by everyone in that Namespace. Generally, you'll create one Project for every AI application that you'd like to test with dbnl. The list of Projects available to you is the default landing page when browsing to the dbnl UI.

Your Project will contain all of your app's Runs — a collection of results from your app — and all of the tests that you've defined to monitor the behavior of your app. It also has a name and various configurable properties like a default Baseline Run and Notifications.

Creating a Project

From Scratch

You can create a Project either via the UI or the SDK:

Simply click the "Create Project" button from the Project list view.

Creating a project with the SDK can be done easily with the create_project function:

import dbnl
dbnl.login()

project = dbnl.create_project(
    name="My Project",
    description="This is a very important project."
)

Copying a Project

You can quickly copy an existing Project to get up and running with a new one. This will copy the following items into your new Project:

  • Test specifications

  • Test tags

  • Notification rules

There are a couple of ways to copy a Project.

Exporting and Importing

Any Project can be exported to a JSON file; that JSON file can then be adjusted to your liking and imported as a new Project. This is doable both via the UI and the SDK:

Exporting

To export a Project, simply click the download icon on the Project page, in the header.

This will download the Project's JSON to your computer. There is an example JSON in the expandable section below.

Importing

Once you have a Project JSON, you can edit it as you'd like, and then import it by clicking the "Create Project" button on the Project list and then clicking the "Import from File" tab.

Fill out the name and description, click "Create Project", and you're all set!

Exporting and importing a Project is done easily via the SDK functions export_project_as_json and import_project_from_json.

import dbnl
dbnl.login()

# Export
project_1 = dbnl.get_or_create_project(name="Existing Project")
export_json = dbnl.export_project_as_json(project=project_1)
# Adjust the project values as you'd like. You will need to change the name.
# An example of the JSON structure is in the collapsible section below
export_json["project"]["name"] = "New Project"

# Import
project_2 = dbnl.import_project_from_json(params=export_json)
Sample Project JSON
{
    "project": {
        "name": "My Project",
        "description": "This is my project."
    },
    "notification_rules": [
        {
            "conditions": [
                {
                    "assertion_name": "less_than",
                    "assertion_params": { "other": 0.85 },
                    "query_name": "test_status_percentage_query",
                    "query_params": {
                        "exclude_tag_ids": [],
                        "include_tag_ids": [],
                        "require_tag_ids": [],
                        "statuses": ["PASSED"]
                    }
                }
            ],
            "name": "Alert if passed tests are less than 85%",
            "notification_integration_names": ["Notification channel"],
            "status": "ENABLED",
            "trigger": "test_session.failed"
        }
    ],
    "tags": [
        {
            "name": "my-tag",
            "description" :"This is my tag."
        }
    ],
    "test_specs": [
        {
            "assertion": { "name": "less_than", "params": { "other": 0.5 } },
            "description": "Testing the difference in the example statistic",
            "name": "Gr.0: Non Parametric Difference: Example_Statistic",
            "statistic_inputs": [
                {
                    "select_query_template": {
                        "filter": null,
                        "select": "{EXPERIMENT}.Example_Statistic"
                    }
                },
                {
                    "select_query_template": {
                        "filter": null,
                        "select": "{BASELINE}.Example_Statistic"
                    }
                }
            ],
            "statistic_name": "my_stat",
            "statistic_params": {},
            "tag_names": ["my-tag"]
        }
    ]
}

Copying

You can also just directly copy a given Project. Again, this can be done via the UI or the SDK:

There are two ways to copy a Project from the UI:

From the "Create Project" Modal

In the Project list, after you click "Create Project", you can navigate to the "Copy Existing" tab and choose a Project from the dropdown.

From a Project Page

While viewing a Project, you can click the copy icon in the header to copy it to a new Project.

Copying a Project is done easily via the SDK function copy_project.

import dbnl
dbnl.login()


project_1 = dbnl.get_or_create_project(name="Existing Project")
project_2 = dbnl.copy_project(project=project_1, name="New Project")

Reporting Runs

The full process of reporting a Run ultimately breaks down into three steps:

  1. Creating the Run, which includes defining its structure and any relevant metadata

  2. Reporting the results of the Run, which include columnar data and scalars

  3. Closing the Run to mark it as complete once reporting is finished

Each of these steps can be done separately via our SDK, but it can also be done conveniently with a single SDK function call: dbnl.report_run_with_results, which is recommended. See Putting it All Together below.

We also have an eval library available that lets you generate useful metrics on your columns and report them to dbnl alongside the Run results. Check out the Metrics section for more information.

Creating a Run

The important parts of creating a run are providing identifying information — in the form of a name and metadata — and defining the structure of the data you'll be reporting to it. As mentioned in the previous section, this structure is called the Run Schema.

Run Schema

A Run schema defines four aspects of the Run's structure:

  • Columns (the data each row in your results will contain)

  • Scalars (any Run-level data you want to report)

  • Index (which column or columns uniquely identify rows in your results)

  • Components (functional groups to organize the reported results in the form of a graph)

In older versions of dbnl, the job of the schema was done by something called the "Run Config". The Run Config has been fully deprecated, and you should check the SDK reference and update any code you have.

You can learn more about creating a Run schema in the SDK reference for dbnl.create_run_schema. There is also a function to create a Run, but we recommend the method shown in the section below.

Columns

Columns are the only required part of a schema and are core to reporting Runs, as they define the shape your results will take. You report your column schema as a list of objects, which contain the following fields:

  • name: The name of the column

  • type: The type of the column, e.g. int. For a list of available types, see the SDK reference.

  • description: A descriptive blurb about what the column is

  • component: Which part of your application the column belongs to (see Components below)

Example Columns JSON
[
    { 
        "name": "error_type",
        "type": "category",
        "component": "classifier"
    },
    {
        "name": "email",
        "type": "string",
        "description": "raw email text content from source",
        "component": "input"
    },
    { 
        "name": "spam-pred",
        "type": "boolean",
        "component": "classifier"
    },
    {
        "name": "email_id",
        "type": "string",
        "description": "unique id for each email"
    }
]

Scalars

Scalars represent any data that live at the Run level; that is, they represent single data points that apply to your entire Run. For example, you may want to calculate an F1 score for the entirety of a result set for your model. The scalar schema is also a list of objects, and takes on the same fields as the column schema above.

Example Scalars JSON
[
    {
        "name": "model_F1",
        "type": "float",
        "description": "F1 Score",
        "component": "classifier"
    },
    { 
        "name": "model_recall",
        "type": "float",
        "description": "Model Recall",
        "component": "classifier"
    }
]

Index

Using the index field within the schema, you have the ability to designate Unique Identifiers – specific columns which uniquely identify matching results between Runs. Adding this information facilitates more direct comparisons when testing your application's behavior and makes it easier to explore your data.

Example Index JSON
["email_id"]

Components

Components are defined within the components_dag field of the schema. This defines the topological structure of your app as a Directed Acyclic Graph (DAG). Using this, you can tell dbnl which part of your application different columns correspond to, enabling a more granular understanding of your app's behavior.

Example Components JSON
// Each key defines a component, and the corresponding list defines the
// components downstream from it in your DAG
{
    "input": ["classifier"],
    "classifier": []
}

Note that if you do not provide a schema when you report a run, dbnl will infer one from the structure of the results you've uploaded. You can additionally still provide an index parameter directly to the report_run_with_results function.
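For example, a minimal report that relies on schema inference while still declaring the index might look like the following sketch (the DataFrame contents are illustrative):

import dbnl
import pandas as pd

dbnl.login()
proj = dbnl.get_or_create_project(name="My Project")

df = pd.DataFrame({
    "email_id": ["1", "2"],
    "email": ["Hello, I am interested in your product.", "Congratulations! You've won a lottery."],
    "spam-pred": [False, True],
})

# No run_schema provided: dbnl infers the columns from the DataFrame,
# while the index parameter still identifies matching rows across Runs
run = dbnl.report_run_with_results(project=proj, column_data=df, index=["email_id"])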

Reporting Run Results

Once you've defined the structure of your run, you can upload data to dbnl to report the results of that run. As mentioned above, there are two kinds of results from your run:

  • The row-level column results (these each represent the data of a single "usage" of your application)

  • The Run-level scalar results (these represent data that apply to all usages in your Run as a whole)

dbnl expects you to upload your results data in the form of a pandas DataFrame. Note that scalars can be uploaded as a single-row DataFrame or as a dictionary of values.

Check out the section on metrics to see how dbnl can supplement your results with more useful data.

There are functions to upload column results and scalar results in the SDK, but, again, we recommend the method in the section below!

Example Results
import pandas as pd

column_results = pd.DataFrame({
    "error_type": ["none", "none", "none", "none"],
    "email": [
        "Hello, I am interested in your product. Please send me more information.",
        "Congratulations! You've won a lottery. Click here to claim your prize.",
        "Hi, can we schedule a meeting for next week?",
        "Don't miss out on this limited time offer! Buy now and save 50%."
    ],
    "spam-pred": [False, True, False, True],
    "email_id": ["1", "2", "3", "4"]
})

scalar_results = pd.DataFrame({
    "model_F1": [0.8],
    "model_recall": [0.74]
})
# Above is equivalent to:
scalar_results = {
    "model_F1": 0.8,
    "model_recall": 0.74 
}

Closing a Run

Once you're finished uploading results to dbnl for your Run, the run should be closed to mark it as ready to be used in Test Sessions. Note that reporting results to a Run will overwrite any existing results, and, once closed, the Run can no longer have results uploaded. If you need to close a Run, there is an SDK function for it, or you can close an open Run from its page in the UI.
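If you do report results step by step rather than with report_run_with_results, the flow looks roughly like the sketch below. The function names dbnl.create_run, dbnl.report_column_results, and dbnl.close_run are assumptions here; check the SDK reference for the exact names and signatures in your version.

import dbnl
import pandas as pd

dbnl.login()
proj = dbnl.get_or_create_project(name="My Project")

# Assumed step-by-step helpers; verify against the SDK reference
run = dbnl.create_run(
    project=proj,
    run_schema=dbnl.create_run_schema(columns=[
        {"name": "email_id", "type": "string"},
        {"name": "spam-pred", "type": "boolean"},
    ]),
)

dbnl.report_column_results(run=run, data=pd.DataFrame({
    "email_id": ["1", "2"],
    "spam-pred": [False, True],
}))

# Closing marks the Run as complete and ready to be used in Test Sessions
dbnl.close_run(run=run)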

Putting it All Together

Now that you understand each step, you can easily integrate all of this into your codebase with a few simple function calls via our SDK:

import dbnl
import pandas as pd
dbnl.login()


proj = dbnl.get_or_create_project(name="My Project")
run_schema = dbnl.create_run_schema(
    columns=[
        {"name": "error_type", "type": "category", "component": "classifier"},
        {"name": "email", "type": "string", "description": "raw email text content from source", "component": "input"},
        {"name": "spam-pred", "type": "boolean", "component": "classifier"},
        {"name": "email_id", "type": "string", "description": "unique id for each email"},
    ],
    scalars=[
        {
            "name": "model_F1",
            "type": "float",
            "description": "F1 Score",
            "component": "classifier"
        },
        { 
            "name": "model_recall",
            "type": "float",
            "description": "Model Recall"
        }
    ],
    index=["email_id"],
    components_dag={
        "input": ["classifier"],
        "classifier": [],
    }
)
# Creates the run, reports results, and closes the run.
run = dbnl.report_run_with_results(
    project=proj,
    display_name="Run 1 of Email Classifier",
    run_schema=run_schema,
    column_data=pd.DataFrame({
        "error_type": ["none", "none", "none", "none"],
        "email": [
            "Hello, I am interested in your product. Please send me more information.",
            "Congratulations! You've won a lottery. Click here to claim your prize.",
            "Hi, can we schedule a meeting for next week?",
            "Don't miss out on this limited time offer! Buy now and save 50%."
        ],
        "spam-pred": [False, True, False, True],
        "email_id": ["1", "2", "3", "4"]
    }),
    scalar_data={
        "model_F1": 0.8,
        "model_recall": 0.74 
    }
)

Architecture

An overview of the architecture for the dbnl platform

The dbnl platform architecture consists of a set of services packaged as Docker images and a set of standard infrastructure components.

Infrastructure

The dbnl platform requires the following infrastructure:

  • A Kubernetes cluster to host the dbnl platform services.

  • A PostgreSQL database to store metadata.

  • An object store bucket to store raw data.

  • A Redis database to serve as a messaging queue.

  • A load balancer to route traffic to the API or UI service.

Services

The dbnl platform consists of three core services:

  • The API service (api-srv) serves the dbnl API and orchestrates work across the dbnl platform.

  • The worker service (worker-srv) processes async jobs scheduled by the API service.

  • The UI service (ui-srv) serves the dbnl UI assets.


Tests

Tests are the key tool within dbnl for asserting the behavior and consistency of Runs. Possible goals during testing can include:

  • Asserting that your application, holistically or for a chosen column, behaves consistently compared to a baseline.

  • Asserting that a chosen column meets its minimum desired behavior (e.g., inference throughput).

  • Asserting that a chosen column has a distribution that roughly matches a baseline reference.

By default, your Project will be pre-populated with a test for the first goal above. This is the "App Similarity Index" test, which gives you a quick understanding of whether your application's behavior has significantly deviated from a selected baseline.

What's in a Test?

At a high level, a Test is a statistic and an assertion. Generally, the statistic aggregates the data in a column or columns, and the assertion tests some truth about that aggregation. This assertion may check the values from a single Run, or it may check how the values in a Run have changed compared to a baseline. Some basic examples:

  1. Assert the 95th percentile of app_latency_ms is less than or equal to 180

Test Spec JSON
{
    "name": "p95_app_latency_ms",
    "description": "Test the 95th percentile of latency in milliseconds",
    "statistic_name": "percentile",
    "statistic_params": {"percentage": 0.95},
    "assertion": {
        "name": "less_than_or_equal_to",
        "params": {
            "other": 180.0,
        },
    },
    "statistic_inputs": [
        {
            "select_query_template": {
                "select": "{EXPERIMENT}.app_latency_ms"
            }
        },
    ],
}
  2. Assert the absolute difference of the median of positive_sentiment_score against the baseline is close to 0

Test Spec JSON
{
    "name": "median_sentiment_similar",
    "description": "Test the absolute difference of median on sentiment",
    "statistic_name": "abs_diff_median",
    "statistic_params": {},
    "assertion": {
        "name": "close_to",
        "params": {
            "other": 0.0,
            "tolerance": 0.01,
        },
    },
    "statistic_inputs": [
        {
            "select_query_template": {
                "select": "{EXPERIMENT}.positive_sentiment_score"
            }
        },
        {
            "select_query_template": {
                "select": "{BASELINE}.positive_sentiment_score"
            }
        },
    ],
}

In the next sections, we will explore the objects required for testing alongside the methods for creating tests, running tests, reviewing/analyzing tests, and some best practices.

Terraform Module

Terraform module installation instructions

The Terraform module option provides maximum simplicity. It provisions all the required infrastructure and permissions in your cloud provider of choice before deploying the dbnl platform Helm chart, removing the need to provision any infrastructure or permission separately.

Prerequisites

The following prerequisite steps are required before starting the Terraform module installation.

Terraform modules are available for AWS and GCP. For access to the Terraform module for your cloud provider of choice and to get registry credentials, please reach out to our team.

Configuration

To configure the Terraform module, you will need:

  • A domain name to host the dbnl platform (e.g. dbnl.example.com).

  • A set of dbnl registry credentials to pull the dbnl artifacts (e.g. Docker images, Helm charts).

  • An RSA key pair to sign the API tokens.

An RSA key pair can be generated with:

openssl genrsa -out dbnl_dev_token_key.pem 2048

Requirements

On the environment from which you are planning to install the module, you will need to:

Infrastructure

At a minimum, the user performing the installation needs to be able to provision the following infrastructure:

  • Identity and Access Management (IAM)

  • Virtual Private Cloud (VPC)

  • Elastic Kubernetes Service (EKS)

  • AWS Certificate Manager (ACM)

  • Application Load Balancer (ALB)

Installation

Steps

The Terraform module can be installed using the Terraform CLI. We recommend using a remote backend to manage the Terraform state. The steps are as follows:

  1. Create a dbnl folder and change to it.

mkdir dbnl
cd dbnl

  2. Create a modules folder and copy the terraform module to it.

mkdir modules
cp -R /path/to/dbnl/module modules/terraform-aws-dbnl

  3. Create a variables.tf file.

  4. Create a main.tf file.

  5. Create a dbnl.tfvars file.

  6. Initialize the Terraform module.

  7. Apply the Terraform module.


Options

For more details on all the installation options, see the Terraform module README file and examples folder.

OIDC Authentication

OIDC configuration options

Configuration

OIDC can be configured using the following options in the dbnl Helm chart or Terraform module:

  • audience

  • clientId

  • issuer

  • scopes

Instructions on how to get those options for each provider can be found below.

Creating Tests

As you become more familiar with the behavior of your application, you may want to build on the default App Similarity Index test with tests that you define yourself. Let's walk through that process.

Designing a Test

The first step in coming up with a test is determining what behavior you're interested in. As described in the Runs section, each Run of your application reports its behavior via its results, which are organized into columns (and scalars). Once you've identified the column or scalar you'd like to test on, you need to determine what statistic you'd like to apply to it and the assertion you'd like to make on that statistic.

This might seem like a lot, but dbnl has your back! While you can define tests manually, dbnl has several ways of helping you identify what columns you might be interested in and letting you quickly define tests on them.

When creating a test, you can specify tags to apply to it. You can use these tags to filter which tests you want to include or exclude later when running tests. Some of the test creation shortcuts in the UI do not currently allow specifying tags, but you can edit the test and add tags after the fact.

Context-Driven Test Creation

As you browse the dbnl UI, you will see "+" icons or "+ Add Test" buttons appear. These provide context-aware shortcuts for easily creating relevant tests.

At each of these locations, a test creation drawer will open on the right side of the page with several of the fields pre-populated based on the context of the button, alongside a history of the statistic, if relevant. Here are some of the best places to look for dbnl-assisted test creation:

Key Insights

When you're looking at a Test Session, dbnl will provide insights about which columns or metrics have demonstrated the most drift. These are great candidates to define tests on if you want to be specifically alerted about their behavior. You can click the "Add Test" button to create a test on the Similarity Index of the relevant column. The Similarity Index history graph can help guide you on choosing a threshold.

Column or Metric Details

When inspecting the details of a column or metric from a Test Session, there are several "Add Test" buttons provided to allow you to quickly create a test on a relevant statistic. The Statistic History graph can help guide you on choosing a threshold.

Summary Statistics Table

When viewing a Run, each entry in the summary statistics table can be used to seed creation of a test for that chosen statistic.

And More!

These shortcuts appear in several other places in the UI as well when you are inspecting your Runs and Test Sessions; keep an eye out for the "+"!

Templated Tests

Test templates are macros for basic test patterns recommended by Distributional. They allow the user to quickly create tests from a builder in the UI. Distributional provides five classes of test templates:

  • Parametric statistics of a column.

  • Tests of whether the absolute difference of a statistic of a column between two runs is less than a threshold.

  • Tests of whether a column from two different runs is similarly distributed, using a nonparametric statistic.

  • Tests on the row-wise absolute difference of results.

  • Tests of the signed difference of a statistic of a column between two runs.

From the Test Configuration tab on your Project, click the dropdown next to "Add Test".

Select from one of the five options. A Test Creation drawer will appear and the user can edit the statistic, column, and assertion that they desire. Note that each Test Template has a limited set of statistics that it supports.

Creating Tests Manually (Advanced)

If you have a good idea of what you want to test or just want to explore, you can create tests manually from either the UI or via the Python SDK.

Let's say you are building a Q&A chatbot, and you have a column for the length of your bot's responses, word_count. Perhaps you want to ensure that your bot never outputs more than 100 words; in that case, you'd choose:

  • The statistic max,

  • The assertion less than or equal to ,

  • and the threshold 100.

But what if you're not opinionated about the specific length? You just want to ensure that your app is behaving consistently as it runs and doesn't suddenly start being unusually wordy or terse. dbnl makes it easy to test that as well; you might go with:

  • The statistic absolute difference of mean,

  • The assertion less than,

  • and the threshold 20.

Now you're ready to go and create that test, either via the UI or the SDK:

From your Project, click the "Test Configuration" tab.

Next to the "My Tests" header, you can click "Add Test" to open the test creation page, which will enable you to define your test through the dropdown menu on the left side of the window.

On the left side you can configure your test by choosing a statistic and assertion. Note that you can use our builder or build a test spec with raw JSON (you can see some example test spec JSONs in the Tests section). On the right, you can browse the data of recent Runs to help you figure out what statistics and thresholds are appropriate to define acceptable behavior.

When creating a test manually, you can also specify filters to apply the test only to specific rows within your Runs. Check out the SDK reference for more information.

Tests can also be created using the Python SDK. Users must provide a JSON dictionary that adheres to the dbnl Test Spec, which is described in the SDK reference and has an example provided below.

You can see a full list with descriptions of available statistics and assertions in the SDK reference.
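For the word_count example above, the corresponding Test Spec can be sketched as a Python dictionary like the one below, following the structure of the Test Spec JSON examples in the Tests section (the statistic and assertion names match the ones described on this page; verify them against the full list of available statistics and assertions):

# Sketch of a Test Spec asserting the bot never outputs more than 100 words
word_count_test_spec = {
    "name": "max_word_count",
    "description": "Assert that word_count never exceeds 100",
    "statistic_name": "max",
    "statistic_params": {},
    "assertion": {
        "name": "less_than_or_equal_to",
        "params": {"other": 100},
    },
    "statistic_inputs": [
        {
            "select_query_template": {
                "select": "{EXPERIMENT}.word_count"
            }
        },
    ],
}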



The dbnl platform uses OpenID Connect (OIDC) for authentication. OIDC providers that are known to work with dbnl include Auth0, Microsoft Entra ID, and Okta:

  1. Follow the Auth0 documentation to create a new SPA (single page application).

    1. In Settings > Application URIs, add the dbnl deployment domain to the list of Allowed Callback URLs (e.g. dbnl.mydomain.com).

  2. Navigate to Settings > Basic Information and copy the Client ID as the OIDC clientId option.

  3. Navigate to Settings > Basic Information and copy the Domain and prepend with https:// to use as the OIDC issuer option (e.g. https://my-app.us.auth0.com/).

  4. Follow the Auth0 documentation to create a custom API.

    1. Use your dbnl deployment domain as the Identifier (e.g. dbnl.mydomain.com).

  5. Navigate to Settings > General Settings and copy the Identifier as the OIDC audience option.

  6. Set the OIDC scopes option to "openid profile email".

  1. Follow the Microsoft Entra ID documentation to create a new SPA (single page application) and enable OIDC.

    1. Add the dbnl deployment domain as the callback URL (e.g. dbnl.mydomain.com).

  2. [Optional] Follow the Microsoft Entra ID documentation to restrict access to certain users.

  3. Navigate to App Registrations > (Application) > Manage > API permissions and add the Microsoft Graph email, openid and profile permissions to the application.

  4. Navigate to App Registrations > (Application) > Manage > Manifest and set access token version to 2.0 with "accessTokenAcceptedVersion": 2 .

  5. Navigate to App Registrations > (Application) > Manage > Token configuration > Add optional claim > Access > email to add the email optional claim to the access token type.

  6. Navigate to App Registrations > (Application) and copy the Application (client) ID (APP_ID) to be used as the OIDC clientId and OIDC audience options.

  7. Set the OIDC issuer option to https://login.microsoftonline.com/{APP_ID}/v2.0 .

  8. Set the OIDC scopes option to "openid email profile {APP_ID}/.default".

    1. Set the Sign-in redirect URIs to your dbnl domain (e.g. dbnl.mydomain.com)

  1. Navigate to General > Client Credentials and copy the Client ID to be used as the OIDC clientId option.

  2. Navigate to Sign on > OpenID Connect ID Token and copy the Issuer URL to be used as the OIDC issuer and OIDC audience options.

  3. Set the OIDC scopes option to "openid email profile" .

The first step in coming up with a test is determining what behavior you're interested in. As described in the section on Runs, each Run of your application reports its behavior via its results, which are organized into columns (and scalars). Once you've identified the column or scalar you'd like to test on, you need to determine what statistic you'd like to apply to it and the assertion you'd like to make on that statistic.

This might seem like a lot, but dbnl has your back! While you can define tests manually, dbnl has several ways of helping you identify which columns you might be interested in and letting you quickly define tests on them.

When creating a test, you can specify tags to apply to it. You can use these tags to filter which tests you want to include or exclude later when creating a Test Session. Some of the test creation shortcuts in the UI do not currently allow specifying tags, but you can edit the test and add tags after the fact.

When you're looking at a Test Session, dbnl will provide insights about which columns or metrics have demonstrated the most drift. These are great candidates to define tests on if you want to be specifically alerted about their behavior. You can click the "Add Test" button to create a test on the Similarity Index of the relevant column. The Similarity Index history graph can help guide you on choosing a threshold.

Single Run: These are parametric statistics of a column.

Similarity of Statistics: These test whether the absolute difference of a statistic of a column between two runs is less than a threshold.

Similarity of Distributions: These test whether a column from two different runs is similarly distributed, using a nonparametric statistic.

Similarity of Results: These are tests on the row-wise absolute difference of results.

Difference of Statistics: These test the signed difference of a statistic of a column between two runs.

When creating a test manually, you can also specify filters to apply the test only to specific rows within your Runs. Check out Using Filters in Tests for more information.

On the left side, you can configure your test by choosing a statistic and assertion. Note that you can use our builder or build a test spec with raw JSON (you can see an example test spec JSON below). On the right, you can browse the data of recent Runs to help you figure out what statistics and thresholds are appropriate to define acceptable behavior.

Tests can be created using the Python SDK. Users must provide a JSON dictionary that adheres to the dbnl Test Spec; an example is provided below.

You can see a full list with descriptions of available statistics and assertions in the Available Statistics and Assertions section.

Follow the Okta instructions to create a new SPA (single page application) and enable OIDC.

openssl genrsa -out dbnl_dev_token_key.pem 2048
mkdir dbnl
cd dbnl
mkdir modules
cp -R /path/to/dbnl/module modules/terraform-aws-dbnl
variable "oidc_audience" {
  type        = string
  description = "OIDC audience."
}

variable "oidc_client_id" {
  type        = string
  description = "OIDC client id."
}

variable "oidc_issuer" {
  type        = string
  description = "OIDC issuer."
}

variable "oidc_scopes" {
  type        = string
  description = "OIDC scopes."
  default     = "openid profile email"
}

variable "domain" {
  description = "Domain to deploy to."
  type        = string
}

variable "dev_token_private_key_pem" {
  type        = string
  description = "Dev token private key PEM."
  sensitive   = true
}

variable "registry_username" {
  type        = string
  description = "Artifact registry username."
  sensitive   = true
}

variable "registry_password" {
  type        = string
  description = "Artifact registry password."
  sensitive   = true
}
provider "aws" {
  # Configure AWS provider with target AWS account.
}

provider "kubernetes" {
  host                   = module.dbnl.cluster_endpoint
  cluster_ca_certificate = base64decode(module.dbnl.cluster_ca_cert)
  exec {
    api_version = "client.authentication.k8s.io/v1beta1"
    args        = ["eks", "get-token", "--cluster-name", module.dbnl.cluster_name]
    command     = "aws"
  }
}

provider "helm" {
  kubernetes {
    host                   = module.dbnl.cluster_endpoint
    cluster_ca_certificate = base64decode(module.dbnl.cluster_ca_cert)
    exec {
      api_version = "client.authentication.k8s.io/v1beta1"
      args        = ["eks", "get-token", "--cluster-name", module.dbnl.cluster_name]
      command     = "aws"
    }
  }
}

module "dbnl" {
  source = "./modules/terraform-aws-dbnl"

  instance_size = "medium"
  
  oidc_audience  = var.oidc_audience
  oidc_client_id = var.oidc_client_id
  oidc_issuer    = var.oidc_issuer
  oidc_scopes    = var.oidc_scopes

  domain = var.domain
  
  dev_token_private_key = var.dev_token_private_key_pem
    
  registry_username = var.registry_username
  registry_password = var.registry_password
}
# For more details on OIDC options, see OIDC Authentication section.
oidc_audience  = "oidc.example.com"
oidc_client_id = "xxxxxxxx"
oidc_issuer    = "yyyyyyyy"
oidc_scopes    = "openid email profile"

domain = "dbnl.example.com"
terraform init
terraform apply \
    -var-file="dbnl.tfvars" \
    -var="dev_token_private_key=${DBNL_DEV_TOKEN_PRIVATE_KEY}" \
    -var="registry_username=${DBNL_REGISTRY_USERNAME}" \
    -var="registry_password=${DBNL_REGISTRY_PASSWORD}"
import dbnl
dbnl.login()

proj = dbnl.get_or_create_project(name="My Project")

dbnl.experimental.create_test(
    test_spec_dict={
        "project_id": proj.id,
        "name": "Word count difference",
        "description": "Test the absolute difference of mean on word_count",
        "statistic_name": "abs_diff_mean",
        "statistic_params": {},
        "assertion": {
            "name": "less_than",
            "params": {
                "other": 20.0,
            },
        },
        "statistic_inputs": [
            {
                "select_query_template": {
                    "select": "{EXPERIMENT}.word_count"
                }
            },
            {
                "select_query_template": {
                    "select": "{BASELINE}.word_count"
                }
            },
        ],
    }
)

Networking

List of networking requirements

Ingress

Requirements

The dbnl platform needs to be hosted on a domain or subdomain (e.g. dbnl-example.com or dbnl.example.com). It cannot be hosted on a subpath.

HTTPS/SSL

It is recommended that the dbnl platform be served over HTTPS. Support for SSL termination at the load balancer is included.

Egress

Requirements

Currently, the dbnl platform cannot run in an air-gapped environment and requires a few URLs to be accessible via egress.

Artifacts Registry

Required to fetch the dbnl platform artifacts such as the Helm chart and Docker images.

  • https://us-docker.pkg.dev/dbnlai/

Object Store

Required for services to access the object store.

  • https://{BUCKET}.s3.amazonaws.com/​ (if using S3)

  • https://storage.googleapis.com/{BUCKET} (if using GCS)

OIDC

Required to validate OIDC tokens.

  • https://login.microsoftonline.com/{APP_ID}/v2.0/ (if using Microsoft EntraID)

  • https://{ACCOUNT}.okta.com/ (if using Okta)

Integrations

Required to use some integrations.

  • https://events.pagerduty.com/v2/enqueue​ (if using PagerDuty)

  • https://hooks.slack.com/services/ (if using Slack)

Functions

abs

Returns the absolute value of the input.

Syntax

abs(expr)

add

Adds the two inputs.

Syntax

add(expr1, expr2)

and

Logical and operation of two or more boolean columns.

Syntax

and(expr1, expr2)

automated_readability_index

Returns the ARI (Automated Readability Index), which outputs a number that approximates the grade level needed to comprehend the text. For example, if the ARI is 6.5, then the grade level needed to comprehend the text is 6th to 7th grade.

Syntax

automated_readability_index(expr)
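
For reference, the standard ARI formula is sketched below in Python; the exact character, word, and sentence counting used by this function may differ.

# Textbook ARI formula (illustrative; not necessarily the exact implementation).
def ari(characters: int, words: int, sentences: int) -> float:
    return 4.71 * (characters / words) + 0.5 * (words / sentences) - 21.43

ari(characters=480, words=100, sentences=8)  # ~7.4, i.e. roughly 7th grade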

bleu

Computes the BLEU score between two columns.

Syntax

bleu(expr1, expr2)

character_count

Returns the number of characters in a text column.

Syntax

character_count(expr)

Aliases

  • num_chars

divide

Divides the two inputs.

Syntax

divide(expr1, expr2)

equal_to

Computes the element-wise equal to comparison of two columns.

Syntax

equal_to(expr1, expr2)

Aliases

  • eq

filter

Filters a column using another column as a mask.

Syntax

filter(expr1, expr2)

flesch_kincaid_grade

Returns the Flesch-Kincaid Grade of the given text. This is a grade formula in that a score of 9.3 means that a ninth grader would be able to read the document.

Syntax

flesch_kincaid_grade(expr)

greater_than

Computes the element-wise greater than comparison of two columns. input1 > input2

Syntax

greater_than(expr1, expr2)

Aliases

  • gt

greater_than_or_equal_to

Computes the element-wise greater than or equal to comparison of two columns. input1 >= input2

Syntax

greater_than_or_equal_to(expr1, expr2)

Aliases

  • gte

is_valid_json

Returns true if the input string is valid json.

Syntax

is_valid_json(expr)

less_than

Computes the element-wise less than comparison of two columns. input1 < input2

Syntax

less_than(expr1, expr2)

Aliases

  • lt

less_than_or_equal_to

Computes the element-wise less than or equal to comparison of two columns. input1 <= input2

Syntax

less_than_or_equal_to(expr1, expr2)

Aliases

  • lte

levenshtein

Returns Damerau-Levenshtein distance between two strings.

Syntax

levenshtein(expr1, expr2)

list_has_duplicate

Returns True if the list has duplicated items.

Syntax

list_has_duplicate(expr)

list_len

Returns the length of lists in a list column.

Syntax

list_len(expr)

list_most_common

Most common item in list.

Syntax

list_most_common(expr)

multiply

Multiplies the two inputs.

Syntax

multiply(expr1, expr2)

negate

Returns the negation of the input.

Syntax

negate(expr)

not

Logical not operation of a boolean column.

Syntax

not(expr)

not_equal_to

Computes the element-wise not equal to comparison of two columns.

Syntax

not_equal_to(expr1, expr2)

Aliases

  • neq

or

Logical or operation of two or more boolean columns.

Syntax

or(expr1, expr2)

rouge1

Returns the rouge1 score between two columns.

Syntax

rouge1(expr1, expr2)

rouge2

Returns the rouge2 score between two columns.

Syntax

rouge2(expr1, expr2)

rougeL

Returns the rougeL score between two columns.

Syntax

rougeL(expr1, expr2)

rougeLsum

Returns the rougeLsum score between two columns.

Syntax

rougeLsum(expr1, expr2)

sentence_count

Returns the number of sentences in a text column.

Syntax

sentence_count(expr)

Aliases

  • num_sentences

subtract

Subtracts the two inputs.

Syntax

subtract(expr1, expr2)

token_count

Returns the number of tokens in a text column.

Syntax

token_count(expr)

word_count

Returns the number of words in a text column.

Syntax

word_count(expr)

Aliases

  • num_words


Sandbox

Instructions for managing a dbnl Sandbox deployment.

The dbnl sandbox deployment bundles all of the dbnl services and dependencies into a single self-contained Docker container. This container replicates a full scale dbnl deployment by creating a Kubernetes cluster in the container and using Helm to deploy the dbnl platform and its dependencies (postgresql, redis and minio).

The sandbox deployment is not suitable for production environments.

Requirements

  • The sandbox container needs access to the following two registries to pull the containers for the dbnl platform and its dependencies.

    • us-docker.pkg.dev

    • docker.io

  • The sandbox container needs sufficient memory and disk space to schedule the k3d cluster and the containers for the dbnl platform and its dependencies.

Registry Credentials

Usage

Although the sandbox image can be deployed manually using Docker, we recommend using the dbnl CLI to manage the sandbox container. For more details on the sandbox CLI options, run:

$ dbnl sandbox --help

Start the Sandbox

To start the dbnl Sandbox, run:

$ dbnl sandbox start -p ${REGISTRY_PASSWORD}

This will start the sandbox in a Docker container named dbnl-sandbox. It will also create a Docker volume of the same name to persist data beyond the lifetime of the sandbox container.

Stop the Sandbox

To stop the dbnl sandbox, run:

$ dbnl sandbox stop

This will stop and remove the sandbox container. It does not remove the Docker volume and the next time the sandbox is started, it will remount the existing volume, persisting the data beyond the lifetime of the Sandbox container.

Get Sandbox Status

To get the status of the dbnl sandbox, run:

$ dbnl sandbox status

Get Sandbox Logs

To tail the dbnl sandbox logs, run:

$ dbnl sandbox logs

Execute Command in Sandbox

To execute a command in the dbnl sandbox, run:

$ dbnl sandbox exec [COMMAND]

This will execute COMMAND within the dbnl sandbox container. This is a useful tool for debugging the state of the containers running within the sandbox container. For example:

To get a list of all Kubernetes resources, run:

$ dbnl sandbox exec kubectl get all

To get the logs for a particular pod, run:

$ dbnl sandbox exec kubectl logs [POD]

Delete Sandbox Data

This is an irreversible action. All the sandbox data will be lost forever.

To delete the sandbox data, run:

$ dbnl sandbox delete

Authentication

The sandbox deployment uses username and password authentication with a single user. The user credentials are:

  • Username: admin

  • Password: password

Storage

The sandbox persists data in a Docker volume named dbnl-sandbox. This volume is persisted even if the sandbox is stopped, making it possible to later resume the sandbox without losing data.

Remote Sandbox

If deploying and hosting the sandbox on a remote host, such as on EC2 or Compute Engine, the sandbox --base-url option needs to be set on start.

For example, if hosting the sandbox on http://example.com:8080, the sandbox needs to be started with:

$ dbnl sandbox start --base-url http://example.com:8080

Currently, the sandbox does not support being hosted from a subpath (e.g. http://example.com:8080/dbnl) or being served from a different port. If those are required, we recommend using a reverse proxy.

Reviewing Tests

The Test Sessions section in your Project is a record of all the Test Sessions you've created. You can view a line chart of the pass rate of your Test Sessions over time or view a table with each row representing a Test Session. You can click on a point in the chart or a row in the table to navigate to the corresponding Test Session's detail page to dig into what happened within that session.

Test Session Details

When you first open a Test Session's page, you will land on the Summary tab. This tab provides you with summary information about the session such as the App Similarity Index, which tests have failed, and key insights about the session. There are also tabs to see the Similarity Report (more information below) or to view all the test results within the session.

Similarity Indexes

By default, dbnl creates an App Similarity Index test in your project. This tests that the Similarity Index for your application is over 80.

Key Insights

On the Summary tab, you'll notice a list of key insights that dbnl has discovered about your Test Session. The key insights will tell you at a glance which columns or metrics have had the most significant change in your Experiment Run when compared to the baseline. If you are particularly interested in the column or metric going forward, you can quickly add a test for its Similarity Index.

Expanding one of these will allow you to view some additional information such as a history of the Similarity Index for the related column or metric; if you are viewing a metric, it will also tell you the lineage of which columns the metric is derived from.

Similarity Report

The Similarity Report gives you an overview of all the columns in your Experiment Run, providing the relevant Similarity Indexes, the ability to quickly create tests from them, and the option to deep-dive into a column. Expanding one of the rows for a column will show you all the metrics calculated for that column, with their own respective Similarity Indexes and details.

If you click on the "See Details" link on any of these rows (or from the Key Insights view), you'll be taken to a view that lets you explore the respective column or metric in detail.

From this view, you can easily compare the changes in the column/metric with graphs and summary statistics. Expanding one of the comparison statistics will give you even more information to dig into! Click "Add Test" to quickly create a test on the related statistic.

Running Tests

You can run any tests you've created (or just the default App Similarity Index test) to investigate the behavior of your application.

Running Your Tests

When you run a Test Session, you are running your tests against a given Experiment Run.

Choose a Baseline Run

Create a Test Session

Tests are run within the context of a Test Session, which is effectively just a collection of tests run against an Experiment Run with a Baseline Run. You can create a Test Session, which will immediately run the tests, via the UI or the SDK:

Regardless of how you choose to create your Test Session, you can specify tags to choose a subset of tests to run in that given session. The following options for tags are available:

  • Include Tags: Only tests with any of these tags will be run

  • Exclude Tags: Only tests with none of these tags will be run

  • Required Tags: Only tests with every one of these tags will be run
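
As a rough sketch, tag filters might be supplied when creating a Test Session from the SDK as shown below. The keyword names include_tags and exclude_tags are hypothetical and used for illustration only; consult the create_test_session reference for the exact parameters supported by your SDK version.

import dbnl

dbnl.login()

# Illustrative run id; in practice, use the Run returned by report_run_with_results.
run = dbnl.get_run(run_id="run_abc123")

# NOTE: include_tags/exclude_tags are hypothetical keyword names for illustration.
dbnl.create_test_session(
    experiment_run=run,
    include_tags=["critical"],
    exclude_tags=["experimental"],
)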

Available Statistics and Assertions

Statistics

Statistic
Description

absolute difference of max

-

absolute difference of mean

-

absolute difference of median

-

absolute difference of min

-

absolute difference of percentile

Requires percentage as a parameter.

absolute difference of standard deviation

-

absolute difference of sum

-

Category Rank Discrepancy

Computes the absolute difference in the proportion of the specified category between the experiment and baseline runs. The category is specified by its rank in the baseline run.

Requires rank as a parameter: can be one of [most_common, second_most_common, not_top_two].

Chi-squared stat, scaled

Computes a scaled and normalized Chi-squared statistic between two nominal distributions.

Kolmogorov-Smirnov stat, scaled

Computes a scaled and normalized Kolmogorov-Smirnov statistic between two ordinal distributions.

max

-

mean

-

median

-

min

-

mode

-

Null Count

Computes the number of None values in a column.

Null Percentage

Computes the fraction of None values in a column.

percentile

Requires percentage as a parameter.

scalar

Special function for use in tests. Returns the input as a scalar value if it is a scalar and returns an error otherwise.

signed difference of max

-

signed difference of mean

-

signed difference of median

-

signed difference of min

-

signed difference of percentile

Requires percentage as a parameter.

signed difference of standard deviation

-

signed difference of sum

-

standard deviation

-

sum

-

Assertions

Assertion

between

between or equal to

close to

equal to

greater than

greater than or equal to

less than

less than or equal to

not equal to

outside

outside or equal to

Query Language

An overview of the dbnl Query Language

The dbnl Query Language is a SQL-like language that allows for querying data in runs for the purpose of drawing visualizations, defining metrics or evaluating tests.

Expressions

An expression is a combination of literals, values, operators, and functions. Expressions can evaluate to scalar or columnar values depending on their types and inputs. There are three types of expressions that can be composed into arbitrarily complex expressions.

Literal Expressions

Literal expressions are constant-valued expressions.

Column and Scalar Expressions

Column and scalar expressions are references to columns or scalar values in a run. They use dot-notation to reference a column or scalar within a run.

For example, a column named score in a run with id run_1234 can be referenced with the expression:

run_1234.score

Function Expressions

Function expressions are functions evaluated over zero or more other expressions. They make it possible to compose simple expressions into arbitrarily complex expressions.

For example, the word_count function can be used to compute the word count of the text column in a run with id run_1234 with the expression:

word_count(run_1234.text)

Operators

Operators are aliases for function expressions that enhance readability and ease of use. Operator precedence is the same as in most SQL dialects.

Arithmetic operators

Arithmetic operators provide support for basic arithmetic operations.

Comparison operators

Comparison operators provide support for common comparison operations.

Logical operators

Logical operators provide support for boolean comparisons.

Null Semantics

The dbnl Query Language follows the null semantics of most SQL dialects. With a few exceptions, when a null value is used as an input to a function or operator, the result is null.

One exception is boolean functions and operators, where ternary logic is used, similar to most SQL dialects.

Python SDK

The primary mechanism for submitting data to Distributional is through our Python SDK. This section serves as a reference for the various functionalities available for interacting with dbnl via the SDK.

Example Usage


Install docker.

Install dbnl, the dbnl CLI and Python SDK.

Within the sandbox container, k3d is used in conjunction with docker-in-docker to schedule the containers for the dbnl platform and its dependencies.

The dbnl sandbox image and the dbnl platform images are stored in a private registry. For access, please reach out to our team.

Once ready, the dbnl UI will be accessible at http://localhost:8080.

To use the dbnl Sandbox, set your API URL to http://localhost:8080/api, either through environment variables or through the SDK.

This will tail the logs from the container. This does not include the logs from the services that run on the Kubernetes cluster within the container. For those, you will need to use the exec command.

Once you've created a Test Session, you can check it out in the UI!

The Test Sessions tab on your Project page displays a summary of all past Test Sessions for the Project.
Test Session Summary

Across the Test Session page, you will see Similarity Indexes at both an "App" level as well as on each of your columns and metrics. This is a special summary score that dbnl calculates for you to help you quickly and easily understand how much your app has changed between the Experiment and Baseline Runs within the session, both holistically and at a granular level. You can define tests on any of the indexes — at the app level or on a specific metric or column. For more information, see the section "What Is a Similarity Index?".

Key Insight expansion
Test Session Similarity Report

If you haven't already, take a look at the documentation on setting a Baseline Run. All the methods for running a test will allow you to choose a Baseline Run at the time of Test Session creation, but you can also set a default.

You can choose to run the tests associated with a Project by clicking on the "Run Tests" button on your Project. This button will open up a modal that allows you to specify the Baseline and Experiment Runs, as well as the tags of the tests you would like to include or exclude from the test session.

Tests can be run via the SDK function create_test_session. Most likely, you will want to create a Test Session shortly after you've reported and closed a Run. See the Reporting Runs section for more information.

import dbnl
import pandas as pd
dbnl.login()

# More likely, you will use the run reference returned by
# report_run_with_results. See the "Reporting Runs" section in the
# docs (linked above) for more information.
run = dbnl.get_run(run_id="run_abc123")

# See the create_test_session reference documentation (linked above)
# for more options, like overriding the baseline or specifying tags
# to choose a subset of test to run
dbnl.create_test_session(
  experiment_run=run,
)

Continue on to Reviewing Tests for how to look at and interpret the results from your Test Session.


Literal expression examples:

Type | Example
boolean | true
int | 42
float | 1.0
string | 'hello world'

Arithmetic operators:

Operator | Function | Description
-a | negate(a) | Negate an input.
a * b | multiply(a, b) | Multiply two inputs.
a / b | divide(a, b) | Divide two inputs.
a + b | add(a, b) | Add two inputs.
a - b | subtract(a, b) | Subtract two inputs.

Comparison operators:

Operator | Function | Description
a = b | eq(a, b) | Equal to.
a != b | neq(a, b) | Not equal to.
a < b | lt(a, b) | Less than.
a <= b | lte(a, b) | Less than or equal to.
a > b | gt(a, b) | Greater than.
a >= b | gte(a, b) | Greater than or equal to.

Logical operators:

Operator | Function | Description
not a | not(a) | Logical not of input.
a and b | and(a, b) | Logical and of two inputs.
a or b | or(a, b) | Logical or of two inputs.

Null propagation examples:

Expression | Result
4 > null | null
null = null | null
null + 2 | null
word_count(null) | null

Ternary logic for boolean operators:

a | b | a or b | a and b | not a
true | null | true | null | false
false | null | null | false | true
null | true | true | null | null
null | false | null | false | null
null | null | null | null | null

Below is a basic working example that highlights the SDK workflow. If you have not yet installed the SDK, follow the installation instructions first.



import dbnl
import numpy as np
import pandas as pd


num_results = 500
test_data = pd.DataFrame({
    "idx": np.arange(num_results),
    "age": np.random.randint(5,95, num_results),
    "loc": np.random.choice(["NY", "CA", "FL"], num_results),
    "churn_gtruth": np.random.choice([True, False], num_results)
})

# example user ml app / model
def churn_predictor(input):
    return 1.0 / (1.0 + np.exp(-(input["age"] / 10.0 - 3.5)))

def evaluate_prediction(data):
    return (data["churn_score"] > 0.5 and data["churn_gtruth"]) or \
            (data["churn_score"] < 0.5 and not data["churn_gtruth"])

test_data["churn_score"] = test_data.apply(churn_predictor, axis=1)
test_data["pred_correct"] = test_data.apply(evaluate_prediction, axis=1)
test_data = test_data.astype({"age": "int", "loc": "category"})

# Use DBNL
dbnl.login(api_token="<COPY_PASTE_DBNL_API_TOKEN>")
proj = dbnl.get_or_create_project(name="example_churn_predictor")
run = dbnl.report_run_with_results(
    project=proj,
    column_data=test_data,
    row_id=["idx"],
)

Using Filters in Tests

Filters can be used to specify a sub-selection of rows in Runs you would like to be tested.

For example, you might want to create a test that asserts that the absolute difference of means of the correct churn predictions is <= 0.2 between Baseline and Experiment Runs, only for rows where the loc column is NY.

Apply a Filter to a Test

Once you've used one of the methods above, you can now see the new test in the Test Configuration tab of your Project.

When a Test Session is created, this test will use the defined filters to sub-select for the rows that have the loc column equal to NY.

Notifications

Notifications provide a way for users to be automatically notified about critical test events (e.g., failures or completions) via third-party tools like PagerDuty and Slack.

With Notifications you can:

  • Add critical test failure alerting to your organization’s on-call

  • Create custom notifications for specific feature tests

  • Stay informed when a new test session has started

What's in a Notification?

A Notification is composed of two major elements:

  • The Notification Channel — this contains the metadata for how and where a Notification will be sent

  • The Notification Criteria — this defines the rules for when a Notification will be generated

Setting up a Notification Channel in your Namespace

Before setting up a Notification in your project, you must have a Notification Channel set up in your Namespace. A Notification Channel describes who will be notified and how. A Notification Channel in a Namespace can be used by Notifications across all Projects belonging to that Namespace.

  1. In your desired Namespace, choose Notification Channels in the menu sidebar. Note: you must be a Namespace admin in order to do this.

  2. Click the New Notification Channel button to navigate to the creation form.

  3. Fill out the appropriate fields.

    1. Optional: If you’d like to test that your Notification Channel is set up correctly, click the Test button. If it is correctly set up, you should receive a notification through the integration you’ve selected.

  4. Click the Create Notification Channel button. Your channel will now be available when setting up your Notification.

Supported Third-Party Notification Channels

  • PagerDuty

  • Slack

Note: More coming up in the product roadmap!

Setting up a Notification in your Project

  1. Navigate to your Project and click the Notifications tab.

  1. Click the "New Notification" button to navigate to the creation form.

  1. Click the "Create Notification" button. Your Notification will now notify you when your specified criteria are met!

Notification Criteria

Trigger Event

The trigger event describes when your Notification is initiated. Trigger events are based on Test Session outcomes.

Tags Filtering

Filtering by Tags allows you to define which tests in the Test Session you care to be notified about.

There are three types of Tags filters you can provide:

Include: Must have ANY of the selected

Exclude: Must not have ANY of the selected

Require: Must have ALL of the selected

When multiple types are provided, all filters are combined using ‘AND’ logic, meaning all conditions must be met simultaneously.

Note: This field only pertains to the ‘Test Session Failed’ trigger event

Condition

The condition describes the threshold at which you care to be notified. If the condition is met, your Notification will be sent.

Note: This field only pertains to the ‘Test Session Failed’ trigger event

Organization and Namespaces

Resources in the dbnl platform are organized using organizations and namespaces.

Organization

An organization, or org for short, corresponds to a dbnl deployment.

Organization Resources

Some resources, such as users, are defined at the organization level. Those resources are sometimes referred to as organization resources or org resources.

Namespaces

A namespace is a unit of isolation within a dbnl organization.

Namespace Resources

Most resources, including projects and their related resources, are defined at the namespace level. Resources defined within a namespace are only accessible within that namespace, providing isolation between namespaces.

Default Namespace

All organizations include a namespace named default. This namespace cannot be modified or deleted.

By default, users are assigned the namespace reader role in the default namespace.

Switching Namespace

To switch namespace, use the namespace switcher in the navigation bar.

Creating a Namespace

To create a namespace, go to ☰ > Settings > Admin > Namespaces and click the + Create Namespace button.

Creating a namespace requires having the org admin role.

Adding a User to a Namespace

Access Controls

The following sections introduce the concepts used to control access to the dbnl platform:

  • Organization and Namespaces

  • Users and Permissions

  • Tokens

Self-hosted

An overview of the self-hosted deployment options

The self-hosted deployment option allows you to deploy the dbnl platform directly in your cloud or on-premise environment.

Reporting Runs

Navigate to the test creation page and create the test with the filter specified on the baseline and experiment runs.

Filter for the baseline Run:

equal_to({BASELINE}.loc, 'NY')

Filter for the experiment Run:

equal_to({EXPERIMENT}.loc, 'NY')
import dbnl
dbnl.login()

proj = dbnl.get_or_create_project(name="My Project")

dbnl.experimental.create_test(
    test_spec_dict={
        "project_id": proj.id,
        "name": "abs diff of mean of correct churn preds of NY users is within 0.2",
        "statistic_name": "abs_diff_mean",
        "statistic_params": {},
        "assertion": {
            "name": "less_than_or_equal_to",
            "params": {
                "other": 0.2
            },
        },
        "statistic_inputs": [
            {
                "select_query_template": {
                    "select": "{BASELINE}.pred_correct",
                    "filter": "equal_to({BASELINE}.loc, 'NY')"
                }
            },
            {
                "select_query_template": {
                    "select": "{EXPERIMENT}.pred_correct",
                    "filter": "equal_to({EXPERIMENT}.loc, 'NY')"
                }
            },
        ],
    }
)

In adding your Notification Channel, you will be able to select which third-party integration you'd like to be notified through.

Notification Channel creation form

Notification creation form

Set your Notification's name, criteria, and Notification Channels.

See Notification Criteria below.

You can create a test with filters in the SDK via the dbnl.experimental.create_test function:


dbnl.util

get_column_schemas_from_dataframe

dbnl.util.get_column_schemas_from_dataframe(df: DataFrame) → list[RunSchemaColumnSchemaDict]
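
A quick, illustrative usage sketch (the DataFrame and its column names here are arbitrary):

import pandas as pd

import dbnl.util

df = pd.DataFrame({
    "question": ["Is 2 an even or odd number?"],
    "word_count": [7],
}).astype({"question": "string"})

# Returns one column schema dict per DataFrame column.
schemas = dbnl.util.get_column_schemas_from_dataframe(df)
print(schemas)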

get_default_components_dag_from_column_schemas

dbnl.util.get_default_components_dag_from_column_schemas(column_schemas: Sequence[ColumnSchemaDict]) → dict[str, list[str]] | None

Gets the unconnected components DAG from a list of column schemas. If there are no components, returns None. The default components DAG is of the form {"component1": [], "component2": [], ...}.

  • Parameters:column_schemas – list of column schemas

  • Returns: dictionary of components DAG or None

get_run_config_column_schemas_from_dataframe

dbnl.util.get_run_config_column_schemas_from_dataframe(df: DataFrame) → list[RunConfigPrimitiveColumnSchemaDict | RunConfigContainerColumnSchemaDict]

get_run_config_scalar_schemas_from_dataframe

dbnl.util.get_run_config_scalar_schemas_from_dataframe(df: DataFrame) → list[RunConfigPrimitiveScalarSchemaDict | RunConfigContainerScalarSchemaDict]

get_run_schema_columns_from_dataframe

dbnl.util.get_run_schema_columns_from_dataframe(df: DataFrame) → list[RunSchemaColumnSchema]

get_run_schema_scalars_from_dataframe

dbnl.util.get_run_schema_scalars_from_dataframe(df: DataFrame) → list[RunSchemaScalarSchema]

get_scalar_schemas_from_dataframe

dbnl.util.get_scalar_schemas_from_dataframe(df: DataFrame) → list[RunSchemaScalarSchemaDict]

make_test_session_input

dbnl.util.make_test_session_input(*, run: Run | None = None, run_query: RunQuery | None = None, run_alias: str = 'EXPERIMENT') → TestSessionInput

Create a TestSessionInput object from a Run or a RunQuery. Useful for creating TestSessions right after closing a Run.

  • Parameters:

    • run – The Run to create the TestSessionInput from

    • run_query – The RunQuery to create the TestSessionInput from

    • run_alias – Alias for the Run, must be ‘EXPERIMENT’ or ‘BASELINE’, defaults to “EXPERIMENT”

  • Raises:DBNLInputValidationError – If both run and run_query are None

  • Returns: TestSessionInput object
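
A minimal usage sketch, assuming you already have a Run to reference (the run id here is illustrative):

import dbnl
from dbnl.util import make_test_session_input

dbnl.login()

run = dbnl.get_run(run_id="run_abc123")

# Build a TestSessionInput that treats this run as the experiment.
experiment_input = make_test_session_input(run=run, run_alias="EXPERIMENT")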


Helm Chart

Helm chart installation instructions

The Helm chart option separates the infrastructure and permission provisioning process from the dbnl platform deployment process, allowing you to manage the infrastructure, permissions and Helm chart using their existing processes.

Prerequisites

The following prerequisite steps are required before starting the Helm chart installation.

Infrastructure

To successfully deploy the dbnl Helm chart, you will need the following infrastructure:

  • A Kubernetes cluster

  • A PostgreSQL database

  • A Redis database

  • An object store bucket (e.g. S3 or GCS)

Configuration

To configure the dbnl Helm chart, you will need:

  • A hostname to host the dbnl platform (e.g. dbnl.example.com).

  • A set of dbnl registry credentials to pull the dbnl artifacts (e.g. Docker images, Helm chart).

  • An RSA key pair to sign the personal access tokens, which can be generated with:

openssl genrsa -out dbnl_dev_token_key.pem 2048

Requirements

To install the dbnl Helm chart, you will need:

  • kubectl

  • helm

  • Your dbnl registry credentials

Permissions

For the services deployed by the Helm chart to work as expected, they will need the following permissions and network accesses:

  • api-srv

    • Network access to the database.

    • Network access to the Redis database.

    • Permission to read, write and generate pre-signed URLs on the object store bucket.

  • worker-srv

    • Network access to the database.

    • Network access to the Redis database.

    • Permission to read and write to the object store bucket.

Installation

Steps

The steps to install the Helm chart using the Helm CLI are as follows:

  1. Create an image pull secret with your dbnl registry credentials.

kubectl create secret docker-registry dbnl-docker-cfg \
    --docker-server="us-docker.pkg.dev/dbnlai/images" \
    --docker-username="${DBNL_REGISTRY_USERNAME}" \
    --docker-password="${DBNL_REGISTRY_PASSWORD}"
  2. Create a minimal values.yaml file.

imagePullSecrets:
  - name: dbnl-docker-cfg
  
auth:
  # For more details on OIDC options, see OIDC Authentication section.
  oidc:
    enabled:   true
    issuer:    oidc.example.com
    audience:  xxxxxxxx
    clientId:  xxxxxxxx
    scopes:    "openid email profile"

db:
  host: db.example.com
  port: 5432
  username: user
  password: password
  database: database

redis:
  host: redis.example.com
  port: 6379
  username: user
  password: password

ingress:
  enabled: true
  api:
    host: dbnl.example.com
  ui:
    host: dbnl.example.com

storage:
  s3:
    enabled: true
    region: us-east-1
    bucket: example-bucket
  3. Log into the dbnl Helm registry.

helm registry login \
    --username ${DBNL_REGISTRY_USERNAME} \
    --password ${DBNL_REGISTRY_PASSWORD} \
    us-docker.pkg.dev/dbnlai/charts
  4. Install the Helm chart.

helm upgrade \
    --install \
    -f values.yaml \
    dbnl oci://us-docker.pkg.dev/dbnlai/charts/dbnl

Options

For more details on all the installation options, see the Helm chart README and values.yaml files. The chart can be inspected with:

helm show all oci://us-docker.pkg.dev/dbnlai/charts/dbnl

Data Security

An overview of data access controls.

Data for a run is split between the object store (e.g. S3, GCS) and the database.

  • Metadata (e.g. name, schema) and aggregate data (e.g. summary statistics, histograms) are stored in the database.

  • Raw data is stored in the object store.

Database

Database access is always done through the API with the API enforcing access controls to ensure users only access data for which they have permission.

Object Store

When uploading or downloading data for a run, the SDK first sends a request for a pre-signed upload or download URL to the API. The API enforces access controls, returning an error if the user is missing the necessary permissions. Otherwise, it returns a pre-signed URL which the SDK then uses to upload or download the data.

Uploading data to a run in a given namespace requires write permission to runs in that namespace. Downloading data from a run in a given namespace requires read permission to runs in that namespace.

dbnl.eval.metrics

class dbnl.eval.metrics.Metric

column_schema() → RunSchemaColumnSchemaDict

Returns the column schema for the metric to be used in a run config.

  • Returns: The column schema for the metric.

component() → str | None

description() → str | None

Returns the description of the metric.

  • Returns: Description of the metric.

abstract evaluate(df: pd.DataFrame) → pd.Series[Any]

Evaluates the metric over the provided dataframe.

  • Parameters:df – Input data from which to compute metric.

  • Returns: Metric values.

abstract expression() → str

Returns the expression representing the metric (e.g. rouge1(prediction, target)).

  • Returns: Metric expression.

greater_is_better() → bool | None

If true, larger values are assumed to be directionally better than smaller ones. If false, smaller values are assumed to be directionally better than larger ones. If None, assumes nothing.

  • Returns: True if greater is better, False if smaller is better, otherwise None.

abstract inputs() → list[str]

Returns the input column names required to compute the metric.

  • Returns: Input column names.

abstract metric() → str

Returns the metric name (e.g. rouge1).

  • Returns: Metric name.

abstract name() → str

Returns the fully qualified name of the metric (e.g. rouge1__prediction__target).

  • Returns: Metric name.

run_schema_column() → RunSchemaColumnSchema

Returns the column schema for the metric to be used in a run config.

  • Returns: The column schema for the metric.

abstract type() → Literal['boolean', 'int', 'long', 'float', 'double', 'string', 'category']

Returns the type of the metric (e.g. float)

  • Returns: Metric type.
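
As an illustration only, a custom metric could subclass Metric and implement the abstract methods listed above. This sketch assumes those are the only methods that must be provided; the question-mark-count metric and the prediction column name are hypothetical.

from typing import Any, Literal

import pandas as pd

from dbnl.eval.metrics import Metric

TEXT_COL = "prediction"  # arbitrary column name for this sketch


class QuestionMarkCount(Metric):
    """Counts '?' characters in the prediction column (illustrative only)."""

    def evaluate(self, df: pd.DataFrame) -> "pd.Series[Any]":
        # Regex count of literal question marks per row.
        return df[TEXT_COL].str.count(r"\?")

    def expression(self) -> str:
        return f"question_mark_count({TEXT_COL})"

    def inputs(self) -> list[str]:
        return [TEXT_COL]

    def metric(self) -> str:
        return "question_mark_count"

    def name(self) -> str:
        return f"question_mark_count__{TEXT_COL}"

    def type(self) -> Literal["boolean", "int", "long", "float", "double", "string", "category"]:
        return "int"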

class dbnl.eval.metrics.RougeScoreType(value)

An enumeration.

FMEASURE = 'fmeasure'

PRECISION = 'precision'

RECALL = 'recall'

answer_quality_llm_accuracy

Computes the accuracy of the answer by evaluating the accuracy score of the answer using a language model.

This metric is generated by an LLM using a specific prompt named llm_accuracy available in dbnl.eval.metrics.prompts.

  • Parameters:

    • input – input column name

    • context – context column name

    • prediction – prediction column name

    • eval_llm_client – eval_llm_client

  • Returns: accuracy metric

answer_quality_llm_answer_correctness

Returns answer correctness metric.

This metric is generated by an LLM using a specific prompt named llm_answer_correctness available in dbnl.eval.metrics.prompts.

  • Parameters:

    • input – input column name

    • prediction – prediction column name

    • target – target column name

    • eval_llm_client – eval_llm_client

  • Returns: answer correctness metric

answer_quality_llm_answer_similarity

Returns answer similarity metric.

This metric is generated by an LLM using a specific prompt named llm_answer_similarity available in dbnl.eval.metrics.prompts.

  • Parameters:

    • input – input column name

    • prediction – prediction column name

    • target – target column name

    • eval_llm_client – eval_llm_client

  • Returns: answer similarity metric

answer_quality_llm_coherence

Computes the coherence of the answer by evaluating the coherence score of the answer using a language model.

This metric is generated by an LLM using a specific prompt named llm_coherence available in dbnl.eval.metrics.prompts.

  • Parameters:

    • prediction – prediction column name

    • eval_llm_client – eval_llm_client

  • Returns: coherence metric

answer_quality_llm_commital

Computes the commital of the answer by evaluating the commital score of the answer using a language model.

This metric is generated by an LLM using a specific prompt named llm_commital available in dbnl.eval.metrics.prompts.

  • Parameters:

    • prediction – prediction column name

    • eval_llm_client – eval_llm_client

  • Returns: commital metric

answer_quality_llm_completeness

Computes the completeness of the answer by evaluating the completeness score of the answer using a language model.

This metric is generated by an LLM using a specific prompt named llm_completeness available in dbnl.eval.metrics.prompts.

  • Parameters:

    • input – input column name

    • prediction – prediction column

    • eval_llm_client – eval_llm_client

  • Returns: completeness metric

answer_quality_llm_contextual_relevance

Computes the contextual relevance of the answer by evaluating the contextual relevance score of the answer using a language model.

This metric is generated by an LLM using a specific prompt named llm_contextual_relevance available in dbnl.eval.metrics.prompts.

  • Parameters:

    • input – input column name

    • context – context column name

    • eval_llm_client – eval_llm_client

  • Returns: contextual relevance metric

answer_quality_llm_faithfulness

Returns faithfulness metric.

This metric is generated by an LLM using a specific prompt named llm_faithfulness available in dbnl.eval.metrics.prompts.

  • Parameters:

    • input – input column name

    • context – context column name

    • prediction – prediction column name

    • eval_llm_client – eval_llm_client

  • Returns: faithfulness metric

answer_quality_llm_grammar_accuracy

Computes the grammar accuracy of the answer by evaluating the grammar accuracy score of the answer using a language model.

This metric is generated by an LLM using a specific prompt named llm_grammar_accuracy available in dbnl.eval.metrics.prompts.

  • Parameters:

    • prediction – prediction column name

    • eval_llm_client – eval_llm_client

  • Returns: grammar accuracy metric

answer_quality_llm_metrics

Returns a set of metrics which evaluate the quality of the generated answer. This does not include metrics that require a ground truth.

  • Parameters:

    • input – input column name (i.e. question)

    • prediction – prediction column name (i.e. generated answer)

    • context – context column name (i.e. document or set of documents retrieved)

    • eval_llm_client – eval_llm_client

  • Returns: list of metrics

answer_quality_llm_originality

Computes the originality of the answer by evaluating the originality score of the answer using a language model.

This metric is generated by an LLM using a specific prompt named llm_originality available in dbnl.eval.metrics.prompts.

  • Parameters:

    • prediction – prediction column name

    • eval_llm_client – eval_llm_client

  • Returns: originality metric

answer_quality_llm_relevance

Returns relevance metric with context.

This metric is generated by an LLM using a specific prompt named llm_relevance available in dbnl.eval.metrics.prompts.

  • Parameters:

    • input – input column name

    • context – context column name

    • prediction – prediction column name

    • eval_llm_client – eval_llm_client

  • Returns: answer relevance metric with context

answer_viability_llm_metrics

Returns a list of metrics relevant for a question and answer task.

  • Parameters:

    • prediction – prediction column name (i.e. generated answer)

    • eval_llm_client – eval_llm_client

  • Returns: list of metrics

answer_viability_llm_reading_complexity

Computes the reading complexity of the answer by evaluating the reading complexity score of the answer using a language model.

This metric is generated by an LLM using a specific prompt named llm_reading_complexity available in dbnl.eval.metrics.prompts.

  • Parameters:

    • prediction – prediction column name

    • eval_llm_client – eval_llm_client

  • Returns: reading complexity metric

answer_viability_llm_sentiment_assessment

Computes the sentiment of the answer by evaluating the sentiment assessment score of the answer using a language model.

This metric is generated by an LLM using a specific prompt named llm_sentiment_assessment available in dbnl.eval.metrics.prompts.

  • Parameters:

    • prediction – prediction column name

    • eval_llm_client – eval_llm_client

  • Returns: sentiment assessment metric

answer_viability_llm_text_fluency

Computes the text fluency of the answer by evaluating the perplexity of the answer using a language model.

This metric is generated by an LLM using a specific prompt named llm_text_fluency available in dbnl.eval.metrics.prompts.

  • Parameters:

    • prediction – prediction column name

    • eval_llm_client – eval_llm_client

  • Returns: text fluency metric

answer_viability_llm_text_toxicity

Computes the toxicity of the answer by evaluating the toxicity score of the answer using a language model.

This metric is generated by an LLM using a specific prompt named llm_text_toxicity available in dbnl.eval.metrics.prompts.

  • Parameters:

    • prediction – prediction column name

    • eval_llm_client – eval_llm_client

  • Returns: toxicity metric

automated_readability_index

Returns the Automated Readability Index metric for the text_col_name column.

Calculates the Automated Readability Index (ARI) for a given text. ARI is a readability metric that estimates the U.S. school grade level necessary to understand the text, based on the number of characters per word and words per sentence.

  • Parameters:text_col_name – text column name

  • Returns: automated_readability_index metric

bleu

Returns the bleu metric between the prediction and target columns.

The BLEU score is a metric for evaluating a generated sentence to a reference sentence. The BLEU score is a number between 0 and 1, where 1 means that the generated sentence is identical to the reference sentence.

  • Parameters:

    • prediction – prediction column name

    • target – target column name

  • Returns: bleu metric

character_count

Returns the character count metric for the text_col_name column.

  • Parameters:text_col_name – text column name

  • Returns: character_count metric

context_hit

Returns the context hit metric.

This boolean-valued metric is used to evaluate whether the ground truth document is present in the list of retrieved documents. The context hit metric is 1 if the ground truth document is present in the list of retrieved documents, and 0 otherwise.

  • Parameters:

    • ground_truth_document_id – ground_truth_document_id column name

    • retrieved_document_ids – retrieved_document_ids column name

  • Returns: context hit metric
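
Illustratively, the per-row logic reduces to a membership check; this is a sketch of the idea, not the actual implementation.

# Sketch only: 1 if the ground truth document id appears in the retrieved ids.
def context_hit_value(ground_truth_document_id, retrieved_document_ids):
    return int(ground_truth_document_id in retrieved_document_ids)

context_hit_value("doc_7", ["doc_3", "doc_7"])  # 1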

count_metrics

Returns a set of metrics relevant for a question and answer task.

  • Parameters:text_col_name – text column name

  • Returns: list of metrics

flesch_kincaid_grade

Returns the Flesch-Kincaid Grade metric for the text_col_name column.

Calculates the Flesch-Kincaid Grade Level for a given text. The Flesch-Kincaid Grade Level is a readability metric that estimates the U.S. school grade level required to understand the text. It is based on the average number of syllables per word and words per sentence.

  • Parameters:text_col_name – text column name

  • Returns: flesch_kincaid_grade metric

ground_truth_non_llm_answer_metrics

Returns a set of metrics relevant for a question and answer task.

  • Parameters:

    • prediction – prediction column name (i.e. generated answer)

    • target – target column name (i.e. expected answer)

  • Returns: list of metrics

ground_truth_non_llm_retrieval_metrics

Returns a set of metrics relevant for a question and answer task.

  • Parameters:

    • ground_truth_document_id – ground_truth_document_id column name

    • retrieved_document_ids – retrieved_document_ids column name

  • Returns: list of metrics

inner_product_retrieval

Returns the inner product metric between the ground_truth_document_text and top_retrieved_document_text columns.

This metric is used to evaluate the similarity between the ground truth document and the top retrieved document using the inner product of their embeddings. The embedding client is used to retrieve the embeddings for the ground truth document and the top retrieved document. An embedding is a high-dimensional vector representation of a string of text.

  • Parameters:

    • ground_truth_document_text – ground_truth_document_text column name

    • top_retrieved_document_text – top_retrieved_document_text column name

    • embedding_client – embedding client

  • Returns: inner product metric

inner_product_target_prediction

Returns the inner product metric between the prediction and target columns.

This metric is used to evaluate the similarity between the prediction and target columns using the inner product of their embeddings. The embedding client is used to retrieve the embeddings for the prediction and target columns. An embedding is a high-dimensional vector representation of a string of text.

  • Parameters:

    • prediction – prediction column name

    • target – target column name

    • embedding_client – embedding client

  • Returns: inner product metric

levenshtein

Returns the levenshtein metric between the prediction and target columns.

The Levenshtein distance is a metric for evaluating the similarity between two strings. The Levenshtein distance is an integer value, where 0 means that the two strings are identical, and a higher value returns the number of edits required to transform one string into the other.

  • Parameters:

    • prediction – prediction column name

    • target – target column name

  • Returns: levenshtein metric

mrr

Returns the mean reciprocal rank (MRR) metric.

This metric is used to evaluate the quality of a ranked list of documents. The MRR score is a number between 0 and 1, where 1 means that the ground truth document is ranked first in the list. The MRR score is calculated by taking the reciprocal of the rank of the first relevant document in the list.

  • Parameters:

    • ground_truth_document_id – ground_truth_document_id column name

    • retrieved_document_ids – retrieved_document_ids column name

  • Returns: mrr metric
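
As a sketch of the arithmetic (not the actual implementation), the reciprocal rank for a single row could be computed as follows, assuming retrieved_document_ids is an ordered list of ids.

# Sketch only: reciprocal of the rank of the ground truth document, or 0 if absent.
def reciprocal_rank(ground_truth_document_id, retrieved_document_ids):
    for rank, doc_id in enumerate(retrieved_document_ids, start=1):
        if doc_id == ground_truth_document_id:
            return 1.0 / rank
    return 0.0

reciprocal_rank("doc_7", ["doc_3", "doc_7", "doc_1"])  # 0.5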

non_llm_non_ground_truth_metrics

Returns a set of metrics relevant for a question and answer task.

  • Parameters:prediction – prediction column name (i.e. generated answer)

  • Returns: list of metrics

quality_llm_text_similarity

Computes the similarity of the prediction and target text using a language model.

This metric is generated by an LLM using a specific prompt named llm_text_similarity available in dbnl.eval.metrics.prompts.

  • Parameters:

    • prediction – prediction column name

    • eval_llm_client – eval_llm_client

  • Returns: similarity metric

question_and_answer_metrics

Returns a set of metrics relevant for a question and answer task.

  • Parameters:

    • prediction – prediction column name (i.e. generated answer)

    • target – target column name (i.e. expected answer)

    • input – input column name (i.e. question)

    • context – context column name (i.e. document or set of documents retrieved)

    • ground_truth_document_id – ground_truth_document_id containing the information in the target

    • retrieved_document_ids – retrieved_document_ids containing the full context

    • ground_truth_document_text – text containing the information in the target (ideal is for this to be the top retrieved document)

    • top_retrieved_document_text – text of the top retrieved document

    • eval_llm_client – eval_llm_client

    • eval_embedding_client – eval_embedding_client

  • Returns: list of metrics

question_and_answer_metrics_extended

Returns a set of all metrics relevant for a question and answer task.

  • Parameters:

    • prediction – prediction column name (i.e. generated answer)

    • target – target column name (i.e. expected answer)

    • input – input column name (i.e. question)

    • context – context column name (i.e. document or set of documents retrieved)

    • ground_truth_document_id – ground_truth_document_id containing the information in the target

    • retrieved_document_ids – retrieved_document_ids containing the full context

    • ground_truth_document_text – text containing the information in the target (ideal is for this to be the top retrieved document)

    • top_retrieved_document_text – text of the top retrieved document

    • eval_llm_client – eval_llm_client

    • eval_embedding_client – eval_embedding_client

  • Returns: list of metrics

rouge1

Returns the rouge1 metric between the prediction and target columns.

ROUGE-1 is a recall-oriented metric that calculates the overlap of unigrams (individual words) between the predicted/generated summary and the reference summary. It measures how many single words from the reference summary appear in the predicted summary. ROUGE-1 focuses on basic word-level similarity and is used to evaluate the content coverage.

  • Parameters:

    • prediction – prediction column name

    • target – target column name

  • Returns: rouge1 metric

rouge2

Returns the rouge2 metric between the prediction and target columns.

ROUGE-2 is a recall-oriented metric that calculates the overlap of bigrams (pairs of consecutive words) between the predicted/generated summary and the reference summary. It measures how many word pairs from the reference summary appear in the predicted summary. ROUGE-2 captures phrase-level similarity and is used to evaluate content coverage.

  • Parameters:

    • prediction – prediction column name

    • target – target column name

  • Returns: rouge2 metric

rougeL

Returns the rougeL metric between the prediction and target columns.

ROUGE-L is a recall-oriented metric based on the Longest Common Subsequence (LCS) between the reference and generated summaries. It measures how well the generated summary captures the longest sequences of words that appear in the same order in the reference summary. This metric accounts for sentence-level structure and coherence.

  • Parameters:

    • prediction – prediction column name

    • target – target column name

  • Returns: rougeL metric

rougeLsum

Returns the rougeLsum metric between the prediction and target columns.

ROUGE-LSum is a variant of ROUGE-L that applies the Longest Common Subsequence (LCS) at the sentence level for summarization tasks. It evaluates how well the generated summary captures the overall sentence structure and important elements of the reference summary by computing the LCS for each sentence in the document.

  • Parameters:

    • prediction – prediction column name

    • target – target column name

  • Returns: rougeLsum metric

rouge_metrics

Returns all rouge metrics between the prediction and target columns.

  • Parameters:

    • prediction – prediction column name

    • target – target column name

  • Returns: list of rouge metrics
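
A minimal sketch, assuming hypothetical column names:

from dbnl.eval import metrics

# Returns the rouge1, rouge2, rougeL and rougeLsum metrics as a list.
rouge = metrics.rouge_metrics(prediction="generated_summary", target="reference_summary")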

sentence_count

Returns the sentence count metric for the text_col_name column.

  • Parameters:text_col_name – text column name

  • Returns: sentence_count metric

summarization_metrics

Returns a set of metrics relevant for a summarization task.

  • Parameters:

    • prediction – prediction column name (i.e. generated summary)

    • target – target column name (i.e. expected summary)

    • eval_embedding_client – optional eval_embedding_client

  • Returns: list of metrics

text_metrics

Returns a set of metrics relevant for a generic text application

  • Parameters:

    • prediction – prediction column name (i.e. generated text)

    • target – target column name (i.e. expected text)

    • eval_llm_client – optional eval_llm_client

    • eval_embedding_client – optional eval_embedding_client

  • Returns: list of metrics

text_monitor_metrics

Returns a set of metrics for monitoring the given text columns.

  • Parameters:

    • columns – list of text column names

    • eval_llm_client – optional eval_llm_client

  • Returns: list of metrics

token_count

Returns the token count metric for the text_col_name column.

A token is a sequence of characters that represents a single unit of meaning, such as a word or punctuation mark. The token count metric calculates the total number of tokens in the text. Different languages may have different tokenization rules. This function is implemented using the spaCy library.

  • Parameters:text_col_name – text column name

  • Returns: token_count metric

word_count

Returns the word count metric for the text_col_name column.

  • Parameters:text_col_name – text column name

  • Returns: word_count metric

LLM-as-judge and Embedding Metrics

A common strategy for evaluating unstructured text applications is to use other LLMs and text embedding models to drive metrics of interest.

Supported LLM and model services

The following examples show how to initialize an eval_llm_client and an eval_embedding_client with different providers.

OpenAI

Azure OpenAI

TogetherAI (or other OpenAI compatible service / endpoints)

Missing Metric Values

It is possible for some of the LLM-as-judge metrics to occasionally return values that cannot be parsed. These metric values will surface as None.

Distributional accepts DataFrames containing None values; the platform will filter them out where applicable.
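
For instance, a results DataFrame with a missing metric value can be reported as-is. A minimal sketch, with a hypothetical project name and columns:

import dbnl
import pandas as pd

dbnl.login()
project = dbnl.get_or_create_project(name="none-values-example")

# The None in llm_text_similarity (e.g. an unparseable judge response)
# is accepted and filtered by the platform where applicable.
column_data = pd.DataFrame({
    "answer": ["yes", "no", "maybe"],
    "llm_text_similarity": [0.9, None, 0.4],
})
run = dbnl.report_run_with_results(project=project, column_data=column_data)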

Throughput and Rate Limits

LLM service providers often impose request rate limits and token throughput caps. Some example errors that one might encounter are shown below:

In the event you experience these errors, please work with your LLM service provider to adjust your limits. Additionally, feel free to reach out to Distributional support with the issue you are seeing.
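
One common client-side mitigation, sketched here with the OpenAI python client rather than any dbnl API, is to retry with exponential backoff when a rate-limit error is raised:

import time

from openai import OpenAI, RateLimitError

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def embed_with_backoff(texts, model="text-embedding-ada-002", max_retries=5):
    # Retry embedding requests with exponential backoff on rate-limit errors.
    for attempt in range(max_retries):
        try:
            return client.embeddings.create(model=model, input=texts)
        except RateLimitError:
            time.sleep(2 ** attempt)  # back off 1s, 2s, 4s, ...
    raise RuntimeError("Still rate limited after retries")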

dbnl.experimental

create_test

Create a new Test Spec

  • Parameters:test_spec_dict – A dictionary containing the Test Spec schema.

  • Raises:

    • DBNLNotLoggedInError – dbnl SDK is not logged in.

    • DBNLAPIValidationError – Test Spec does not conform to expected format.

    • DBNLDuplicateError – Test Spec with the same name already exists in the Project.

  • Returns: The JSON dict of the created Test Spec object. The return JSON will contain the id of the Test Spec.

Test Spec JSON Structure

create_test_generation_session

Create a Test Generation Session

  • Parameters:

    • run – The Run to use when generating tests.

    • columns – List of columns in the Run to generate tests for. If None, all columns in the Run will be used, defaults to None. If a list of strings, each string is a column name. If a list of dictionaries, each dictionary must have a ‘name’ key, and the value is the column name.

  • Raises:

    • DBNLNotLoggedInError – dbnl SDK is not logged in.

    • DBNLInputValidationError – arguments do not conform to expected format.

  • Returns: The TestGenerationSession that was created.

Examples:

create_test_recalibration_session

Create a Test Recalibration Session by redefining the expected output for tests in a Test Session

  • Parameters:

    • test_session – Test Session to recalibrate

    • feedback – Feedback for the recalibration. Can be ‘PASS’ or ‘FAIL’.

    • test_ids – List of test IDs to recalibrate, defaults to None. If None, all tests in the Test Session will be recalibrated.

  • Raises:

    • DBNLNotLoggedInError – dbnl SDK is not logged in.

    • DBNLInputValidationError – arguments do not conform to expected format.

  • Returns: Test Recalibration Session

IMPORTANT

If some generated Tests failed when they should have passed and others passed when they should have failed, you will need to submit two separate calls, one for each feedback value, as sketched below.
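
A minimal sketch of the two calls, assuming a hypothetical project name and hypothetical test IDs:

import dbnl

dbnl.login()

project = dbnl.get_or_create_project(name="example-project")
test_session = dbnl.experimental.get_test_sessions(project=project)[0]

# Tests that should have passed
dbnl.experimental.create_test_recalibration_session(
    test_session=test_session,
    feedback="PASS",
    test_ids=["test_aaa", "test_bbb"],
)
# Tests that should have failed
dbnl.experimental.create_test_recalibration_session(
    test_session=test_session,
    feedback="FAIL",
    test_ids=["test_ccc"],
)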

get_or_create_tag

Get the specified Test Tag or create a new one if it does not exist

  • Parameters:

    • project_id – The id of the Project that this Test Tag is associated with.

    • name – The name of the Test Tag to create or retrieve.

    • description – An optional description of the Test Tag. Limited to 255 characters.

  • Returns: The dictionary containing the Test Tag

  • Raises:DBNLNotLoggedInError – dbnl SDK is not logged in.
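
A minimal usage sketch; the project and tag names here are hypothetical:

import dbnl

dbnl.login()

project = dbnl.get_or_create_project(name="tagging-example")
tag = dbnl.experimental.get_or_create_tag(
    project_id=project.id,
    name="regression",
    description="Tests guarding against known regressions",
)
print(tag["id"])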

Sample Test Tag JSON

get_test_sessions

Get all Test Sessions in the given Project

  • Parameters:project – Project from which to retrieve Test Sessions

  • Returns: List of Test Sessions

  • Raises:DBNLNotLoggedInError – dbnl SDK is not logged in.

get_tests

Get all Tests executed in the given Test Session

  • Parameters:test_session_id – Test Session ID

  • Returns: List of test JSONs

  • Raises:DBNLNotLoggedInError – dbnl SDK is not logged in.
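
A minimal sketch of listing the failed tests in a Test Session, assuming a hypothetical project name and that TestSession objects expose an id attribute:

import dbnl

dbnl.login()

project = dbnl.get_or_create_project(name="example-project")
test_session = dbnl.experimental.get_test_sessions(project=project)[0]

tests = dbnl.experimental.get_tests(test_session_id=test_session.id)
failed = [t["name"] for t in tests if t["status"] == "FAILED"]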

Sample Test JSON

prepare_incomplete_test_spec_payload

Formats a Test Spec payload for the API: adds project_id if it is not present and replaces tag_names with tag_ids.

  • Parameters:

    • test_spec_dict – A dictionary containing the Test Spec schema.

    • project_id – The Project ID, defaults to None. If project_id does not exist in test_spec_dict, it is required as an argument.

  • Raises:DBNLInputValidationError – Input does not conform to expected format

  • Returns: The dictionary containing the newly formatted Test Spec payload.

wait_for_test_generation_session

Wait for a Test Generation Session to finish. Polls every 3 seconds until it is completed.

  • Parameters:

    • test_generation_session – The TestGenerationSession to wait for.

    • timeout_s – The total wait time (in seconds) for Test Generation Session to complete, defaults to 180.

  • Raises:

    • DBNLNotLoggedInError – dbnl SDK is not logged in.

    • DBNLError – Test Generation Session did not complete after waiting for the timeout_s seconds

  • Returns: The completed TestGenerationSession

wait_for_test_recalibration_session

Wait for a Test Recalibration Session to finish. Polls every 3 seconds until it is completed.

  • Parameters:

    • test_recalibration_session – The TestRecalibrationSession to wait for.

    • timeout_s – The total wait time (in seconds) for Test Recalibration Session to complete, defaults to 180.

  • Returns: The completed TestRecalibrationSession

  • Raises:

    • DBNLNotLoggedInError – dbnl SDK is not logged in.

    • DBNLError – Test Recalibration Session did not complete after waiting for the timeout_s seconds

wait_for_test_session

Wait for a Test Session to finish. Polls every 3 seconds until it is completed.

  • Parameters:

    • test_session – The TestSession to wait for

    • timeout_s – The total wait time (in seconds) for Test Session to complete, defaults to 180.

  • Returns: The completed TestSession

  • Raises:

    • DBNLNotLoggedInError – dbnl SDK is not logged in.

    • DBNLError – Test Session did not complete after waiting for the timeout_s seconds

dbnl

close_run

Mark the specified dbnl Run status as closed. A closed run is finalized and considered complete. Once a Run is marked as closed, it can no longer be used for reporting Results.

Note that the Run will not be closed immediately. It will transition into a closing state and will be closed in the background. If wait_for_close is set to True, the function will block for up to 3 minutes until the Run is closed.

  • Parameters:

    • run – The Run to be closed

    • wait_for_close – If True, the function will block for up to 3 minutes until the Run is closed, defaults to True

  • Raises:

    • DBNLNotLoggedInError – dbnl SDK is not logged in

    • DBNLInputValidationError – Input does not conform to expected format

    • DBNLError – Run did not close after waiting for 3 minutes

IMPORTANT

A run must be closed for uploaded results to be shown on the UI.

copy_project

Copy a Project. This is a convenience method that wraps exporting a Project and importing it with a new name and description.

  • Parameters:

    • project – The project to copy

    • name – A name for the new Project

    • description – An optional description for the new Project. Description is limited to 255 characters.

  • Raises:

    • DBNLNotLoggedInError – dbnl SDK is not logged in

    • DBNLInputValidationError – Input does not conform to expected format

    • DBNLConflictingProjectError – Project with the same name already exists

  • Returns: The newly created Project

Examples:

create_metric

Create a new DBNL Metric

  • Parameters:

    • project – DBNL Project to create the Metric for

    • name – Name for the Metric

    • expression_template – Expression template string e.g. token_count({RUN}.question)

    • description – Optional description of what computation the metric is performing

    • greater_is_better – Flag indicating whether greater values are semantically ‘better’ than lesser values

  • Raises:

    • DBNLNotLoggedInError – dbnl SDK is not logged in

    • DBNLInputValidationError – Input does not conform to expected format

  • Returns: Created Metric
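
A minimal sketch, reusing the expression template from the parameter description above; the project and metric names are hypothetical:

import dbnl

dbnl.login()

project = dbnl.get_or_create_project(name="metrics-example")
metric = dbnl.create_metric(
    project=project,
    name="question_token_count",
    expression_template="token_count({RUN}.question)",
    description="Number of tokens in the question column",
    greater_is_better=False,
)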

create_project

Create a new Project

  • Parameters:

    • name – Name for the Project

    • description – Description for the DBNL Project, defaults to None. Description is limited to 255 characters.

  • Raises:

    • DBNLNotLoggedInError – dbnl SDK is not logged in

    • DBNLAPIValidationError – DBNL API failed to validate the request

    • DBNLConflictingProjectError – Project with the same name already exists

  • Returns: Project

Examples:

create_run

Create a new Run

  • Parameters:

    • project – The Project this Run is associated with.

    • run_schema – The schema for data that will be associated with this run. DBNL will validate data you upload against this schema.

    • display_name – An optional display name for the Run, defaults to None. display_name does not have to be unique.

    • metadata – Additional key-value pairs you want to track, defaults to None.

    • run_config – (Deprecated) Do not use. Use run_schema instead.

  • Raises:

    • DBNLNotLoggedInError – dbnl SDK is not logged in

    • DBNLInputValidationError – Input does not conform to expected format

  • Returns: Newly created Run
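
A minimal end-to-end sketch of creating a Run, reporting column results and closing it; the project name and metadata values are hypothetical:

import dbnl
import pandas as pd

dbnl.login()

project = dbnl.get_or_create_project(name="create-run-example")
schema = dbnl.create_run_schema(columns=[{"name": "error", "type": "float"}])
run = dbnl.create_run(
    project=project,
    run_schema=schema,
    display_name="nightly-eval",
    metadata={"git_sha": "abc123"},
)
dbnl.report_column_results(run=run, data=pd.DataFrame({"error": [0.1, 0.2]}))
dbnl.close_run(run=run)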

create_run_config

(Deprecated) Please see create_run_schema instead.

  • Parameters:

    • project – DBNL Project this RunConfig is associated to

    • columns – List of column schema specs for the uploaded data. Required keys: name and type; optional keys: component, description and greater_is_better. type can be int, float, category, boolean, or string. component is a string that indicates the source of the data, e.g. “component”: “sentiment-classifier” or “component”: “fraud-predictor”; specified components must be present in the components_dag dictionary. greater_is_better is a boolean that indicates whether larger values are better than smaller ones; False indicates smaller values are better, and None indicates no preference. An example RunConfig columns: columns=[{“name”: “pred_proba”, “type”: “float”, “component”: “fraud-predictor”}, {“name”: “decision”, “type”: “boolean”, “component”: “threshold-decision”}, {“name”: “error_type”, “type”: “category”}]

    • scalars – List of scalar schema specs for the uploaded data. Required keys: name and type; optional keys: component, description and greater_is_better; the same rules as for columns apply. An example RunConfig scalars: scalars=[{“name”: “accuracy”, “type”: “float”, “component”: “fraud-predictor”}, {“name”: “error_type”, “type”: “category”}]

    • description – Description for the DBNL RunConfig, defaults to None. Description is limited to 255 characters.

    • display_name – Display name for the RunConfig, defaults to None. display_name does not have to be unique.

    • row_id – List of column names that are the unique identifier, defaults to None.

    • components_dag – Optional dictionary representing the DAG of components, defaults to None. eg : {“fraud-predictor”: [‘threshold-decision”], “threshold-decision”: []},

  • Raises:

    • DBNLNotLoggedInError – dbnl SDK is not logged in

    • DBNLInputValidationError – Input does not conform to expected format

  • Returns: RunConfig with the desired columns schema

create_run_config_from_results

(Deprecated) Please see create_run_schema_from_results instead.

  • Parameters:

    • project – DBNL Project to create the RunConfig for

    • column_data – DataFrame with the results for the columns

    • scalar_data – Dictionary or DataFrame with the results for the scalars, defaults to None

    • description – Description for the RunConfig, defaults to None

    • display_name – Display name for the RunConfig, defaults to None

    • row_id – List of column names that are the unique identifier, defaults to None

  • Raises:

    • DBNLNotLoggedInError – dbnl SDK is not logged in

    • DBNLInputValidationError – Input does not conform to expected format

  • Returns: RunConfig with the desired schema for columns and scalars, if provided

create_run_query

Create a new RunQuery for a project to use as a baseline Run. Currently supports key=”offset_from_now” with value as a positive integer, representing the number of runs to go back for the baseline. For example, query={“offset_from_now”: 1} will use the latest run as the baseline, so that each run compares against the previous run.

  • Parameters:

    • project – The Project to create the RunQuery for

    • name – A name for the RunQuery

    • query – A dict describing how to find a Run dynamically. Currently, only supports “offset_from_now”: int as a key-value pair.

  • Raises:

    • DBNLNotLoggedInError – dbnl SDK is not logged in

    • DBNLInputValidationError – Input does not conform to expected format

  • Returns: A new dbnl RunQuery, typically used for finding a Dynamic Baseline for a Test Session

Examples:

create_run_schema

Create a new RunSchema

  • Parameters:

    • columns – List of column schema specs for the uploaded data, required keys name and type, optional keys component, description and greater_is_better.

    • scalars – List of scalar schema specs for the uploaded data, required keys name and type, optional keys component, description and greater_is_better.

    • index – Optional list of column names that are the unique identifier.

    • components_dag – Optional dictionary representing the DAG of components.

  • Returns: The RunSchema

Supported Types

  • int

  • float

  • boolean

  • string

  • category

  • list

Components

The optional component key specifies the source of a data column in relation to the AI/ML app's subcomponents. Components are used to visualize the components DAG.

The components_dag dictionary specifies the topological layout of the AI/ML app. For each key-value pair, the key is the source component and the value is the list of its downstream (leaf) components. The components_dag snippet in the examples below describes such a DAG.

Examples:

Basic

With `scalars`, `index`, and `components_dag`

create_run_schema_from_results

Create a new RunSchema from the column results, as well as scalar results if provided

  • Parameters:

    • column_data – A pandas DataFrame with all the column results for which we want to generate a RunSchema.

    • scalar_data – A dict or pandas DataFrame with all the scalar results for which we want to generate a RunSchema.

    • index – An optional list of the column names that can be used as unique identifiers.

  • Raises:DBNLInputValidationError – Input does not conform to expected format

  • Returns: The RunSchema based on the provided results

Examples:

create_test_session

Create a new TestSession with the given Run as the Experiment Run, and the given Run or RunQuery as the baseline if provided

  • Parameters:

    • experiment_run – The Run to create the TestSession for

    • baseline – The Run or RunQuery to use as the Baseline Run, defaults to None. If None, the Baseline set for the Project is used.

    • include_tags – Optional list of Test Tag names to include in the Test Session.

    • exclude_tags – Optional list of Test Tag names to exclude from the Test Session.

    • require_tags – Optional list of Test Tag names to require in the Test Session.

  • Raises:

    • DBNLNotLoggedInError – dbnl SDK is not logged in

    • DBNLInputValidationError – Input does not conform to expected format

  • Returns: The newly created TestSession

Calling this will start evaluating Tests associated with a Run. Typically, the Run you just completed will be the “Experiment” and you’ll compare it to some earlier “Baseline Run”.

IMPORTANT

Referenced Runs must already be closed before a Test Session can begin.

Managing Tags

Suppose we have the following Tests with the associated Tags in our Project

  • Test1 with tags [“A”, “B”]

  • Test2 with tags [“A”]

  • Test3 with tags [“B”]

include_tags=[“A”, “B”] will trigger Tests 1, 2, and 3. require_tags=[“A”, “B”] will only trigger Test 1. exclude_tags=[“A”] will only trigger Test 3. include_tags=[“A”] and exclude_tags=[“B”] will only trigger Test 2.
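
For instance, a sketch of evaluating only the Tests tagged with both “A” and “B”; the run ID is hypothetical and the Run must already be closed:

import dbnl

dbnl.login()

run = dbnl.get_run(run_id="run_0000000")
# Only Test1, which carries both tags, will be evaluated.
dbnl.create_test_session(experiment_run=run, require_tags=["A", "B"])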

Examples:

delete_metric

Delete a DBNL Metric by ID

  • Parameters:metric_id – ID of the metric to delete

  • Raises:

    • DBNLNotLoggedInError – dbnl SDK is not logged in

    • DBNLAPIValidationError – DBNL API failed to validate the request

  • Returns: None

export_project_as_json

Export a Project alongside its Test Specs, Tags, and Notification Rules as a JSON object

  • Parameters:project – The Project to export as JSON.

  • Raises:DBNLNotLoggedInError – dbnl SDK is not logged in

  • Returns: JSON object representing the Project

Sample Project JSON

Examples:

get_column_results

Get column results for a Run

  • Parameters:run – The Run from which to retrieve the results.

  • Raises:

    • DBNLNotLoggedInError – dbnl SDK is not logged in

    • DBNLInputValidationError – Input does not conform to expected format

    • DBNLDownloadResultsError – Failed to download results (e.g. Run is not closed)

  • Returns: A pandas DataFrame of the column results for the Run.

IMPORTANT

You can only retrieve results for a Run that has been closed.

Examples:

get_latest_run

Get the latest Run for a project

  • Raises:

    • DBNLNotLoggedInError – dbnl SDK is not logged in

    • DBNLResourceNotFoundError – Run not found

  • Parameters:project – The Project to get the latest Run for

  • Returns: The latest Run

get_latest_run_config

(Deprecated) Please see get_latest_run and access the schema attribute instead.

  • Raises:

    • DBNLNotLoggedInError – dbnl SDK is not logged in

    • DBNLResourceNotFoundError – RunConfig not found

  • Parameters:project – DBNL Project to get the latest RunConfig for

  • Returns: Latest RunConfig

get_metric_by_id

Get a DBNL Metric by ID

  • Parameters:metric_id – ID of the metric to get

  • Raises:

    • DBNLNotLoggedInError – dbnl SDK is not logged in

    • DBNLAPIValidationError – DBNL API failed to validate the request

  • Returns: The requested metric

get_my_namespaces

Get all the namespaces that the user has access to

  • Raises:DBNLNotLoggedInError – dbnl SDK is not logged in

  • Returns: List of namespaces

get_or_create_project

Get the Project with the specified name or create a new one if it does not exist

  • Parameters:

    • name – Name for the Project

    • description – Description for the DBNL Project, defaults to None

  • Raises:

    • DBNLNotLoggedInError – dbnl SDK is not logged in

    • DBNLAPIValidationError – DBNL API failed to validate the request

  • Returns: Newly created or matching existing Project

Examples:

get_project

Retrieve a Project by name.

  • Parameters:name – The name for the existing Project.

  • Raises:

    • DBNLNotLoggedInError – dbnl SDK is not logged in

    • DBNLProjectNotFoundError – Project with the given name does not exist.

  • Returns: Project

Examples:

get_results

Get all results for a Run

  • Parameters:run – The Run from which to retrieve the results.

  • Raises:

    • DBNLNotLoggedInError – dbnl SDK is not logged in

    • DBNLInputValidationError – Input does not conform to expected format

    • DBNLDownloadResultsError – Failed to download results (e.g. Run is not closed)

  • Returns: A named tuple comprised of columns and scalars fields. These are the pandas DataFrames of the uploaded data for the Run.

IMPORTANT

You can only retrieve results for a Run that has been closed.

Examples:

get_run

Retrieve a Run with the given ID

  • Parameters:run_id – The ID of the dbnl Run. Run ID starts with the prefix run_. Run ID can be found at the Run detail page.

  • Raises:

    • DBNLNotLoggedInError – dbnl SDK is not logged in

    • DBNLInputValidationError – Input does not conform to expected format

    • DBNLRunNotFoundError – A Run with the given ID does not exist.

  • Returns: The Run with the given run_id.

Examples:

get_run_config

(Deprecated) Please access Run.schema instead.

  • Parameters:run_config_id – The ID of the DBNL RunConfig to retrieve

  • Raises:

    • DBNLNotLoggedInError – dbnl SDK is not logged in

    • DBNLInputValidationError – Input does not conform to expected format

  • Returns: RunConfig with the given run_config_id

get_run_config_from_latest_run

(Deprecated) Please see get_latest_run and access the schema attribute instead.

  • Raises:

    • DBNLNotLoggedInError – dbnl SDK is not logged in

    • DBNLResourceNotFoundError – RunConfig not found

  • Parameters:project – DBNL Project to get the latest RunConfig for

  • Returns: RunConfig from the latest Run

get_run_query

Retrieve a DBNL RunQuery with the given name, unique to a project

  • Parameters:

    • project – The Project from which to retrieve the RunQuery.

    • name – The name of the RunQuery to retrieve.

  • Raises:

    • DBNLNotLoggedInError – dbnl SDK is not logged in

    • DBNLResourceNotFoundError – RunQuery not found

  • Returns: RunQuery with the given name.

Examples:

get_scalar_results

Get scalar results for a Run

  • Parameters:run – The Run from which to retrieve the scalar results.

  • Raises:

    • DBNLNotLoggedInError – dbnl SDK is not logged in

    • DBNLInputValidationError – Input does not conform to expected format

    • DBNLDownloadResultsError – Failed to download results (e.g. Run is not closed)

  • Returns: A pandas DataFrame of the scalar results for the Run.

IMPORTANT

You can only retrieve results for a Run that has been closed.

Examples:

import_project_from_json

Create a new Project from a JSON object

  • Parameters:params – JSON object representing the Project, generally based on a Project exported via export_project_as_json(). See export_project_as_json() for the expected format.

  • Raises:

    • DBNLNotLoggedInError – dbnl SDK is not logged in

    • DBNLAPIValidationError – DBNL API failed to validate the request

    • DBNLConflictingProjectError – Project with the same name already exists

  • Returns: Project created from the JSON object

Examples:

login

Setup dbnl SDK to make authenticated requests. After login is run successfully, the dbnl client will be able to issue secure and authenticated requests against hosted endpoints of the dbnl service.

  • Parameters:

    • api_token – DBNL API token for authentication; token can be found at /tokens page of the DBNL app. If None is provided, the environment variable DBNL_API_TOKEN will be used by default.

    • namespace_id – DBNL namespace ID to use for the session; available namespaces can be found with get_my_namespaces().

    • api_url – The base url of the Distributional API. For SaaS users, set this variable to api.dbnl.com. For other users, please contact your sys admin. If None is provided, the environment variable DBNL_API_URL will be used by default.

    • app_url – An optional base url of the Distributional app. If this variable is not set, the app url is inferred from the DBNL_API_URL variable. For on-prem users, please contact your sys admin if you cannot reach the Distributional UI.
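
A minimal sketch of an explicit login; both values can instead come from the DBNL_API_TOKEN and DBNL_API_URL environment variables, and the token below is a placeholder:

import dbnl

dbnl.login(
    api_token="<YOUR_PERSONAL_ACCESS_TOKEN>",  # placeholder token
    api_url="api.dbnl.com",                    # SaaS URL; self-hosted users should use their own
)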

report_column_results

Report all column results to dbnl

  • Parameters:

    • run – The Run that the results will be reported to

    • data – A pandas DataFrame with all the results to report to dbnl. The columns of the DataFrame must match the columns of the Run’s schema.

  • Raises:

    • DBNLNotLoggedInError – dbnl SDK is not logged in

    • DBNLInputValidationError – Input does not conform to expected format

IMPORTANT

All data should be reported to dbnl at once. Calling dbnl.report_column_results more than once will overwrite the previously uploaded data.

WARNING

Once a Run is closed, you can no longer call report_column_results to send data to dbnl.

Examples:

report_results

Report all results to dbnl

  • Parameters:

    • run – The Run that the results will be reported to

    • column_data – A pandas DataFrame with all the results to report to dbnl. The columns of the DataFrame must match the columns of the Run’s schema.

    • scalar_data – A dictionary or single-row pandas DataFrame with the scalar results to report to dbnl, defaults to None.

  • Raises:

    • DBNLNotLoggedInError – dbnl SDK is not logged in

    • DBNLInputValidationError – Input does not conform to expected format

IMPORTANT

All data should be reported to dbnl at once. Calling dbnl.report_results more than once will overwrite the previously uploaded data.

WARNING

Once a Run is closed, you can no longer call report_results to send data to dbnl.

Examples:

report_run_with_results

Create a new Run, report results to it, and close it.

  • Parameters:

    • project – The Project to create the Run in.

    • column_data – A pandas DataFrame with the results for the columns.

    • scalar_data – An optional dictionary or DataFrame with the results for the scalars, if any.

    • display_name – An optional display name for the Run.

    • index – An optional list of column names to use as the unique identifier for rows in the column data.

    • run_schema – An optional RunSchema to use for the Run. Will be inferred from the data if not provided.

    • metadata – Any additional key:value pairs you want to track.

    • wait_for_close – If True, the function will block for up to 3 minutes until the Run is closed, defaults to True.

    • row_id – (Deprecated) Do not use. Use index instead.

    • run_config_id – (Deprecated) Do not use. Use run_schema instead.

  • Raises:

    • DBNLNotLoggedInError – dbnl SDK is not logged in

    • DBNLInputValidationError – Input does not conform to expected format

  • Returns: The closed Run with the uploaded data.

IMPORTANT

If no schema is provided, the schema will be inferred from the data. If provided, the schema will be used to validate the data.

Examples:

Implicit Schema

Explicit Schema

report_run_with_results_and_start_test_session

Create a new Run, report results to it, and close it. Wait for close to finish and start a TestSession with the given inputs.

  • Parameters:

    • project – The Project to create the Run in.

    • column_data – A pandas DataFrame with the results for the columns.

    • scalar_data – An optional dictionary or DataFrame with the results for the scalars, if any.

    • display_name – An optional display name for the Run.

    • index – An optional list of column names to use as the unique identifier for rows in the column data.

    • run_schema – An optional RunSchema to use for the Run. Will be inferred from the data if not provided.

    • metadata – Any additional key:value pairs you want to track.

    • wait_for_close – If True, the function will block for up to 3 minutes until the Run is closed, defaults to True.

    • baseline – DBNL Run or RunQuery to use as the baseline run, defaults to None. If None, the baseline defined in the TestConfig is used.

    • include_tags – Optional list of Test Tag names to include in the Test Session.

    • exclude_tags – Optional list of Test Tag names to exclude from the Test Session.

    • require_tags – Optional list of Test Tag names to require in the Test Session.

    • row_id – (Deprecated) Do not use. Use index instead.

    • run_config_id – (Deprecated) Do not use. Use run_schema instead.

  • Raises:

    • DBNLNotLoggedInError – dbnl SDK is not logged in

    • DBNLInputValidationError – Input does not conform to expected format

  • Returns: The closed Run with the uploaded data.

IMPORTANT

If no schema is provided, the schema will be inferred from the data. If provided, the schema will be used to validate the data.

Examples:

report_scalar_results

Report scalar results to dbnl

  • Parameters:

    • run – The Run that the scalars will be reported to

    • data – A dictionary or single-row pandas DataFrame with the scalar results to report to dbnl.

  • Raises:

    • DBNLNotLoggedInError – dbnl SDK is not logged in

    • DBNLInputValidationError – Input does not conform to expected format

IMPORTANT

All data should be reported to dbnl at once. Calling dbnl.report_scalar_results more than once will overwrite the previously uploaded data.

WARNING

Once a Run is closed, you can no longer call report_scalar_results to send data to dbnl.

Examples:

set_run_as_baseline

Set the given Run as the Baseline Run in the Project’s Test Config

  • Parameters:run – The Run to set as the Baseline Run.

  • Raises:DBNLResourceNotFoundError – If the test configurations are not found for the project.

set_run_query_as_baseline

Set a given RunQuery as the Baseline Run in a Project’s Test Config

  • Parameters:run_query – The RunQuery to set as the Baseline RunQuery.

  • Raises:DBNLResourceNotFoundError – If the test configurations are not found for the project.
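
A minimal sketch of setting a dynamic baseline so that every Test Session compares against the previous Run; the project and query names are hypothetical:

import dbnl

dbnl.login()

project = dbnl.get_or_create_project(name="baseline-example")
run_query = dbnl.create_run_query(
    project=project,
    name="previous run",
    query={"offset_from_now": 1},
)
dbnl.set_run_query_as_baseline(run_query=run_query)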

wait_for_run_close

Wait for a Run to close. Polls every polling_interval_s seconds until it is closed.

  • Parameters:

    • run – Run to wait for

    • timeout_s – Total wait time (in seconds) for Run to close, defaults to 180.0

    • polling_interval_s – Time between polls (in seconds), defaults to 3.0

  • Raises:

    • DBNLNotLoggedInError – dbnl SDK is not logged in

    • DBNLError – Run did not close after waiting for the timeout_s seconds

For access to the Helm chart and to get registry credentials, please reach out to our team.

A Kubernetes cluster (e.g. EKS, GKE).

An Ingress or Gateway controller (e.g. aws-load-balancer-controller, ingress-gce).

A PostgreSQL database (e.g. RDS, CloudSQL).

An object store bucket (e.g. S3, GCS) to store raw data.

A Redis database (e.g. ElastiCache, Memorystore) to act as a messaging queue.

An RSA key pair to sign the personal access tokens.

Install kubectl and set the Kubernetes cluster context.

Install helm.

The Helm chart can be installed directly using helm install or using your chart release management tool of choice such as ArgoCD or FluxCD.

All data accesses are mediated by the API, ensuring the enforcement of access controls. For more details, see the permissions documentation.

Direct object store access is required to upload or download raw run data using the SDK. Time- and scope-limited credentials are used to provide limited direct access, ensuring only data for a specific run is accessible and that it is only accessible for a limited time.

The LLM-as-judge metrics in dbnl.eval support OpenAI, Azure OpenAI and any other third-party LLM / embedding model provider that is compatible with the OpenAI python client. Specifically, third-party endpoints should (mostly) adhere to the schema of:

The chat completions endpoint for LLMs

The embeddings endpoint for embedding models
dbnl.eval.metrics.answer_quality_llm_accuracy(input: str, context: str, prediction: str, eval_llm_client: LLMClient) → [Metric](#dbnl.eval.metrics.Metric)
dbnl.eval.metrics.answer_quality_llm_answer_correctness(input: str, prediction: str, target: str, eval_llm_client: LLMClient) → [Metric](#dbnl.eval.metrics.Metric)
dbnl.eval.metrics.answer_quality_llm_answer_similarity(input: str, prediction: str, target: str, eval_llm_client: LLMClient) → [Metric](#dbnl.eval.metrics.Metric)
dbnl.eval.metrics.answer_quality_llm_coherence(prediction: str, eval_llm_client: LLMClient) → [Metric](#dbnl.eval.metrics.Metric)
dbnl.eval.metrics.answer_quality_llm_commital(prediction: str, eval_llm_client: LLMClient) → [Metric](#dbnl.eval.metrics.Metric)
dbnl.eval.metrics.answer_quality_llm_completeness(input: str, prediction: str, eval_llm_client: LLMClient) → [Metric](#dbnl.eval.metrics.Metric)
dbnl.eval.metrics.answer_quality_llm_contextual_relevance(input: str, context: str, eval_llm_client: LLMClient) → [Metric](#dbnl.eval.metrics.Metric)
dbnl.eval.metrics.answer_quality_llm_faithfulness(input: str, context: str, prediction: str, eval_llm_client: LLMClient) → [Metric](#dbnl.eval.metrics.Metric)
dbnl.eval.metrics.answer_quality_llm_grammar_accuracy(prediction: str, eval_llm_client: LLMClient) → [Metric](#dbnl.eval.metrics.Metric)
dbnl.eval.metrics.answer_quality_llm_metrics(input: str | None, prediction: str, context: str | None, target: str | None, eval_llm_client: LLMClient) → list[[Metric](#dbnl.eval.metrics.Metric)]
dbnl.eval.metrics.answer_quality_llm_originality(prediction: str, eval_llm_client: LLMClient) → [Metric](#dbnl.eval.metrics.Metric)
dbnl.eval.metrics.answer_quality_llm_relevance(input: str, context: str, prediction: str, eval_llm_client: LLMClient) → [Metric](#dbnl.eval.metrics.Metric)
dbnl.eval.metrics.answer_viability_llm_metrics(prediction: str, eval_llm_client: LLMClient) → list[[Metric](#dbnl.eval.metrics.Metric)]
dbnl.eval.metrics.answer_viability_llm_reading_complexity(prediction: str, eval_llm_client: LLMClient) → [Metric](#dbnl.eval.metrics.Metric)
dbnl.eval.metrics.answer_viability_llm_sentiment_assessment(prediction: str, eval_llm_client: LLMClient) → [Metric](#dbnl.eval.metrics.Metric)
dbnl.eval.metrics.answer_viability_llm_text_fluency(prediction: str, eval_llm_client: LLMClient) → [Metric](#dbnl.eval.metrics.Metric)
dbnl.eval.metrics.answer_viability_llm_text_toxicity(prediction: str, eval_llm_client: LLMClient) → [Metric](#dbnl.eval.metrics.Metric)
dbnl.eval.metrics.automated_readability_index(text_col_name: str) → [Metric](#dbnl.eval.metrics.Metric)
dbnl.eval.metrics.bleu(prediction: str, target: str) → [Metric](#dbnl.eval.metrics.Metric)
dbnl.eval.metrics.character_count(text_col_name: str) → [Metric](#dbnl.eval.metrics.Metric)
dbnl.eval.metrics.context_hit(ground_truth_document_id: str, retrieved_document_ids: str) → [Metric](#dbnl.eval.metrics.Metric)
dbnl.eval.metrics.count_metrics(text_col_name: str) → list[[Metric](#dbnl.eval.metrics.Metric)]
dbnl.eval.metrics.flesch_kincaid_grade(text_col_name: str) → [Metric](#dbnl.eval.metrics.Metric)
dbnl.eval.metrics.ground_truth_non_llm_answer_metrics(prediction: str, target: str) → list[[Metric](#dbnl.eval.metrics.Metric)]
dbnl.eval.metrics.ground_truth_non_llm_retrieval_metrics(ground_truth_document_id: str, retrieved_document_ids: str) → list[[Metric](#dbnl.eval.metrics.Metric)]
dbnl.eval.metrics.inner_product_retrieval(ground_truth_document_text: str, top_retrieved_document_text: str, eval_embedding_client: EmbeddingClient) → [Metric](#dbnl.eval.metrics.Metric)
dbnl.eval.metrics.inner_product_target_prediction(prediction: str, target: str, eval_embedding_client: EmbeddingClient) → [Metric](#dbnl.eval.metrics.Metric)
dbnl.eval.metrics.levenshtein(prediction: str, target: str) → [Metric](#dbnl.eval.metrics.Metric)
dbnl.eval.metrics.mrr(ground_truth_document_id: str, retrieved_document_ids: str) → [Metric](#dbnl.eval.metrics.Metric)
dbnl.eval.metrics.non_llm_non_ground_truth_metrics(prediction: str) → list[[Metric](#dbnl.eval.metrics.Metric)]
dbnl.eval.metrics.quality_llm_text_similarity(prediction: str, target: str, eval_llm_client: LLMClient) → [Metric](#dbnl.eval.metrics.Metric)
dbnl.eval.metrics.question_and_answer_metrics(prediction: str, target: str | None = None, input: str | None = None, context: str | None = None, ground_truth_document_id: str | None = None, retrieved_document_ids: str | None = None, ground_truth_document_text: str | None = None, top_retrieved_document_text: str | None = None, eval_llm_client: LLMClient | None = None, eval_embedding_client: EmbeddingClient | None = None) → list[[Metric](#dbnl.eval.metrics.Metric)]
dbnl.eval.metrics.question_and_answer_metrics_extended(prediction: str, target: str | None = None, input: str | None = None, context: str | None = None, ground_truth_document_id: str | None = None, retrieved_document_ids: str | None = None, ground_truth_document_text: str | None = None, top_retrieved_document_text: str | None = None, eval_llm_client: LLMClient | None = None, eval_embedding_client: EmbeddingClient | None = None) → list[[Metric](#dbnl.eval.metrics.Metric)]
dbnl.eval.metrics.rouge1(prediction: str, target: str, score_type: [RougeScoreType](#dbnl.eval.metrics.RougeScoreType) = RougeScoreType.FMEASURE) → [Metric](#dbnl.eval.metrics.Metric)
dbnl.eval.metrics.rouge2(prediction: str, target: str, score_type: [RougeScoreType](#dbnl.eval.metrics.RougeScoreType) = RougeScoreType.FMEASURE) → [Metric](#dbnl.eval.metrics.Metric)
dbnl.eval.metrics.rougeL(prediction: str, target: str, score_type: [RougeScoreType](#dbnl.eval.metrics.RougeScoreType) = RougeScoreType.FMEASURE) → [Metric](#dbnl.eval.metrics.Metric)
dbnl.eval.metrics.rougeLsum(prediction: str, target: str, score_type: [RougeScoreType](#dbnl.eval.metrics.RougeScoreType) = RougeScoreType.FMEASURE) → [Metric](#dbnl.eval.metrics.Metric)
dbnl.eval.metrics.rouge_metrics(prediction: str, target: str) → list[[Metric](#dbnl.eval.metrics.Metric)]
dbnl.eval.metrics.sentence_count(text_col_name: str) → [Metric](#dbnl.eval.metrics.Metric)
dbnl.eval.metrics.summarization_metrics(prediction: str, target: str | None = None, eval_embedding_client: EmbeddingClient | None = None) → list[[Metric](#dbnl.eval.metrics.Metric)]
dbnl.eval.metrics.text_metrics(prediction: str, target: str | None = None, eval_llm_client: LLMClient | None = None, eval_embedding_client: EmbeddingClient | None = None) → list[[Metric](#dbnl.eval.metrics.Metric)]
dbnl.eval.metrics.text_monitor_metrics(columns: list[str], eval_llm_client: LLMClient | None = None) → list[[Metric](#dbnl.eval.metrics.Metric)]
dbnl.eval.metrics.token_count(text_col_name: str) → [Metric](#dbnl.eval.metrics.Metric)
dbnl.eval.metrics.word_count(text_col_name: str) → [Metric](#dbnl.eval.metrics.Metric)
import os

from openai import OpenAI
from dbnl.eval.llm import OpenAILLMClient
from dbnl.eval.embedding_clients import OpenAIEmbeddingClient

# create client for LLM-as-judge metrics
base_oai_client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
eval_llm_client = OpenAILLMClient.from_existing_client(
    base_oai_client, llm_model="gpt-3.5-turbo-0125"
)

embd_client = OpenAIEmbeddingClient.from_existing_client(
    base_oai_client, embedding_model="text-embedding-ada-002"
)
import os

from openai import AzureOpenAI
from dbnl.eval.llm import AzureOpenAILLMClient
from dbnl.eval.embedding_clients import AzureOpenAIEmbeddingClient

base_azure_oai_client = AzureOpenAI(
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version=os.environ["OPENAI_API_VERSION"], # eg 2023-12-01-preview
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"] # eg https://resource-name.openai.azure.com
)
eval_llm_client = AzureOpenAILLMClient.from_existing_client(
    base_azure_oai_client, llm_model="gpt-35-turbo-16k"
)
embd_client = AzureOpenAIEmbeddingClient.from_existing_client(
    base_azure_oai_client, embedding_model="text-embedding-ada-002"
)
import os

from openai import OpenAI
from dbnl.eval.llm import OpenAILLMClient
base_oai_client = OpenAI(
    api_key=os.environ["TOGETHERAI_API_KEY"],
    base_url="https://api.together.xyz/v1",
)

eval_llm_client = OpenAILLMClient.from_existing_client(
    base_oai_client, llm_model='meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo'
)
{'code': '429', 'message': 'Requests to the Embeddings_Create Operation under 
  Azure OpenAI API version XXXX have exceeded call rate limit of your current 
  OpenAI pricing tier. Please retry after 86400 seconds. 
  Please go here: https://aka.ms/oai/quotaincrease if you would 
  like to further increase the default rate limit.'}
{'message': 'You have been rate limited. Your rate limit is YYY queries per
minute. Please navigate to https://www.together.ai/forms/rate-limit-increase 
to request a rate limit increase.', 'type': 'credit_limit', 
'param': None, 'code': None}
{'message': 'Rate limit reached for gpt-4 in organization XXXX on 
tokens per min (TPM): Limit WWWWW, Used YYYY, Requested ZZZZ. 
Please try again in 1.866s. Visit https://platform.openai.com/account/rate-limits 
to learn more.', 'type': 'tokens', 'param': None, 'code': 'rate_limit_exceeded'}
dbnl.experimental.create_test(*, test_spec_dict: TestSpecDict) → dict[str, Any]
{
    "project_id": string,

    # Test data
    "name": string, // must be unique to Project
    "description": string | null,
    "statistic_name": string,
    "statistic_params": map[string, any],
    "statistic_inputs": list[
        {
            "select_query_template": {
                "select": string, // a column or a function on column(s)
                "filter": string | null
            }
        }
    ],
    "assertion": {
        "name": string,
        "params": map[string, any]
    },
    "tag_ids": string[] | null
}
dbnl.experimental.create_test_generation_session(*, run: Run, columns: list[str | dict[Literal['name'], str]] | None = None) → TestGenerationSession
import dbnl
dbnl.login()


run = dbnl.get_run(run_id="run_0000000")
dbnl.experimental.create_test_generation_session(
    run=run,
    columns=["col1", "col4"],
)
dbnl.experimental.create_test_recalibration_session(*, test_session: TestSession, feedback: str, test_ids: list[str] | None = None) → TestRecalibrationSession
dbnl.experimental.get_or_create_tag(*, project_id: str, name: str, description: str | None = None) → dict[str, Any]
{
    # Tag metadata
    "id": string,
    "org_id": string,
    "created_at": timestamp,
    "updated_at": timestamp,

    # Tag data
    "name": string,
    "author_id": string,
    "description": string?,
    "project_id": string,
}
dbnl.experimental.get_test_sessions(*, project: Project) → list[TestSession]
dbnl.experimental.get_tests(*, test_session_id: str) → list[dict[str, Any]]
{
    # Test metadata
    "id": string,
    "org_id": string,
    "created_at": timestamp,
    "updated_at": timestamp,
    "test_session_id": string,

    # Test data
    "author_id": string,
    "value": any?,
    "failure": string?,
    "status": enum(PENDING, RUNNING, PASSED, FAILED),
    "started_at": timestamp?,
    "completed_at": timestamp?,

    # Test Spec data
    "test_spec_id": id,
    "name": string,
    "description": string?,
    "statistic_name": string,
    "statistic_params": map[string, any],
    "assertion": {
        "name": string,
        "params": map[string, any]
        "status": enum(...),
        "failure": string?
    },
    "statistic_inputs": list[
        {
        "select_query_template": {
            "select": string
        }
        }
    ],
    "tag_ids": string[]?,
}
dbnl.experimental.prepare_incomplete_test_spec_payload(*, test_spec_dict: IncompleteTestSpecDict, project_id: str | None = None) → TestSpecDict
dbnl.experimental.wait_for_test_generation_session(*, test_generation_session: TestGenerationSession, timeout_s: int = 180) → TestGenerationSession
dbnl.experimental.wait_for_test_recalibration_session(*, test_recalibration_session: TestRecalibrationSession, timeout_s: int = 180) → TestRecalibrationSession
dbnl.experimental.wait_for_test_session(*, test_session: TestSession, timeout_s: int = 180) → TestSession
programmatically created
create_test
dbnl.close_run(*, run: Run, wait_for_close: bool = True) → None
dbnl.copy_project(*, project: Project, name: str, description: str | None = None) → Project
import dbnl
dbnl.login()


proj1 = dbnl.get_or_create_project(name="test_proj1")
proj2 = dbnl.copy_project(project=proj1, name="test_proj2")

assert proj2.name == "test_proj2"
dbnl.create_metric(*, project: Project, name: str, expression_template: str, description: str | None = None, greater_is_better: bool | None = None) → Metric
dbnl.create_project(*, name: str, description: str | None = None) → Project
import dbnl
dbnl.login()


proj_1 = dbnl.create_project(name="test_p1")

# DBNLConflictingProjectError: A DBNL Project with name test_p1 already exists.
proj_2 = dbnl.create_project(name="test_p1")
dbnl.create_run(*, project: Project, run_schema: RunSchema | None = None, run_config: RunConfig | None = None, display_name: str | None = None, metadata: dict[str, str] | None = None) → Run
dbnl.create_run_config(*, project: Project, columns: Sequence[RunConfigPrimitiveColumnSchemaDict | RunConfigContainerColumnSchemaDict], scalars: Sequence[RunConfigPrimitiveScalarSchemaDict | RunConfigContainerScalarSchemaDict] | None = None, description: str | None = None, display_name: str | None = None, row_id: list[str] | None = None, components_dag: dict[str, list[str]] | None = None) → RunConfig
dbnl.create_run_config_from_results(project: Project, column_data: DataFrame, scalar_data: dict[str, Any] | DataFrame | None = None, description: str | None = None, display_name: str | None = None, row_id: list[str] | None = None) → RunConfig
dbnl.create_run_query(project: Project, name: str, query: dict[str, Any]) → RunQuery
import dbnl
dbnl.login()


proj1 = dbnl.get_or_create_project(name="test_p1")
run_query1 = dbnl.create_run_query(
    project=proj1,
    name="look back 3",
    query={
        "offset_from_now": 3,
    },
)
dbnl.create_run_schema(columns: Sequence[RunSchemaColumnSchemaDict], scalars: Sequence[RunSchemaScalarSchemaDict] | None = None, index: list[str] | None = None, components_dag: dict[str, list[str]] | None = None) → RunSchema
components_dag={
    "TweetSource": ["EntityExtractor", "SentimentClassifier"],
    "EntityExtractor": ["TradeRecommender"],
    "SentimentClassifier": ["TradeRecommender"],
    "TradeRecommender": [],
    "Global": [],
}
import dbnl
dbnl.login()


proj = dbnl.get_or_create_project(name="test_p1")
schema = dbnl.create_run_schema(
    columns=[
        {"name": "error_type", "type": "category"},
        {"name": "email", "type": "string", "description": "raw email text content from source"},
        {"name": "spam-pred", "type": "boolean"},
    ],
)
import dbnl
dbnl.login()

proj = dbnl.get_or_create_project(name="test_p1")
schema = dbnl.create_run_schema(
    columns=[
        {"name": "error_type", "type": "category", "component": "classifier"},
        {"name": "email", "type": "string", "description": "raw email text content from source", "component": "input"},
        {"name": "spam-pred", "type": "boolean", "component": "classifier"},
        {"name": "email_id", "type": "string", "description": "unique id for each email"},
    ],
    scalars=[
        {"name": "model_F1", "type": "float"},
        {"name": "model_recall", "type": "float"},
    ],
    index=["email_id"],
    components_dag={
        "input": ["classifier"],
        "classifier": [],
    },
)
dbnl.create_run_schema_from_results(column_data: DataFrame, scalar_data: dict[str, Any] | DataFrame | None = None, index: list[str] | None = None) → RunSchema
import dbnl
import pandas as pd

dbnl.login()

column_data = pd.DataFrame({
    "id": [1, 2, 3],
    "question": [
        "What is the meaning of life?",
        "What is the airspeed velocity of an unladen swallow?",
        "What is the capital of Assyria?",
    ],
})
scalar_data = {"int_scalar": 42, "string_scalar": "foobar"}

run_schema = dbnl.create_run_schema_from_results(
    column_data=column_data,
    scalar_data=scalar_data,
    index=["id"],
)
dbnl.create_test_session(*, experiment_run: Run, baseline: Run | RunQuery | None = None, include_tags: list[str] | None = None, exclude_tags: list[str] | None = None, require_tags: list[str] | None = None) → TestSession
import dbnl
dbnl.login()

run = dbnl.get_run(run_id="run_0000000")
# Will default baseline to the Project's Baseline
dbnl.create_test_session(
    experiment_run=run,
)
dbnl.delete_metric(*, metric_id: str) → None
dbnl.export_project_as_json(*, project: Project) → dict[str, Any]
{
    "project": {
        "name": "My Project",
        "description": "This is my project."
    },
    "notification_rules": [
        {
            "conditions": [
                {
                    "assertion_name": "less_than",
                    "assertion_params": { "other": 0.85 },
                    "query_name": "test_status_percentage_query",
                    "query_params": {
                        "exclude_tag_ids": [],
                        "include_tag_ids": [],
                        "require_tag_ids": [],
                        "statuses": ["PASSED"]
                    }
                }
            ],
            "name": "Alert if passed tests are less than 85%",
            "notification_integration_names": ["Notification channel"],
            "status": "ENABLED",
            "trigger": "test_session.failed"
        }
    ],
    "tags": [
        {
            "name": "my-tag",
            "description" :"This is my tag."
        }
    ],
    "test_specs": [
        {
            "assertion": { "name": "less_than", "params": { "other": 0.5 } },
            "description": "Testing the difference in the example statistic",
            "name": "Gr.0: Non Parametric Difference: Example_Statistic",
            "statistic_inputs": [
                {
                    "select_query_template": {
                        "filter": null,
                        "select": "{EXPERIMENT}.Example_Statistic"
                    }
                },
                {
                    "select_query_template": {
                        "filter": null,
                        "select": "{BASELINE}.Example_Statistic"
                    }
                }
            ],
            "statistic_name": "my_stat",
            "statistic_params": {},
            "tag_names": ["my-tag"]
        }
    ]
}
import dbnl
dbnl.login()


proj = dbnl.get_or_create_project(name="test_proj")
export_json = dbnl.export_project_as_json(project=proj)

assert export_json["project"]["name"] == "test_proj"
dbnl.get_column_results(*, run: Run) → DataFrame
import dbnl
import pandas as pd
dbnl.login()


proj = dbnl.get_or_create_project(name="test_p1")
uploaded_data = pd.DataFrame({"error": [0.11, 0.33, 0.52, 0.24]})
run = dbnl.report_run_with_results(
    project=proj,
    column_data=uploaded_data,
)

downloaded_data = dbnl.get_column_results(run=run)
assert downloaded_data.equals(uploaded_data)
dbnl.get_latest_run(project: Project) → Run
dbnl.get_latest_run_config(project: Project) → RunConfig
dbnl.get_metric_by_id(*, metric_id: str) → Metric
dbnl.get_my_namespaces() → list[Any]
dbnl.get_or_create_project(*, name: str, description: str | None = None) → Project
import dbnl
dbnl.login()


proj_1 = dbnl.create_project(name="test_p1")
proj_2 = dbnl.get_or_create_project(name="test_p1")

# Calling get_or_create_project will yield same Project object
assert proj_1.id == proj_2.id
dbnl.get_project(*, name: str) → Project
import dbnl
dbnl.login()


proj_1 = dbnl.create_project(name="test_p1")
proj_2 = dbnl.get_project(name="test_p1")

# Calling get_project will yield same Project object
assert proj_1.id == proj_2.id

# DBNLProjectNotFoundError: A dnnl Project with name not_exist does not exist
proj_3 = dbnl.get_project(name="not_exist")
dbnl.get_results(*, run: Run) → ResultData
import dbnl
import pandas as pd
dbnl.login()


proj = dbnl.get_or_create_project(name="test_p1")

uploaded_data = pd.DataFrame({"error": [0.11, 0.33, 0.52, 0.24]})
run = dbnl.report_run_with_results(
    project=proj,
    column_data=uploaded_data,
)

downloaded_data = dbnl.get_results(run=run)
assert downloaded_data.columns.equals(uploaded_data)
dbnl.get_run(*, run_id: str) → Run
import dbnl
dbnl.login()


proj1 = dbnl.get_or_create_project(name="test_p1")
schema1 = dbnl.create_run_schema(columns=[{"name": "error", "type": "float"}])
run1 = dbnl.create_run(project=proj1, run_schema=schema1)

# Retrieving the Run by ID
run2 = dbnl.get_run(run_id=run1.id)
assert run1.id == run2.id

# DBNLRunNotFoundError: A Run with id run_0000000 does not exist.
run3 = dbnl.get_run(run_id="run_0000000")
dbnl.get_run_config(*, run_config_id: str) → RunConfig
dbnl.get_run_config_from_latest_run(project: Project) → RunConfig | None
dbnl.get_run_query(project: Project, name: str) → RunQuery
import dbnl
dbnl.login()


proj1 = dbnl.get_or_create_project(name="test_p1")
run_query1 = dbnl.get_run_query(
    project=project,
    name="look back 3"
)
dbnl.get_scalar_results(*, run: Run) → DataFrame
import dbnl
import pandas as pd
dbnl.login()

proj1 = dbnl.get_or_create_project(name="test_p1")

data = pd.DataFrame({"error": [0.11, 0.33, 0.52, 0.24]})
run = dbnl.report_run_with_results(
    project=proj1,
    column_data=data,
    scalar_data={"rmse": 0.37},
)

downloaded_scalars = dbnl.get_scalar_results(run=run)
dbnl.import_project_from_json(*, params: dict[str, Any]) → Project
import dbnl
dbnl.login()


proj1 = dbnl.get_or_create_project(name="test_proj1")
export_json = dbnl.export_project_as_json(project=proj1)
export_json["project"]["name"] = "test_proj2"
proj2 = dbnl.import_project_from_json(params=export_json)

assert proj2.name == "test_proj2"
dbnl.login(*, api_token: str | None = None, namespace_id: str | None = None, api_url: str | None = None, app_url: str | None = None) → None
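As an illustrative sketch, logging in with an explicit token and namespace; the environment variables below mirror those documented for the CLI and are assumed to be set.

import os

import dbnl

# DBNL_API_TOKEN and DBNL_NAMESPACE_ID are assumed to be set in the environment.
dbnl.login(
    api_token=os.environ["DBNL_API_TOKEN"],
    namespace_id=os.environ.get("DBNL_NAMESPACE_ID"),
)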
dbnl.report_column_results(*, run: Run, data: DataFrame) → None
import dbnl
import pandas as pd
dbnl.login()


proj1 = dbnl.get_or_create_project(name="test_p1")
schema1 = dbnl.create_run_schema(columns=[{"name": "error", "type": "float"}])
run1 = dbnl.create_run(project=proj1, run_schema=schema1)

data = pd.DataFrame({"error": [0.11, 0.33, 0.52, 0.24]})
dbnl.report_column_results(run=run1, data=data)
dbnl.report_results(*, run: Run, column_data: DataFrame, scalar_data: dict[str, Any] | DataFrame | None = None) → None
import dbnl
import pandas as pd
dbnl.login()


proj1 = dbnl.get_or_create_project(name="test_p1")
schema1 = dbnl.create_run_schema(
    columns=[{"name": "error", "type": "float"}],
    scalars=[{"name": "rmse", "type": "float"}],
)
run1 = dbnl.create_run(project=proj1, run_schema=schema1)
data = pd.DataFrame({"error": [0.11, 0.33, 0.52, 0.24]})
dbnl.report_results(run=run1, column_data=data, scalar_data={"rmse": 0.37})
dbnl.report_run_with_results(project: Project, column_data: DataFrame, scalar_data: dict[str, Any] | DataFrame | None = None, display_name: str | None = None, row_id: list[str] | None = None, index: list[str] | None = None, run_config_id: str | None = None, run_schema: RunSchema | None = None, metadata: dict[str, str] | None = None, wait_for_close: bool = True) → Run
import dbnl
import pandas as pd
dbnl.login()


proj = dbnl.get_or_create_project(name="test_p1")
test_data = pd.DataFrame({"error": [0.11, 0.33, 0.52, 0.24]})

run = dbnl.report_run_with_results(
    project=proj,
    column_data=test_data,
)
import dbnl
import pandas as pd
dbnl.login()


proj = dbnl.get_or_create_project(name="test_p1")
test_data = pd.DataFrame({"error": [0.11, 0.33, 0.52, 0.24]})
run_schema = dbnl.create_run_schema(columns=[
    {"name": "error", "type": "float"}
])

run = dbnl.report_run_with_results(
    project=proj,
    column_data=test_data,
    run_schema=run_schema
)

try:
    run_schema = dbnl.create_run_schema(columns=[
        {"name": "error", "type": "string"}
    ])
    dbnl.report_run_with_results(
        project=proj,
        column_data=test_data,
        run_schema=run_schema
    )
except DBNLInputValidationError:
    # We expect DBNLInputValidationError because the type of
    # `error` in the input data is "float", but we provided a `RunSchema`
    # which specifies the column type as "string".
    assert True
else:
    # should not get here
    assert False
dbnl.report_run_with_results_and_start_test_session(*, project: Project, column_data: DataFrame, scalar_data: dict[str, Any] | DataFrame | None = None, display_name: str | None = None, row_id: list[str] | None = None, index: list[str] | None = None, run_config_id: str | None = None, run_schema: RunSchema | None = None, metadata: dict[str, str] | None = None, baseline: Run | RunQuery | None = None, include_tags: list[str] | None = None, exclude_tags: list[str] | None = None, require_tags: list[str] | None = None) → Run
import dbnl
import pandas as pd
dbnl.login()


proj = dbnl.get_or_create_project(name="test_p1")
test_data = pd.DataFrame({"error": [0.11, 0.33, 0.52, 0.24]})

run = dbnl.report_run_with_results_and_start_test_session(
    project=proj,
    column_data=test_data,
)
dbnl.report_scalar_results(*, run: Run, data: dict[str, Any] | DataFrame) → None
import dbnl
import pandas as pd
dbnl.login()


proj1 = dbnl.get_or_create_project(name="test_p1")
schema1 = dbnl.create_run_schema(
    columns=[{"name": "error", "type": "float"}],
    scalars=[{"name": "rmse", "type": "float"}],
)
run1 = dbnl.create_run(project=proj1, run_schema=schema1)
dbnl.report_scalar_results(run=run1, data={"rmse": 0.37})
dbnl.set_run_as_baseline(*, run: Run) → None
dbnl.set_run_query_as_baseline(*, run_query: RunQuery) → None
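A minimal sketch of pointing the project baseline at a saved run query; it assumes a RunQuery named "look back 3" already exists in the project.

import dbnl
dbnl.login()


proj = dbnl.get_or_create_project(name="test_p1")

# Assumes a RunQuery named "look back 3" was previously created in this project.
run_query = dbnl.get_run_query(project=proj, name="look back 3")
dbnl.set_run_query_as_baseline(run_query=run_query)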
dbnl.wait_for_run_close(*, run: Run, timeout_s: float = 180.0, polling_interval_s: float = 3.0) → Run
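A minimal sketch of explicitly waiting for a Run to close after reporting results; dbnl.close_run is used here as in the eval examples later on this page.

import dbnl
import pandas as pd
dbnl.login()


proj = dbnl.get_or_create_project(name="test_p1")
schema = dbnl.create_run_schema(columns=[{"name": "error", "type": "float"}])
run = dbnl.create_run(project=proj, run_schema=schema)

dbnl.report_column_results(run=run, data=pd.DataFrame({"error": [0.11, 0.33, 0.52, 0.24]}))
dbnl.close_run(run=run)

# Block until the Run is closed (up to timeout_s seconds).
run = dbnl.wait_for_run_close(run=run, timeout_s=180.0)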

CLI

The dbnl CLI is installed as part of the SDK and allows for interacting with the dbnl platform from the command line.

To install the SDK, run:

pip install dbnl

dbnl

The dbnl CLI.

dbnl [OPTIONS] COMMAND [ARGS]...

Options

--version

Show the version and exit.

info

Info about SDK and API.

dbnl info [OPTIONS]

login

Login to dbnl.

dbnl login [OPTIONS] API_TOKEN

Options

--api-url <api_url>

API url

--app-url <app_url>

App url

--namespace-id <namespace_id>

Namespace id

Arguments

API_TOKEN

Required argument

Environment variables

DBNL_API_TOKEN

Provide a default for API_TOKEN

DBNL_API_URL

Provide a default for --api-url

DBNL_APP_URL

Provide a default for --app-url

DBNL_NAMESPACE_ID

Provide a default for --namespace-id

logout

Logout of dbnl.

dbnl logout [OPTIONS]

sandbox

Subcommand to interact with the sandbox.

dbnl sandbox [OPTIONS] COMMAND [ARGS]...

delete

Delete sandbox data.

dbnl sandbox delete [OPTIONS]

exec

Exec a command on the sandbox.

dbnl sandbox exec [OPTIONS] [COMMAND]...

Arguments

COMMAND

Optional argument(s)

logs

Tail the sandbox logs.

dbnl sandbox logs [OPTIONS]

start

Start the sandbox.

dbnl sandbox start [OPTIONS]

Options

-u, --registry-username <registry_username>

Registry username

-p, --registry-password <registry_password>

Registry password (required)

--registry

Registry

  • Default: 'us-docker.pkg.dev/dbnlai/images'

--version

Sandbox version

  • Default: '0.23'

--base-url <base_url>

Sandbox base url

  • Default: 'http://localhost:8080'

status

Get sandbox status.

dbnl sandbox status [OPTIONS]

stop

Stop the sandbox.

dbnl sandbox stop [OPTIONS]

Eval Module

Many generative AI applications focus on text generation, and it can be challenging to create metrics that give insight into expected performance when working with unstructured text.

dbnl.eval is a special module designed for evaluating unstructured text. This module currently includes:

  • Adaptive metric sets for generic text and RAG applications

  • 12+ simple statistical local library powered text metrics

  • 15+ LLM-as-judge and embedding powered text metrics

  • Support for user-defined custom LLM-as-judge metrics

  • LLM-as-judge metrics compatible with OpenAI and Azure OpenAI

Building dbnl tests on these evaluation metrics can then drive rich insights into an AI application's stability and performance.

Classes

Project

dbnl.sdk.models.Project(id: 'str', name: 'str', description: 'Optional[str]' = None)

description : str | None = None

id : str

name : str

Run

dbnl.sdk.models.Run(id: 'str', project_id: 'str', run_config_id: 'Optional[str]' = None, display_name: 'Optional[str]' = None, metadata: 'Optional[dict[str, str]]' = None, run_config: 'Optional[RunConfig]' = None, run_schema: 'Optional[RunSchema]' = None, status: "Optional[Literal['pending', 'closing', 'closed']]" = None)

display_name : str | None = None

id : str

metadata : dict[str, str] | None = None

project_id : str

run_config : RunConfig | None = None

run_config_id : str | None = None

run_schema : RunSchema | None = None

status : Literal['pending', 'closing', 'closed'] | None = None

RunQuery

dbnl.sdk.models.RunQuery(id: 'str', project_id: 'str', name: 'str', query: 'dict[str, Any]')

id : str

name : str

project_id : str

query : dict[str, Any]

TestSession

dbnl.sdk.models.TestSession(id: 'str', project_id: 'str', inputs: 'list[TestSessionInput]', status: "Literal['PENDING', 'RUNNING', 'PASSED', 'FAILED']", failure: 'Optional[str]' = None, num_tests_passed: 'Optional[int]' = None, num_tests_failed: 'Optional[int]' = None, num_tests_errored: 'Optional[int]' = None, include_tag_ids: 'Optional[list[str]]' = None, exclude_tag_ids: 'Optional[list[str]]' = None, require_tag_ids: 'Optional[list[str]]' = None)

exclude_tag_ids : list[str] | None = None

failure : str | None = None

id : str

include_tag_ids : list[str] | None = None

inputs : list[TestSessionInput]

num_tests_errored : int | None = None

num_tests_failed : int | None = None

num_tests_passed : int | None = None

project_id : str

require_tag_ids : list[str] | None = None

status : Literal['PENDING', 'RUNNING', 'PASSED', 'FAILED']

TestRecalibrationSession

dbnl.sdk.models.TestRecalibrationSession(id: 'str', project_id: 'str', test_session_id: 'str', feedback: 'str', status: "Literal['PENDING', 'RUNNING', 'COMPLETED', 'FAILED']", test_ids: 'Optional[list[str]]' = None, failure: 'Optional[str]' = None)

failure : str | None = None

feedback : str

id : str

project_id : str

status : Literal['PENDING', 'RUNNING', 'COMPLETED', 'FAILED']

test_ids : list[str] | None = None

test_session_id : str

TestGenerationSession

dbnl.sdk.models.TestGenerationSession(id: 'str', project_id: 'str', run_id: 'str', status: "Literal['PENDING', 'RUNNING', 'COMPLETED', 'FAILED']", columns: 'Optional[list[dict[str, str]]]' = None, failure: 'Optional[str]' = None, num_generated_tests: 'Optional[int]' = None)

columns : list[dict[str, str]] | None = None

failure : str | None = None

id : str

num_generated_tests : int | None = None

project_id : str

run_id : str

status : Literal['PENDING', 'RUNNING', 'COMPLETED', 'FAILED']

ResultData

dbnl.sdk.models.ResultData(columns, scalars)

columns : DataFrame

Alias for field number 0

scalars : DataFrame | None

Alias for field number 1

RunSchema

dbnl.sdk.models.RunSchema(columns: 'list[RunSchemaColumnSchema]', scalars: 'Optional[list[RunSchemaScalarSchema]]' = None, index: 'Optional[list[str]]' = None, components_dag: 'Optional[dict[str, list[str]]]' = None)

columns : list[RunSchemaColumnSchema]

components_dag : dict[str, list[str]] | None = None

index : list[str] | None = None

scalars : list[RunSchemaScalarSchema] | None = None


Application Metric Sets

text_metrics()

Basic metrics for generic text comparison and monitoring

  • token_count

  • word_count

  • flesch_kincaid_grade

  • automated_readability_index

  • bleu

  • levenshtein

  • rouge1

  • rouge2

  • rougeL

  • rougeLsum

  • llm_text_toxicity_v0

  • llm_sentiment_assessment_v0

  • llm_reading_complexity_v0

  • llm_grammar_accuracy_v0

  • inner_product

  • llm_text_similarity_v0

question_and_answer_metrics()

Basic metrics for RAG / question answering

  • llm_accuracy_v0

  • llm_completeness_v0

  • answer_similarity_v0

  • faithfulness_v0

  • mrr

  • context_hit

The metric set helpers are adaptive in that:

  1. The metrics returned encode which columns of the dataframe are input to the metric computation, e.g., rougeL__prediction__ground_truth is the rougeL metric run with both the column named prediction and the column named ground_truth as input

  2. The metrics returned support any additional optional column info and LLM-as-judge or embedding model clients. If any of this optional info is not provided, the metric set will exclude any metrics that depend on that information


The metric set helpers return an adaptive list of metrics, relevant to the application type. See the dbnl.eval.metrics reference for details on all the metric functions available in the eval SDK.

See the How-To section for concrete examples of adaptive text_metrics() usage.

See the RAG example for question_and_answer_metrics() usage.

def text_metrics(
    prediction: str,
    target: Optional[str] = None,
    eval_llm_client: Optional[LLMClient] = None,
    eval_embedding_client: Optional[EmbeddingClient] = None,
) -> list[Metric]:
    """
    Returns a set of metrics relevant for a generic text application

    :param prediction: prediction column name (i.e. generated text)
    :param target: target column name (i.e. expected text)
    :return: list of metrics
    """

dbnl.eval

create_run_schema_from_results

dbnl.eval.create_run_schema_from_results(column_data: DataFrame, scalar_data: dict[str, Any] | DataFrame | None = None, index: list[str] | None = None, metrics: Sequence[Metric] | None = None) → RunSchema

Create a new RunSchema from column results, scalar results, and metrics.

This function assumes that the metrics have already been evaluated on the original, un-augmented data. In other words, the column data for the metrics should also be present in the column_data.

  • Parameters:

    • column_data – DataFrame with the results for the columns

    • scalar_data – Dictionary or DataFrame with the results for the scalars, defaults to None

    • index – List of column names that are the unique identifier, defaults to None

    • metrics – List of metrics to report with the run, defaults to None

  • Raises: DBNLInputValidationError – Input does not conform to expected format

  • Returns: RunSchema with the desired schema for columns and scalars, if provided
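A sketch of building a RunSchema from metric-augmented results; the example data and the use of text_metrics mirror the Quick Start further down this page, and no LLM client is assumed.

import dbnl.eval.metrics
import pandas as pd
from dbnl.eval import create_run_schema_from_results, evaluate

eval_df = pd.DataFrame({
    "prediction": ["Paris is the capital"],
    "ground_truth": ["The capital of France is Paris"],
})

# Without an eval_llm_client, only the non-LLM text metrics are returned.
metrics = dbnl.eval.metrics.text_metrics(prediction="prediction", target="ground_truth")

# Evaluate first: the metric columns must already be present in column_data.
aug_eval_df = evaluate(eval_df, metrics)
run_schema = create_run_schema_from_results(column_data=aug_eval_df, metrics=metrics)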

evaluate

dbnl.eval.evaluate(df: DataFrame, metrics: Sequence[Metric], inplace: bool = False) → DataFrame

Evaluates a set of metrics on a dataframe, returning an augmented dataframe.

  • Parameters:

    • df – input dataframe

    • metrics – metrics to compute

    • inplace – whether to modify the input dataframe in place

  • Returns: input dataframe augmented with metrics

get_column_schemas_from_dataframe_and_metrics

dbnl.eval.get_column_schemas_from_dataframe_and_metrics(df: DataFrame, metrics: list[Metric]) → list[RunSchemaColumnSchemaDict]

Gets the run schema column schemas for a dataframe that was augmented with a list of metrics.

  • Parameters:

    • df – Dataframe to get column schemas from

    • metrics – list of metrics added to the dataframe

  • Returns: list of column schemas for the dataframe and metrics
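A sketch of deriving column schemas from a metric-augmented dataframe; the data and metrics mirror the Quick Start below, and feeding the resulting schemas into dbnl.create_run_schema follows the same pattern as that example.

import dbnl
import dbnl.eval.metrics
import pandas as pd
from dbnl.eval import evaluate, get_column_schemas_from_dataframe_and_metrics

eval_df = pd.DataFrame({
    "prediction": ["Paris is the capital"],
    "ground_truth": ["The capital of France is Paris"],
})
metrics = dbnl.eval.metrics.text_metrics(prediction="prediction", target="ground_truth")

# Augment the dataframe with the metric columns, then derive schemas for every column.
aug_eval_df = evaluate(eval_df, metrics)
cols = get_column_schemas_from_dataframe_and_metrics(aug_eval_df, metrics)

# The column schemas can then be used to build a RunSchema, as in the Quick Start below.
run_schema = dbnl.create_run_schema(columns=cols)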

get_column_schemas_from_metrics

dbnl.eval.get_column_schemas_from_metrics(metrics: list[Metric]) → list[RunSchemaColumnSchemaDict]

Gets the run schema column schemas from a list of metrics.

  • Parameters: metrics – list of metrics to get column schemas from

  • Returns: list of column schemas for metrics

get_run_schema_columns_from_metrics

dbnl.eval.get_run_schema_columns_from_metrics(metrics: list[Metric]) → list[RunSchemaColumnSchema]

Gets the run schema column schemas from a list of metrics.

  • Parameters: metrics – list of metrics to get column schemas from

  • Returns: list of column schemas for metrics

report_run_with_results

dbnl.eval.report_run_with_results(project: Project, column_data: DataFrame, scalar_data: dict[str, Any] | DataFrame | None = None, display_name: str | None = None, index: list[str] | None = None, run_schema: RunSchema | None = None, metadata: dict[str, str] | None = None, metrics: Sequence[Metric] | None = None, wait_for_close: bool = True) → Run

Create a new Run, report results to it, and close it.

If run_schema is not provided, a RunSchema will be created from the data. If a run_schema is provided, the results are validated against it.

If metrics are provided, they are evaluated on the column data before reporting.

  • Parameters:

    • project – DBNL Project to create the Run for

    • column_data – DataFrame with the results for the columns

    • scalar_data – Dictionary or DataFrame with the results for the scalars, if any. Defaults to None

    • display_name – Display name for the Run, defaults to None.

    • index – List of column names that are the unique identifier, defaults to None. Only used when creating a new schema.

    • run_schema – RunSchema to use for the Run, defaults to None.

    • metadata – Additional key:value pairs user wants to track, defaults to None

    • metrics – List of metrics to report with the run, defaults to None

    • wait_for_close – If True, the function will block for up to 3 minutes until the Run is closed, defaults to True

  • Raises:

    • DBNLNotLoggedInError – DBNL SDK is not logged in

    • DBNLInputValidationError – Input does not conform to expected format

  • Returns: Run, after reporting results and closing it
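A sketch of reporting a run where the metrics are evaluated for you before reporting; the OpenAI client setup mirrors the Quick Start below and assumes OPENAI_API_KEY is set.

import os

import dbnl
import dbnl.eval.metrics
import pandas as pd
from dbnl.eval.llm import OpenAILLMClient
from openai import OpenAI

dbnl.login()

proj = dbnl.get_or_create_project(name="test_p1")
eval_df = pd.DataFrame({
    "prediction": ["Paris is the capital"],
    "ground_truth": ["The capital of France is Paris"],
})

# LLM client to power the LLM-as-judge metrics, as in the Quick Start.
base_oai_client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
oai_client = OpenAILLMClient.from_existing_client(base_oai_client, llm_model="gpt-3.5-turbo-0125")
metrics = dbnl.eval.metrics.text_metrics(
    prediction="prediction", target="ground_truth", eval_llm_client=oai_client
)

# The metrics are evaluated on column_data before the Run is reported and closed.
run = dbnl.eval.report_run_with_results(
    project=proj,
    column_data=eval_df,
    metrics=metrics,
)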

report_run_with_results_and_start_test_session

dbnl.eval.report_run_with_results_and_start_test_session(*, project: Project, column_data: DataFrame, scalar_data: dict[str, Any] | DataFrame | None = None, display_name: str | None = None, index: list[str] | None = None, run_schema: RunSchema | None = None, metadata: dict[str, str] | None = None, baseline: Run | RunQuery | None = None, include_tags: list[str] | None = None, exclude_tags: list[str] | None = None, require_tags: list[str] | None = None, metrics: Sequence[Metric] | None = None) → Run

Create a new Run, report results to it, and close it. Start a TestSession with the given inputs. If metrics are provided, they are evaluated on the column data before reporting.

  • Parameters:

    • project – DBNL Project to create the Run for

    • column_data – DataFrame with the results for the columns

    • scalar_data – Dictionary or DataFrame with the scalar results to report to DBNL, defaults to None.

    • display_name – Display name for the Run, defaults to None.

    • index – List of column names that are the unique identifier, defaults to None. Only used when creating a new schema.

    • run_schema – RunSchema to use for the Run, defaults to None.

    • metadata – Additional key:value pairs user wants to track, defaults to None

    • baseline – DBNL Run or RunQuery to use as the baseline run, defaults to None. If None, the baseline defined in the TestConfig is used.

    • include_tags – List of Test Tag names to include in the Test Session

    • exclude_tags – List of Test Tag names to exclude in the Test Session

    • require_tags – List of Test Tag names to require in the Test Session

    • metrics – List of metrics to report with the run, defaults to None

  • Raises:

    • DBNLNotLoggedInError – DBNL SDK is not logged in

    • DBNLInputValidationError – Input does not conform to expected format

  • Returns: Run, after reporting results and closing it
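A sketch of reporting an experiment run with metrics and starting a Test Session against an explicit baseline; baseline_run is assumed to be a previously reported Run in the same project, and no LLM client is used.

import dbnl
import dbnl.eval.metrics
import pandas as pd

dbnl.login()

proj = dbnl.get_or_create_project(name="test_p1")
eval_df = pd.DataFrame({
    "prediction": ["Paris is the capital"],
    "ground_truth": ["The capital of France is Paris"],
})

# Without an eval_llm_client, only the non-LLM text metrics are returned.
metrics = dbnl.eval.metrics.text_metrics(prediction="prediction", target="ground_truth")

# Assumes an earlier Run exists in this project to serve as the baseline.
baseline_run = dbnl.get_latest_run(project=proj)

run = dbnl.eval.report_run_with_results_and_start_test_session(
    project=proj,
    column_data=eval_df,
    metrics=metrics,
    baseline=baseline_run,
)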


Quick Start

  1. Create a client to power LLM-as-judge text metrics [optional]

  2. Generate a list of metrics suitable for comparing text_A to reference text_B

  3. Use dbnl.eval.evaluate() to compute the list of metrics.

  4. Publish the augmented dataframe and new metric quantities to DBNL

import dbnl
import os
import pandas as pd
from openai import OpenAI
from dbnl.eval.llm import OpenAILLMClient
from dbnl.eval import evaluate

# 1. create client to power LLM-as-judge metrics
base_oai_client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
oai_client = OpenAILLMClient.from_existing_client(base_oai_client, llm_model="gpt-3.5-turbo-0125")

eval_df = pd.DataFrame(
    [
        { "prediction":"France has no capital",
          "ground_truth": "The capital of France is Paris",},
        { "prediction":"The capital of France is Toronto",
          "ground_truth": "The capital of France is Paris",},
        { "prediction":"Paris is the capital",
          "ground_truth": "The capital of France is Paris",},
    ] * 4
)

# 2. get text metrics that use target (ground_truth) and LLM-as-judge metrics
text_metrics = dbnl.eval.metrics.text_metrics(
    prediction="prediction", target="ground_truth", eval_llm_client=oai_client
)
# 3. run text metrics that use target (ground_truth) and LLM-as-judge metrics
aug_eval_df = evaluate(eval_df, text_metrics)

# 4. publish to DBNL
dbnl.login(api_token=os.environ["DBNL_API_TOKEN"])
project = dbnl.get_or_create_project(name="DEAL_testing")
cols = dbnl.util.get_column_schemas_from_dataframe(aug_eval_df)
run_schema = dbnl.create_run_schema(columns=cols)
run = dbnl.create_run(project=project, run_schema=run_schema)
dbnl.report_results(run=run, column_data=aug_eval_df)
dbnl.close_run(run=run)

You can inspect a subset of the aug_eval_df rows and, for example, one of the columns created by the metrics in the text_metrics list, llm_text_similarity_v0:

idx  prediction                         ground_truth                     llm_text_similarity_v0__prediction__ground_truth
0    France has no capital              The capital of France is Paris   1
1    The capital of France is Toronto   The capital of France is Paris   1
2    Paris is the capital               The capital of France is Paris   5

The values of llm_text_similarity_v0 qualitatively match our expectations for the semantic similarity between the prediction and ground_truth.

def evaluate(df: pd.DataFrame, metrics: Sequence[Metric], inplace: bool = False) -> pd.DataFrame:
    """
    Evaluates a set of metrics on a dataframe, returning an augmented dataframe.

    :param df: input dataframe
    :param metrics: metrics to compute
    :param inplace: whether to modify the input dataframe in place
    :return: input dataframe augmented with metrics
    """

The column names of the metrics in the returned dataframe include the metric name and the columns that were used in that metric's computation.

For example, the metric named llm_text_similarity_v0 becomes llm_text_similarity_v0__prediction__ground_truth because it takes as input both the column named prediction and the column named ground_truth.

How-To / FAQ

What if I do not have an LLM service to run LLM-as-judge metrics?

No problem, just don’t include an eval_llm_client or an eval_embedding_client argument in the call(s) to the evaluation helpers. The helpers will automatically exclude any metrics that depend on them.

What if I do not have ground-truth available?

No problem. You can simply remove the target argument from the helper. The metric set helper will automatically exclude any metrics that depend on the target column being specified.

There is an additional helper that can generate a list of generic metrics appropriate for “monitoring” unstructured text columns: text_monitor_metrics(). Simply provide a list of text column names and, optionally, an eval_llm_client for LLM-as-judge metrics.

How do I create a custom LLM-as-judge metric?

You can write your own LLM-as-judge metric that uses your custom prompt. The example below defines a custom LLM-as-judge metric and runs it on an example dataframe.

You can also write a metric that uses only the prediction column and references only {prediction} in the custom prompt. An example is below:

RAG / Question Answer Example

In RAG (retrieval-augmented generation or "question and answer") applications, the high level goal is:

Given a question, generate an answer that adheres to knowledge in some corpus

However, this is easier said than done. Data is often collected at various steps in the RAG process to help evaluate which steps might be performing poorly or not as expected. This data can help understand the following:

  1. What question was asked?

  2. Which documents / chunks (ids) were retrieved?

  3. What was the text of those retrieved documents / chunks?

  4. From the retrieved documents, what was the top-ranked document and its id?

  5. What is the expected answer?

  6. What is the expected document id and text that contains the answer to the question?

  7. What was the generated answer?

Having data that answers some or all of these questions allows for evaluations to run, producing metrics that can highlight what part of the RAG system is performing in unexpected ways.

The short example below demonstrates what a dataframe with rich contextual data looks like and how to use dbnl.eval to generate relevant metrics.

You can inspect a subset of the aug_eval_df rows and examine, for example, the metrics related to retrieval and answer similarity.

We can see the first result (idx = 0) represents a complete failure of the RAG system. The relevant documents were not retrieved (mrr = 0.0) and the generated answer is very dissimilar from the expected answer (answer_similarity = 1).

The second result (idx = 1) represents a better response from the RAG system. The relevant document was retrieved, but ranked lower (mrr = 0.33333), and the answer is somewhat similar to the expected answer (answer_similarity = 3).

The final result (idx = 2) represents a strong response from the RAG system. The relevant document was retrieved and top ranked (mrr = 1.0), and the generated answer is very similar to the expected answer (answer_similarity = 5).

The signature for question_and_answer_metrics() highlights its adaptability. Again, the optional arguments are not required and the helper will intelligently return only the metrics that depend on the info that is provided.

To use dbnl.eval, you will need to install the extra 'eval' package as described in these instructions.

The call to evaluate() takes a dataframe and metric list as input and returns a dataframe with extra columns. Each new column holds the value of a metric computation for that row.

idx  mrr__gt_reference_doc_id__top_k_retrieved_doc_ids  answer_similarity_v0__generated_answer_question_text_ground_truth_answer
0    0.0                                                 1
1    0.33333                                             3
2    1.0                                                 5
# BEFORE : default text metrics including those requiring target (ground_truth) and LLM-as-judge
text_metrics = dbnl.eval.metrics.text_metrics(
    prediction="prediction", target="ground_truth", eval_llm_client=oai_client
)

# AFTER : remove the eval_llm_client to exclude LLM-as-judge metrics
text_metrics = dbnl.eval.metrics.text_metrics(
    prediction="prediction", target="ground_truth"
)

aug_eval_df = evaluate(eval_df, text_metrics)
# BEFORE : default text metrics, including those requiring target (ground_truth) and LLM-as-judge
text_metrics = dbnl.eval.metrics.text_metrics(
    prediction="prediction", target="ground_truth", eval_llm_client=oai_client
)

# AFTER : remove the target to remove metrics that depend on that value being specified
text_metrics = dbnl.eval.metrics.text_metrics(
    prediction="prediction", eval_llm_client=oai_client
)

aug_eval_df = evaluate(eval_df, text_metrics)
# get text metrics for each column in list
monitor_metrics = dbnl.eval.metrics.text_monitor_metrics(
  ["prediction", "input"], eval_llm_client=oai_client
)

aug_eval_df = evaluate(eval_df, monitor_metrics)
import dbnl
import os
import pandas as pd
from openai import OpenAI
from dbnl.eval.llm import OpenAILLMClient
from dbnl.eval import evaluate
from dbnl.eval.metrics.mlflow import MLFlowGenAIFromPromptEvaluationMetric
from dbnl.eval.metrics.metric import Metric
from dbnl.eval.llm.client import LLMClient

# 1. create client to power LLM-as-judge metrics
base_oai_client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
oai_client = OpenAILLMClient.from_existing_client(base_oai_client, llm_model="gpt-3.5-turbo-0125")

eval_df = pd.DataFrame(
    [
        { "prediction":"France has no capital",
          "ground_truth": "The capital of France is Paris",},
        { "prediction":"The capital of France is Toronto",
          "ground_truth": "The capital of France is Paris",},
        { "prediction":"Paris is the capital",
          "ground_truth": "The capital of France is Paris",},
    ] * 4
)

# 2. define a custom LLM-as-judge metric
def custom_text_similarity(prediction: str, target: str, eval_llm_client: LLMClient) -> Metric:
    custom_prompt_v0 = """
      Given the generated text : {prediction}, score the semantic similarity to the reference text : {target}. 

      Rate the semantic similarity from 1 (completely different meaning and facts between the generated and reference texts) to 5 (nearly the exact same semantic meaning and facts present in the generated and reference texts).

      Example output, make certain that 'score:' and 'justification:' text is present in output:
      score: 4
      justification: XYZ
    """
    
    return MLFlowGenAIFromPromptEvaluationMetric(
        name="custom_text_similarity",
        judge_prompt=custom_prompt_v0,
        prediction=prediction,
        target=target,
        eval_llm_client=eval_llm_client,
        version="v0",
    )

# 3. instantiate the custom LLM-as-judge metric
c_metric = custom_text_similarity(
  prediction='prediction', target='ground_truth', eval_llm_client=oai_client
)
# 4. run only the custom LLM-as-judge metric
aug_eval_df = evaluate(eval_df, [c_metric])
def custom_text_simplicity(prediction: str, target: str, eval_llm_client: LLMClient) -> Metric:
    custom_prompt_v0 = """
      Given the generated text : {prediction}, score the text from 1 to 5 based on whether it is written in simple, easy to understand english 

      Rate the generated text from 5 (completely simple english, very commonly used words, easy to explain vocabulary) to 1 (complex english, uncommon words, difficult to explain vocabulary).

      Example output, make certain that 'score:' and 'justification:' text is present in output:
      score: 4
      justification: XYZ
    """
    
    return MLFlowGenAIFromPromptEvaluationMetric(
        name="custom_text_simplicity",
        judge_prompt=custom_prompt_v0,
        prediction=prediction,
        target=target,
        eval_llm_client=eval_llm_client,
        version="v0",
    )
import dbnl
import os
import pandas as pd
from openai import OpenAI
from dbnl.eval.llm import OpenAILLMClient
from dbnl.eval.embedding_clients import OpenAIEmbeddingClient
from dbnl.eval import evaluate

# 1. create client to power LLM-as-judge and embedding metrics [optional]
base_oai_client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
eval_llm_client = OpenAILLMClient.from_existing_client(base_oai_client, llm_model="gpt-3.5-turbo-0125")
eval_embd_client = OpenAIEmbeddingClient.from_existing_client(base_oai_client, embedding_model="text-embedding-ada-002")

eval_df = pd.DataFrame(
    [
        {
         "question_text": "Is the protein Cathepsin secreted?",
         "top_k_retrieved_doc_texts": ["Some irrelevant document that the rag system retrieved"],    
         "top_k_retrieved_doc_ids":   ["4123"],
         "top_retrieved_doc_text":    "Some irrelevant document that the rag system retrieved",
         "gt_reference_doc_id": "1099",
         "gt_reference_doc_text": "The protein Cathepsin is known to be secreted",
         "ground_truth_answer": "Yes, Cathepsin is a secreted protein",
         "generated_answer":"I have no relevant knowledge",},
        {
         "question_text": "Is the protein Cathepsin secreted?",
         "top_k_retrieved_doc_texts": ["Some irrelevant document that the rag system retrieved", 
                                       "Many proteins are secreted such as hormones, enzymes, toxins",
                                       "The protein Cathepsin is known to be secreted"],    
         "top_k_retrieved_doc_ids":   ["4123","21","1099"],
         "top_retrieved_doc_text":    "Some irrelevant document that the rag system retrieved",
         "gt_reference_doc_id": "1099",
         "gt_reference_doc_text": "The protein Capilin is known to be secreted",
         "ground_truth_answer": "Yes, Cathepsin is a secreted protein",
         "generated_answer":"Many proteins are known to be secreted",},
        {
         "question_text": "Is the protein Cathepsin secreted?",
         "top_k_retrieved_doc_texts": ["The protein Cathepsin is known to be secreted", 
                                       "Some irrelevant document that the rag system retrieved"],    
         "top_k_retrieved_doc_ids":   ["1099","4123"],
         "top_retrieved_doc_text":    "The protein Cathepsin is known to be secreted",
         "gt_reference_doc_id": "1099",
         "gt_reference_doc_text": "The protein Cathepsin is known to be secreted",
         "ground_truth_answer": "Yes, Cathepsin is a secreted protein",
         "generated_answer":"Yes, cathepsin is a secreted protein",},
    ] * 4
)

# 2. get text metrics appropriate for RAG / QA systems
qa_text_metrics = dbnl.eval.metrics.question_and_answer_metrics(
  prediction="generated_answer", input="question_text", target="ground_truth_answer",
  context="top_k_retrieved_doc_texts", top_retrieved_document_text="top_retrieved_doc_text",
  retrieved_document_ids="top_k_retrieved_doc_ids", ground_truth_document_id="gt_reference_doc_id",
  eval_llm_client=eval_llm_client, eval_embedding_client=eval_embd_client
)
# 3. run qa text metrics
aug_eval_df = evaluate(eval_df, qa_text_metrics)

# 4. publish to DBNL
dbnl.login()
project = dbnl.get_or_create_project(name="RAG_demo")
cols = dbnl.util.get_column_schemas_from_dataframe(aug_eval_df)
run_schema = dbnl.create_run_schema(columns=cols)
run = dbnl.create_run(project=project, run_schema=run_schema)
dbnl.report_results(run=run, column_data=aug_eval_df)
dbnl.close_run(run=run)


def question_and_answer_metrics(
    prediction: str,
    target: Optional[str] = None,
    input: Optional[str] = None,
    context: Optional[str] = None,
    ground_truth_document_id: Optional[str] = None,
    retrieved_document_ids: Optional[str] = None,
    ground_truth_document_text: Optional[str] = None,
    top_retrieved_document_text: Optional[str] = None,
    eval_llm_client: Optional[LLMClient] = None,
    eval_embedding_client: Optional[EmbeddingClient] = None,
) -> list[Metric]:
    """
    Returns a set of metrics relevant for a question and answer task.

    :param prediction: prediction column name (i.e. generated answer)
    :param target: target column name (i.e. expected answer)
    :param input: input column name (i.e. question)
    :param context: context column name (i.e. document or set of documents retrieved)
    :param ground_truth_document_id: ground_truth_document_id containing the information in the target
    :param retrieved_document_ids: retrieved_document_ids containing the full context
    :param ground_truth_document_text: text containing the information in the target 
                                       (ideal is for this to be the top retrieved document)
    :param top_retrieved_document_text: text of the top retrieved document
    :param eval_llm_client: eval_llm_client
    :param eval_embedding_client: eval_embedding_client
    :return: list of metrics
    """

DBNL Eval

Release Notes

April 14, 2025 - Version 0.22.0

New Features

  • Similarity Index: Added initial computation, Likert scale support, history charts, test creation, and UI enhancements.

  • Metrics System: Introduced metric APIs, creation forms, and computation jobs

  • UI/UX:

    • Key Insights and summary detail view improvements

    • Summary chips, tooltips, sortable tables

    • Various UI and UX improvements, including column metrics pages and test creation shortcuts

  • Schema & Typing:

    • Parametrized type handling in UI

    • Improved type system: to_json_value, nullability, JSON unnesting

    • Schema unification

Improvements

  • Helm & Dependency Management:

    • Updated helm charts and lock files

    • Repinned/upgraded Python and JS dependencies (e.g., alembic, ruff, identify)

  • UI/UX:

    • Improved Summary Tab and Test Session views

    • Fixed overlay defaults, sorted metrics table, and chart tooltips

    • Responsive layout tweaks and consistent styling

  • General improvements:

    • Code quality, cleanup (deprecated / legacy code), and improved organization

Bug Fixes

  • UI/UX: Fixed navigation issues, loading states, pagination, and flaky links.

  • Infrastructure: Resolved Helm/chart/tagging issues, GitHub Actions bugs, and sandbox setup problems.

  • Testing: Addressed test failures and integration test inconsistencies.

SDK Updates

  • Runs can use either a RunConfig or a RunSchema (defaults to RunSchema when inferred)

    • Integrated metrics into the RunSchema object

    • RunConfig deprecated for future releases

  • Enabled metric creation and deletion

  • Added wait_for_run_close utility (now default behavior)

  • Improved command-line feedback and error handling (e.g., Docker not running)

  • Removed support for legacy types and deprecated DBNL_API_HOST

  • Adjusted version bounds for numpy and spacy

  • Fixed multiple issues with publishing wheels and builds

  • Improved SDK integration tests (including wait_for_close)

  • Cleaned up comments and enhanced docstrings

Mar 21, 2025 - Version 0.21.1

This patch release adds a critical bug fix to the sandbox authentication flow.

Bug Fixes

  • Fix a bug with the sandbox authentication flow that resulted in credentials being considered invalid

Mar 17, 2025 - Version 0.21.0

This release adds a sandbox environment for the dbnl platform that can be deployed to a single machine. Contact us for access!

New Features

  • New Sandbox deployment option

  • Support for RunSchema on Run creation in the API

Improvements

  • Add install option to set dev tokens expiration policy

  • Link to versioned documentation from the UI

  • Terms of service updates

  • Remove link to "View Test Analysis" page

Bug Fixes

  • [UI] Only allow closed runs when selecting a default run for comparison in UI

  • [UI] Allow selecting more than 10 columns in test session summary page

  • [SDK] Fixed default namespace in URL for project upon get_or_create

SDK Updates

  • New SDK CLI interface

  • Changes in minimum and maximum versions for some libraries (pyarrow, numpy, spacy)

Feb 18, 2025 - Version 0.20.0

This release adds a number of new features, improvements, and bug fixes.

New Features

  • New Test History feature in Test Spec detail page that enables users to understand a single test's behavior over time, including its recalibration history

  • Added support for Slack notifications in UI

Improvements

  • Close runs in the UI

  • Viewable dbnl version in sidebar

  • UI performance improvements

  • Update color of links

  • Preserve input expressions in test spec editor

  • Extend scope of Project export

    • Export Tags by name for Project export

    • Better error messages for Project import

  • Improve namespace support for multi-org users

  • Miscellaneous package updates

  • Validate results on run close

  • /projects redirects to home page

Bug Fixes

  • Fix broken pagination

  • Fix broken Histogram title

  • Fix Results Table rendering issue for some Tests

  • Fix support for decimal values in assertion params

  • Fix rendering of = and != Assertions

  • Test spec editor navigation bugfix

New SDK Version

  • Check compatibility with API version

  • Add support for double and long values

  • Improved errors for invalid API URL configuration

  • Remove en-core-web-sm from requirements to enable PyPI support

  • Updated helm-charts for on-prem

Jan 13, 2025 - Version 0.19.0

Highlights in this version:

  • Improvements to the project import/export feature including support for notifications.

  • Support for new versioning and release process.

  • Better dependency management.

  • Too many bug fixes and UX improvements to list in detail.

This release includes several new features that allow users to more easily view and diagnose behavioral shifts in their test sessions. Check out the Similarity Index, our way of quantifying drift, and create new tests based on the key insights we surface!