Customize how you want to be alerted to new Runs, new Test Sessions, and high severity failures
Distributional analyzes your data and executes your tests, but we also want to alert you to the status of your tests and the health of your app as new data arrives. Our notifications suite lets you receive updates about your app's health however you prefer.
Notifications are configured as part of a Project, as seen in the screenshot above. When a test session fails by some predefined margin (e.g., more than 20% of tests failing), members of your team can receive an alert, such as a PagerDuty notification. Or, if certain tests concern only certain members of your team, notifications for those members can be limited to failures of just those tests.
Learn more about notifications.
Resources in the dbnl platform are organized using organizations and namespaces.
An organization, or org for short, corresponds to a dbnl deployment.
Some resources, such as users, are defined at the organization level. Those resources are sometimes referred to as organization resources or org resources.
A namespace is a unit of isolation within a dbnl organization.
Most resources, including projects and their related resources, are defined at the namespace level. Resources defined within a namespace are only accessible within that namespace, providing isolation between namespaces.
All organizations include a namespace named default. This namespace cannot be modified or deleted.
By default, users are assigned the namespace reader role in the default namespace.
To switch namespace, use the namespace switcher in the navigation bar.
To create a namespace, go to ☰ > Settings > Admin > Namespaces and click the + Create Namespace button.
Creating a namespace requires having the org admin role.
Why AI Testing Matters
Unlike traditional software applications, AI applications exhibit continuously evolving behavior due to factors such as updates to language models, shifts in context, data drift, and changes in prompt engineering or system integrations. Changing behavior is of great concern to almost any organization building these applications, and is something that they constantly need to address.
If left unaddressed, this creates two core challenges: defining expected behavior and responding when behavior changes.
First, organizations struggle to define what "good" looks like for their AI applications. Current metrics fail to capture the full scope of desired behavior because:
Context Matters – AI behavior differs across applications, use cases, and environments.
Metrics Have Limits – No single set of measurements can fully capture performance across all dimensions.
Evolving Requirements – Business needs, regulations, and user expectations shift over time, requiring continuous refinement.
Second, when an application deviates from expected behavior, organizations struggle with:
Pinpointing Root Causes – Identifying why behavior changed can be complex.
Adapting vs. Fixing – Deciding whether to refine behavior definitions or modify the application itself.
Prioritizing Solutions – Knowing where to start when addressing issues.
Ensuring Effectiveness – Validating that changes lead to meaningful improvements.
This uncertainty often prevents AI applications from seeing the light of day. For those that reach production, teams lack the tools to effectively monitor and maintain their performance.
Distributional helps you understand and track how your AI application behaves in two main ways:
First, it watches your application's inputs and outputs over time, building a picture of "expected" behavior. Think of it like your AI application's fingerprint: when that fingerprint changes, Distributional notices and alerts you. If something goes wrong, it shows you exactly what changed and helps you figure out why, making it easier to fix problems quickly.
Second, the system keeps track of everything it learns about your application's behavior, saving these insights for future use. Organizations can apply what they learn from one AI application to similar ones, helping them test and improve new applications more quickly. This creates a consistent way to test AI applications across an organization and helps maximize your AI application's uptime.
Essential Components of AI Testing
An AI Testing strategy can, and often does, include popular reliability modalities such as model evaluation and traditional monitoring of summary statistics.
Evaluation frameworks are an important tool in the development phase of an AI application and can provide insights such as "based on X metric, model A is better than model B on this test set." In development, the primary role of evaluations is to ensure that the application is performing at an acceptable level. After the app is deployed to production, online evaluations can be performed on production data to check how key performance metrics vary over time. Evaluation frameworks provide a set of off-the-shelf metrics that you can use to perform evaluations; however, many people choose to write custom evaluations related to business metrics or other objectives specific to their use case.
Observability/monitoring tools log and can respond to events in production. By nature, observability tools catch issues after the fact, treating users as tests, and often miss subtle degradations. Most observability tools look at summary statistics (e.g., P90, P50, accuracy, F1-score, BLEU, ROUGE) over time, but do not provide insight into the underlying distributions of those metrics or how those distributions change over time.
In an AI Testing framework, you can use eval metrics as the basis for tests by setting thresholds and alerting when a test, or some set of tests, fails. With Distributional in particular, you can bring your own custom evals, use eval metrics from an open source library, or use the off-the-shelf eval metrics provided by the platform.
Tests can also evaluate more abstract concepts, such as whether your app behaves similarly to the day (or week) before.
Why we need to test distributions
AI testing requires a very different approach than traditional software testing. The goal of testing is to enable teams to define a steady baseline state for any AI application, and through testing, confirm that it maintains steady state, and where it deviates, figure out what needs to evolve or be fixed to reach steady state once again. This process needs to be discoverable, logged, organized, consistent, integrated and scalable.
AI testing needs to be fundamentally reimagined to include statistical tests on distributions of quantities to detect meaningful shifts that warrant deeper investigation.
Distributions > Summary Statistics: Instead of only looking at summary statistics (e.g. mean, median, P90), we need to analyze distributions of metrics, over time. This accounts for the inherent variability in AI systems while maintaining statistical rigor.
Why is this useful? Imagine you have an application that contains an LLM and you want to make sure that the latency of the LLM remains low and consistent across different types of queries. With a traditional monitoring tool, you might be able to easily monitor P90 and P50 values for latency. P50 represents the latency value below which 50% of the requests fall and will give you a sense of the typical (median) response time that users can expect from the system. However, the P50 value for a normal distribution and bimodal distribution can be the same value, even though the shape of the distribution is meaningfully different. This can hide significant (usage-based or system-based) changes in the application that affect the distribution of the latency scores. If you don’t examine the distribution, these changes go unseen.
Consider a scenario where the distribution of LLM latency started with a normal distribution, but due to changes in a third-party data API that your app uses to inform the response of the LLM, the latency distribution becomes bimodal, though with the same median (and P90 values) as before. What could cause this? Here’s a practical example of how something like this could happen. The engineering team of the data API organization made an optimization to their API which allows them to return faster responses for a specific subset of high value queries, and routes the remainder of the API calls to a different server which has a slower response rate.
The effect on your application is that half of your users now experience improved latency, while a large number of users experience "too much" latency, creating an inconsistent performance experience across users. Solutions to this particular example include modifying the prompt, switching the data provider to a different source, formatting the information that you send to the API differently, or a number of other engineering changes. If you are not concerned about the shift and can accept the new steady state of the application, you can also choose to make no changes and declare a new acceptable baseline for the latency P50 value.
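To make the point concrete, here is a small, self-contained sketch (plain Python, not part of the Distributional platform; the latency numbers are invented for illustration) showing how two latency distributions can share nearly the same median while a test on the distributions still flags the shift:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Baseline: unimodal latencies centered around 200ms
baseline = rng.normal(loc=200, scale=20, size=5000)

# Experiment: bimodal latencies (a fast path and a slow path) with a similar median
experiment = np.concatenate([
    rng.normal(loc=140, scale=15, size=2500),
    rng.normal(loc=260, scale=15, size=2500),
])

print(np.median(baseline), np.median(experiment))  # medians are nearly identical
print(stats.ks_2samp(baseline, experiment))        # the KS test flags the shape change
```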
Key Stages in the AI Development Lifecycle
The AI Software Development Life Cycle (SDLC) differs from the traditional SDLC in that organizations typically progress through these stages in cycles, rather than linearly. A GenAI project might start with rapid prototyping in the Build phase, circle back to Explore for data analysis, then iterate between Build and Deploy as the application matures.
The unique challenges of AI systems require a specialized approach to testing throughout the entire life cycle. Testing happens across four key stages, each addressing different aspects of AI behavior: Explore, Build, Deploy, and Observe.
Explore - The identification and isolation of motivating factors (often business factors) which have inspired the creation/augmentation of the AI-powered app.
Build - Iteration through possible designs and constraints to produce something viable.
Deploy - The process of converting the developed AI-powered app into a service which can be deployed robustly.
Observe - Continual review and analysis of the behavior/health of the AI-powered app, including notifying interested parties of discordant behavior and motivating new Build goals. Without continuous feedback from the Observe phase, there's a substantial risk that the application does not behave as expected - for example, the output of an LLM could provide nonsensical or incorrect responses. Additionally, the performance of the application can degrade over time as the distributions of the input data shift.
Existing AI reliability methods - such as evaluations and monitoring - play crucial roles in the AI SDLC. However, they often focus on narrow aspects of reliability. AI testing, on the other hand, encompasses a more comprehensive approach, ensuring models behave as expected before, during, and after deployment.
Distributional is designed to help continuously ask and answer the question “Is my AI-powered app performing as desired?” In pursuit of this goal, we consider testing at three different elements of this lifecycle.
Production testing: Testing actual app usages observed in production to identify any unsatisfactory or unexpected app behavior.
This occurs during the Observe step.
Deployment testing: Testing the app currently deployed with a fixed (a.k.a. golden) dataset to identify any nonstationary behavior in components of the app (e.g., a third-party LLM).
This occurs during the Deploy step and in the transition to the Observe step.
Development testing: Testing new app versions on a chosen dataset to confirm that bugs have been fixed or improvements have been implemented.
This occurs between Build and Deploy.
Getting Access
To gain access to the Distributional platform, please reach out to our team. We’ll guide you through the process and ensure you have everything you need to get started.
What's in a test?
Tests are the core mechanism for asserting acceptable app behavior within Distributional. In this section, we introduce the necessary tools and explain how they can be used -- either with guidance from Distributional or by users based on unique needs.
At its core, a Test is a combination of a Statistic, derived from a run, which defines some behavior of the run, and an Assertion which enumerates the acceptable values of that statistic (and, thus, the acceptable behavior of a run). The run under consideration in a test is referred to as the Experiment run; often, there will also be a Baseline run used for comparison.
Test Tags can be created and applied to tests to group them together for, among other purposes, signaling the shared purpose of several tests (e.g., tests for text sentiment).
When a group of tests is executed on a chosen Experiment and Baseline Run, the output is a Test Session containing which assertions Pass and Fail.
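As a purely conceptual illustration (this is plain Python, not the dbnl API or expression language; the column name and threshold are invented), a test boils down to computing a Statistic over an Experiment run and checking an Assertion against it:

```python
import pandas as pd

# Hypothetical run data: one row per app usage
experiment_run = pd.DataFrame({"latency_ms": [105, 140, 90, 125, 118, 110]})

statistic = experiment_run["latency_ms"].quantile(0.90)  # the Statistic
assertion_passed = statistic < 150                       # the Assertion

print(f"p90 latency = {statistic:.1f} ms -> {'PASS' if assertion_passed else 'FAIL'}")
```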
Understanding AI Testing
Unlike traditional software applications that follow a straightforward input-to-output path, AI applications present unique testing challenges. A typical AI system combines multiple components like search systems, vector databases, LLM APIs, and machine learning models. These components work together, creating a more complex system than traditional software.
Traditional software testing works by checking if a function produces an expected output given a specific input. For example, a payment processing function should always calculate the same total given the same items and tax rate. However, this approach falls short for AI applications for three key reasons, as illustrated in the diagram:
First, AI applications are multi-component systems where changes in one part can affect others in unexpected ways. For instance, a change in your vector database could affect your LLM's responses, or updates to a feature pipeline could impact your machine learning model's predictions.
Second, AI applications are non-stationary, meaning their behavior changes over time even if you don't change the code. This happens because the world they interact with changes - new data comes in, language patterns evolve, and third-party models get updated. A test that passes today might fail tomorrow, not because of a bug, but because the underlying conditions have shifted.
Third, AI applications are non-deterministic. Even with the exact same input, they might produce different outputs each time. Think of asking an LLM the same question twice - you might get two different, but equally valid, responses. This makes it impossible to write traditional tests that expect exact matches.
Introduction to Distributional AI Testing Platform
At Distributional, we make AI testing easy, so you can build with confidence. Here’s how it works:
✅ Connect to your existing data sources.
🔄 Run automated tests on a regular schedule.
📢 Get alerts when your AI application needs your attention.
Simple, seamless, and built for peace of mind. Let's help you improve your AI uptime.
When do you need AI Testing? To get a sense of what testing could look like in practice, here are some questions that you can answer through AI Testing across the AI Software Development Lifecycle:
How well do my evals map to production behavior?
When do I increase coverage in my golden dataset to address edge cases?
If something changes in one component of my application, how do I assess the cascading impact to other dependent components?
Are end-users catching issues that my evals miss?
How do I compare new application updates to prior ones in my production environment?
What’s the impact to behavior if the model, data, or usage shifts?
Do I know when shifts happen?
How do I understand what’s causing anomalous behavior?
What happens if I switch to another LLM?
How do I give other teams visibility into shifts that happen?
What’s the impact of pushing any change to my application? Am I able to push changes to production or is it a new dev cycle?
If you are interested in finding the answers to the above, Distributional can help. The Distributional platform provides a standardized approach to AI testing across all of your applications.
Ready to start using Distributional? Head straight to Getting Started to get set up on the platform and start testing your AI application.
If you would first like to learn more about the Distributional platform, head over to the Learning About Distributional section. If you are new to AI Testing or would like to know how it fits in to the AI Software Development Cycle, continue forward in the Introduction to AI Testing section, starting with the Motivation and What is AI Testing pages.
Key Distributional concepts and their role relative to your app
Distributional testing requires more information than standard deterministic testing to address the motivating bullets (nondeterminism, nonstationarity, interactions) described earlier. Each time you want to measure the behavior of the app, Distributional wants you to:
Record outcomes at all of the app’s components, and
Push a distribution of inputs through the app to study behavior across the full spectrum of possible app usage.
The inputs, outputs, and outcomes associated with a single app usage are grouped in a Result, with each value in a result described as a Column. The group of results that are used to measure app behavior is called a Run. To determine if an app is behaving as expected, you create a Test, which involves statistical analysis on one or more runs. When you apply your tests to the runs that you want to study, you create a Test Session, which is a permanent record of the behavior of an app at a given time.
Distributional can automate your Production testing process
Production testing is the core tool in the Distributional toolkit – it provides the ability to continually state whether your app is performing as desired.
To make Production testing easy to start using, we have developed some automated Production test creation capabilities. Using sample data that you upload to our system, we can craft tests which help determine if there has been a shift in behavior for your app.
The Auto-Generate Tests button takes you to the modal below, which allows you to choose the columns on which you want to test for app consistency.
You also have the opportunity to set a Baseline Run Query, which gives you the freedom to define how far into the past you want to look to define “consistency”. After setting the Baseline, your tests will utilize it by default; you can override it at test creation time. Over time, your project will graphically render the health of your app, as shown below.
You may later deactivate or recalibrate any of the automatically generated tests to adapt them to your needs. In particular, we recommend that you review the first several test sessions in a new Project and recalibrate based on observed behavior. If you believe that no significant difference exists between tested runs, you should recalibrate the generated tests to pass.
Incorporate your expertise alongside our automated tests
While we expect everyone to use our automated Production test creation capabilities, we also recognize that many users bring significant expertise to their testing. Distributional provides a suite of test creation tools and patterns to match your needs.
Most tests in dbnl are statistically-motivated to better measure and study fundamentally nondeterministic AI-powered apps. Means, percentiles, Kolmogorov-Smirnov, and other statistical entities are provided to allow you to study the behavior of your app as you would like. We provide templates to guide you through our suggested testing strategies.
The manual test creation process can be configured in any of three locations: in the main web UI test configuration page, through shortcuts to the test drawer scattered throughout the web UI, or through the SDK. This gives you the flexibility to systematically control test generation, or quickly respond to insights for which you would like to test in the future.
The core testing capability is supplemented by the ability to define filters on tests. These filters empower you to test for consistency within subsets of your user base. Filtering also allows you to test for bias in your app and help triage cases of undesired behavior.
Learn more about how to create your own tests here.
Below is a basic working example that highlights the SDK workflow. If you have not yet installed the SDK, follow these instructions.
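A minimal sketch of that workflow is shown below. dbnl.login and create_test_session appear elsewhere in these docs; the other helper names (create_project, create_run, report_results, close_run) and their signatures are assumptions, so check the SDK reference for the exact API.

```python
import os

import dbnl
import pandas as pd

# Authenticate using the environment variables described in the setup section.
dbnl.login(api_token=os.environ["DBNL_API_TOKEN"])

# NOTE: the helper names and signatures below are illustrative assumptions.
project = dbnl.create_project(name="qa-summarizer")

# One row per app usage: inputs, outputs, and any useful context.
results = pd.DataFrame({
    "question": ["What is the capital of Canada?", "Summarize today's earnings call."],
    "answer": ["Ottawa is the capital of Canada.", "Revenue grew 12% year over year."],
    "latency_ms": [412.0, 873.0],
})

run = dbnl.create_run(project=project)      # structure is described by a RunConfig
dbnl.report_results(run=run, data=results)  # upload the results dataframe
dbnl.close_run(run=run)                     # finalize the run so it can be tested

# Execute the project's tests with this run as the experiment.
dbnl.create_test_session(experiment_run=run)
```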
This is how you test when you are a dbnl expert
When you start work with Distributional (dbnl), you should focus on creating and executing Production tests to ask and answer the question "Is my AI-powered app behaving as desired?" As you gain more confidence using dbnl, the full pattern of standard dbnl usage looks as follows:
Execute Production testing at a regular interval (e.g., nightly) on recent app usages
Execute Deployment testing at a regular interval (e.g., weekly) on a fixed dataset
Review Production test sessions to start triage of any concerning app behavior
This could be triggered by test failures, DBNL-automated guidance, or manual investigation
As concerning app behavior is identified, trigger Deployment tests to help diagnose any app component nonstationarity
If nonstationarity is present, start Development testing on the affected 3rd party components
If the components are stationary, start Development testing on the observed app examples which show unsatisfactory behavior
Once suitable app behavior is recorded in Development testing, record a new baseline for future Deployment testing on the fixed dataset
Push the new app version into production and update the Deployment + Production testing process
Test sessions provide an opportunity to learn about your app's behavior
After you define and execute tests, dbnl creates a Test Session to mark a permanent statement of how your app behaved relative to expectations. You can explore this Test Session, as well as the Runs that were tested, to learn about your app -- both its behavior and about how users interacted with it.
The basic review process is available for all tests, including any tests you may have created with expert knowledge. In this section, we highlight key insights that can be gained from the automated Production tests, insights from elsewhere in the Distributional UI, and how you can be notified about your Test Session status in whatever fashion you prefer.
Data goes in, insights come out
Distributional runs on data, so we make it easy for you to:
Push data into Distributional
Augment your unstructured data, such as text, to facilitate testing
Organize your data to enable more insightful testing
Review data that has been sent to Distributional
Distributional runs on data, but our goal is to enable you to operate on data you already have available. If you are using a golden dataset to guide your development, we want you to use it to power your Development and Deployment testing. If you have actual Question-Answer pairs from production sitting in your data warehouse, we recommend that you execute Production testing on that data to continually assert that your app is not misbehaving.
Generally, data is organized on Distributional in the form of a parquet file full of app usages, e.g., the prompts and summaries observed in the last 24 hours. This data is then packaged up and shipped to Distributional’s API, primarily through our SDK. This could include any contextual information that can help determine if the app is behaving as desired, such as the day of the week.
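For instance, a nightly job might load the day's usages from parquet and attach simple context columns before upload; the file name and columns below are illustrative, not a required schema:

```python
import pandas as pd

# Illustrative export from a nightly job: one row per app usage (e.g., prompt, summary).
usages = pd.read_parquet("app_usages_2024-05-01.parquet")

# Contextual columns can make later triage easier (e.g., day of week).
usages["day_of_week"] = pd.Timestamp("2024-05-01").day_name()

print(usages.head())
```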
Prior to shipping the data to dbnl, the dbnl Eval library can be used to augment your data (especially text data) with additional columns for a more complete testing experience.
Distributional runs on data, but we know that your data is incredibly valuable and sensitive. This is why Distributional is built to be deployed in your private cloud. We have AWS and GCP installations today and are working towards an Azure installation. There is also a SaaS version available for demonstrations or proofs of concept.
Discover how dbnl manages user permissions through a layered system of organization and namespace roles—like org admin, org reader, namespace admin, writer, and reader.
A user is an individual who can log into a dbnl organization.
Permissions are settings that control access to operations on resources within a dbnl organization. Permissions are made up of two components.
Resource: Defines which resource is being controlled by this permission (e.g. projects, users).
Verb: Defines which operations are being controlled by this permission (e.g. read, write).
For example, the projects.read
permission controls access to the read operations on the projects resource. It is required to be able to list and view projects.
A role consists of a set of permissions. Assigning a role to a user gives the user all the permissions associated with the role.
Roles can be assigned at the organization or namespace level. Assigning roles at the namespace level allows for giving users granular access to projects and their related data.
An org role is a role that can be assigned to a user within an organization. Org role permissions apply to resources across all namespaces.
There are two default org roles defined in every organization.
The org admin role has read and write permissions for all org level resources making it possible to perform organization management operations such as creating namespaces and assigning users roles.
By default, the first user in an org is assigned the org admin role.
The org reader role has read-only permissions to org level resources making it possible to navigate the organization by listing users and namespaces.
By default, all users are assigned the org reader role.
To assign a user an org role, go to ☰ > Settings > Admin > Users, scroll to the relevant user and select an org role from the dropdown in the Org Role column.
Assigning a user an org role requires having the org admin role.
A namespace role is a role that can be assigned to a user within a namespace. Namespace role permissions only apply to resources defined within the namespace in which the role is assigned.
There are three default namespace roles defined in every organization.
The namespace admin role has read and write permissions for all namespace level resources within a namespace making it possible to perform namespace management operations such as assigning users roles within a namespace.
By default, the creator of a namespace is assigned the namespace admin role in that namespace.
The namespace writer role has read and write permissions for all namespace level resources within a namespace, except for those resources and operations related to namespace management, such as namespace role assignments.
By default, all users are assigned the namespace writer role in the default namespace.
The namespace reader role has read-only permissions for all namespace level resources within a namespace.
This is an experimental role that is available through the API, but is not currently fully supported in the UI.
To assign a user a namespace role within a namespace, go to ☰ > Settings > Admin > Namespaces, scroll and click on the relevant namespace and then click + Add User.
Assigning a user a namespace role requires having the org admin role or the namespace admin role in that namespace.
The objects needed to define a Run, the core data structure in DBNL
The core object for recording and studying an app’s behavior is the run, which contains:
a table where each row holds the outcomes from a single app usage,
structural information about the components of the app and how they relate, and
user-defined metadata for remembering the context of a run.
Runs live within a selected project, which serves as an organizing tool for the runs created for a single app.
The structure of a run is defined by its Run Configuration, or RunConfig. This informs dbnl about what information will be stored in each result (the columns) and how the app is organized (the components). A Component is a mechanism for grouping columns based on their role within the app; this information is stored in the run configuration.
Using the row_id functionality within the run configuration, you also have the ability to designate Unique Identifiers – specific columns which uniquely identify matching results between runs. Adding this information enables specific tests of individual result behavior.
The data associated with each run is passed to dbnl through the SDK as a pandas dataframe.
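As a hedged sketch of what this looks like in practice (the create_run_config helper name, its parameters, and the field shapes are assumptions; only the concepts of columns, components, and row_id come from this section):

```python
import dbnl

# Illustrative only -- consult the SDK reference for the exact RunConfig API.
run_config = dbnl.create_run_config(
    columns=[
        {"name": "question", "type": "string", "component": "Retriever"},
        {"name": "answer", "type": "string", "component": "Generator"},
        {"name": "latency_ms", "type": "float", "component": "Generator"},
    ],
    components_dag={"Retriever": ["Generator"], "Generator": []},  # how components relate
    row_id=["question"],  # unique identifier used to match results between runs
)
```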
DBNL provides a software development kit (SDK) in Python to organize this data, submit it to our system, and interact with our API. In this section, we introduce key objects in the dbnl framework and describe how to use the SDK to interact with them.
Tokens are used for programmatic access to the dbnl platform.
A personal access token is a token that can be used for programmatic access to the dbnl platform through the SDK.
Tokens are not revocable at this time. Please remember to keep your tokens safe.
A personal access token has the same permissions as the user that created it. See Users and Permissions for more details about permissions.
Token permissions are resolved at use time, not creation time. As such, changing the user permissions after creating a personal access token will change the permissions of the personal access token.
To create a new personal access token, go to ☰ > Personal Access Tokens and click Create Token.
Personal access tokens are implemented using JSON Web Tokens and are not persisted. Tokens cannot be recovered if lost and a new token will need to be created.
Installing the Python SDK and Accessing Distributional UI
The dbnl SDK supports Python versions 3.9-3.12. You can install the latest release of the SDK, pin a specific release, or install the optional eval extra.
To install the latest stable release of the dbnl package, use pip; to install a specific version (e.g., version 0.19.1), pin that version in the pip command.
The dbnl.eval extra includes additional features and requires an external spaCy model. To use it, install dbnl with the eval extra (optionally pinned to a specific version, e.g., 0.19.1), and install the required en_core_web_sm pretrained English-language NLP model for spaCy.
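For reference, assuming the package is published on PyPI under the name dbnl with an eval extra as described above, the commands typically look like this:

```bash
pip install dbnl                          # latest stable release
pip install dbnl==0.19.1                  # a specific version
pip install "dbnl[eval]"                  # with the eval extra
pip install "dbnl[eval]==0.19.1"          # a specific version with the eval extra
python -m spacy download en_core_web_sm   # spaCy model required by dbnl.eval
```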
You should have already received an invite email from the Distributional team to create your account. If that is not the case, please reach out to your Distributional contact. You can access your token at https://app.dbnl.com/tokens (which will prompt you to login if you are not already).
We recommend setting your API token as an environment variable; see below.
DBNL has three reserved environment variables that it reads in before execution.
DBNL_API_TOKEN – The API token used to authenticate your dbnl account.
DBNL_API_URL – The base url of the Distributional API. For SaaS users, set this variable to api.dbnl.com. For other users, please contact your sys admin.
DBNL_APP_URL – An optional base url of the Distributional app. If this variable is not set, the app url is inferred from the DBNL_API_URL variable. For on-prem users, please contact your sys admin if you cannot reach the Distributional UI.
Run the following commands in your terminal. Make sure to wrap the API token in quotes.
To confirm that the dbnl API Token is set in your environment, run the following command and verify its output is the correct token.
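As a sketch (the token value is a placeholder; your API URL may differ if you are not on SaaS):

```bash
export DBNL_API_TOKEN="<your-personal-access-token>"
export DBNL_API_URL="api.dbnl.com"

echo $DBNL_API_TOKEN   # should print the token you just set
```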
The Scalars feature allows for the upload, storage, and retrieval of individual datums, i.e., scalars, for every Run. This is in contrast to the Columns feature, which allows for the upload of tabular data via Results.
Your production testing workflow involves the upload of results to Distributional in the form of model inputs, outputs, and expected outcomes for a machine learning model. Using these results, you can calculate aggregate metrics, for example an F1 score. The Scalars feature allows you to upload the aggregate F1 score that was calculated for the entire set of results.
Uploading Scalars is similar to uploading Results. First you must define the set of Scalars to upload in your RunConfig, for example:
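As a hedged sketch (the create_run_config helper and the field shapes are assumptions carried over from the Run section above; only the idea of declaring scalars alongside columns comes from this page):

```python
import dbnl

# Illustrative only -- exact RunConfig construction may differ in the SDK.
run_config = dbnl.create_run_config(
    columns=[
        {"name": "prediction", "type": "float"},
        {"name": "label", "type": "int"},
    ],
    scalars=[
        {"name": "f1_score", "type": "float"},  # one aggregate value per Run
    ],
)
```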
Next, create a run and upload results:
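Again as a sketch (dbnl.report_scalars is documented below; create_run and report_results are assumed helper names, and the run_config comes from the previous snippet):

```python
import pandas as pd

run = dbnl.create_run(project=project, run_config=run_config)  # assumed helper name

results = pd.DataFrame({"prediction": [0.91, 0.12, 0.78], "label": [1, 0, 1]})
dbnl.report_results(run=run, data=results)                     # assumed helper name

# Report the aggregate F1 score computed over the full result set.
dbnl.report_scalars(run=run, data={"f1_score": 0.87})
```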
Note: dbnl.report_scalars
accepts both a dictionary and a single-row Pandas DataFrame for the data
argument.
Finally you must close the run.
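For example (assuming the helper name):

```python
dbnl.close_run(run=run)  # assumed helper name; finalizes the Run for testing
```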
Navigate to the Distributional app in your browser to view the uploaded Run with Scalars.
Tests can be defined on Scalars in the same way that tests are defined on Columns. For example, say that you would like to ensure some minimum acceptable performance criteria like "the F1 score must always be greater than 0.8". This test can be defined in the Distributional platform as follows:
assert scalar({EXPERIMENT}.f1_score) > 0.8
You can also test against a baseline Run. For example, we can write the test "the F1 score must always be greater than or equal to the baseline Run's F1 score" in Distributional as:
assert scalar({EXPERIMENT}.f1_score - {BASELINE}.f1_score) >= 0
You can view all of the Scalars that were uploaded to a Run by visiting the Run details page in the Distributional app. The Scalars will be visible in a table near the bottom of that page.
Scalars can also be downloaded using the Distributional SDK.
Scalars downloaded via the SDK are single-row Pandas DataFrames.
The Distributional expression language supports Scalars, as shown in the above examples. Scalars are identical to Columns in the expression language. When you define an expression that combines Columns and Scalars, the Scalars are broadcast to each row. Consider a Run with the following data:
When the expression {RUN}.column_data + {RUN}.scalar_data
is applied to this Run, the result will be calculated as follows:
column_data | scalar_data | column_data + scalar_data
1 | 10 | 11
2 | 10 | 12
3 | 10 | 13
4 | 10 | 14
5 | 10 | 15
Scalar broadcasting can be used to implement tests and filters that operate on both Columns and Scalars.
Tests are defined in Distributional as an assertion on a single value. The single value for the assertion comes from a Statistic.
Distributional has a special "scalar" Statistic for defining tests on Scalars. This is demonstrated in the above examples. The "scalar" statistic should only be used with a single expression input, where the result of that singular expression is a single value. The "scalar" Statistic will fail if the provided input resolves to multiple values.
Any other Statistic will reduce the input collections to a single value. In this case Distributional will treat a Scalar as a collection with a single value when computing the Statistic. For example, computing max({RUN}.my_statistic)
is equivalent to scalar({RUN}.my_statistic)
, because the maximum of a single value is the value itself.
Test Sessions are not the only place to learn about your app
Distributional’s goal is to provide you the ability to ask and answer the question “Is my AI-powered app behaving as expected?” While testing is a key component of this, triage of a failed test and inspiration for new tests can come from many places in our web UI. Here, we show insights into app behavior which are uncovered with Distributional.
Because each Run represents the recent behavior of your app, the Run Detail page is a useful source of insights about your app’s behavior. In the screenshot below, you can see:
dbnl-generated alerts regarding highly correlated columns (in depth on a separate screen),
Summary statistics for columns (along with shortcuts to create tests for any statistics of note), and
Notable behavior for columns, such as a skewed or multimodal distribution.
At the top of the Run Detail page, there are links to the Compare and Analyze pages, where you can conduct more in depth and customized analysis. You can drive your own analysis at these pages to uncover key insights about your app.
For example, after seeing a failed Test Session in a RAG (Q & A) application, you may visit the Compare page to understand the impact of adding new documents to your vector database. The image below shows a sample Compare page, which reveals a sizable decrease in the population of poorly-retrieved questions (a drop in the low-BLEU population between Baseline and Experiment).
Filtering for those columns (the screenshot below) gives a valuable insight about the impact of the extra documents. You see that, previously, the RAG app was incorrectly retrieving documents from “Liabilities and Contingencies” as well as “Asset Valuations.” Adding the new documents improved your app’s quality, and now you can confidently answer, “Yes, my app’s behavior has changed, and I am satisfied with its new behavior.”
Directing dbnl to execute the tests you want
A key part of the dbnl offering is the creation of automated Production tests. After their creation, each Test Session offers you the opportunity to Recalibrate those tests to match your expectations. For GenAI users, we think of this as the opportunity to “codify your vibe checks” and make sure future tests pass or fail as you see fit.
The previous section showed a brief snapshot of a test session to understand how your app has been performing. Our UI also provides advanced capabilities that allow you to dig deeper into our automated Production tests. Below we can see a sample Test Session with a suite of dbnl-generated tests. The View Test Analysis button lets you dig deeper into any subset of tests – in this image, we have subselected only the failed tests to learn whether anything is sufficiently concerning to justify the failures.
In the subsequent page, there is a Notable Results tab where dbnl provides a subset of app usages that we feel are the most extremely different between the Baseline and Experiment run. When you leaf through these Question/Answer pairs, you do not see anything terribly concerning, just the standard randomness of LLMs. As such, on the original page, you can choose to Recalibrate Generated Tests to pass, and you will not be alerted about similar behavior in the future.
We recommend that you always inspect the first 4-7 test sessions for a new Project. This helps ensure that the tests effectively incorporate the nondeterministic nature of your app. After those initial Recalibration actions, you can define notifications to only trigger when too many tests fail.
Your data + dbnl testing == insights about your app's behavior
Distributional uses data generated by your AI-powered app to study its behavior and alert you to valuable insights or worrisome trends. The diagram below gives a quick summary of this process:
Each app usage involves input(s), the resulting output(s), and context about that usage
Example: Input is a question about the city of Toronto; Output is your app’s answer to that question; Context is the time/day that the question was asked.
As the app is used, you record and store the usage in a data warehouse for later review
Example: At 2am every morning, an airflow job parses all of the previous day’s app usages and sends that info to a data warehouse.
When data is moved to your data warehouse, it is also submitted to dbnl for testing.
Example: The 2am airflow job is amended to include data augmentation by dbnl Eval and uploading of the resulting dbnl Run to trigger automatic app testing.
You can read more about the dbnl specific terms earlier in the documentation. Simply stated, a dbnl Run contains all of the data which dbnl will use to test the behavior of your app – insights about your app’s behavior will be derived from this data.
A dbnl Run usually contains many (e.g., dozens or hundreds) rows of inputs + outputs + context, where each row was generated by an app usage. Our insights are statistically derived from the distributions estimated by these rows.
dbnl Eval is our library that provides access to common, well-tested GenAI evaluation strategies. You can use dbnl Eval to augment data in your app, such as the inputs and outputs. Doing so produces a broader range of tests that can be run, and it allows dbnl to produce more powerful insights.
Distributional runs on data. When that data is well organized, Distributional can provide more powerful insights into your app’s behavior. One key way to do so is through defining components of your app and the organization of those components within your app.
Components are groupings of columns – they provide a mechanism for identifying certain columns as being generated at the same time or otherwise relating to each other. As part of your data submission, you can define that grouping as well as a flow of data from one component to another. This is referred to as the DAG, the directed acyclic graph.
In the example above, we see:
Data being input at TweetSource (1 column),
An EntityExtractor (2 columns) and a SentimentClassifier (4 columns) parsing that data, and
A TradeRecommender (2 columns) taking the Entities and Sentiments and choosing to execute a stock trade.
Tests were created to confirm the satisfactory behavior of the app. We can see that the TradeRecommender seems to be misbehaving, but so is the SentimentClassifier. Because the SentimentClassifier appears earlier in the DAG, we start our triage of these failed tests there.
To see more of this example, visit this tutorial.
There are several ways for dbnl users to create tests, each of which may be useful in different circumstances and for different users within the same organization.
Tests are the key tool within dbnl for asserting performance and consistency of runs. Possible goals during testing can include:
Asserting that a chosen column meets its minimum desired behavior (e.g., inference throughput);
Asserting that a chosen column has a distribution that roughly matches the baseline reference;
Asserting that no individual results have a severely divergent behavior from a baseline.
In this section, we explore the objects required for testing, methods for creating tests, suggested testing strategies, reviewing/analyzing tests, and best practices.
As Distributional continues to evolve, we have made it our mission to meet customers where they, and their data, live. As such, we are building out integrations to pull in data from common cloud storage systems including, but not limited to, Snowflake and Databricks. Please contact your dedicated engineer to discuss the status of our integrations and implementing them in your system.
An overview of data access controls.
Data for a run is split between the object store (e.g. S3, GCS) and the database.
Metadata (e.g. name, schema) and aggregate data (e.g. summary statistics, histograms) are stored in the database.
Raw data is stored in the object store.
All data accesses are mediated by the API ensuring the enforcement of access controls. For more details on permissions, see Users and Permissions.
Database access is always done through the API with the API enforcing access controls to ensure users only access data for which they have permission.
Direct object store access is required to upload or download raw run data using the SDK. Pre-signed URLs are used to provide limited direct access. This access is limited in both time and scope, ensuring only data for a specific run is accessible and that it is only accessible for a limited time.
When uploading or downloading data for a run, the SDK first sends a request for a pre-signed upload or download URL to the API. The API enforces access controls, returning an error if the user is missing the necessary permissions. Otherwise, it returns a pre-signed URL which the SDK then uses to upload or download the data.
Uploading data to a run in a given namespace requires write permission to runs in that namespace. Downloading data from a run in a given namespace requires read permission to runs in that namespace.
The base location at which a test can be created is the Test Configuration page, which is accessible from the Project Detail page.
From here, you can click Add Test to open the test creation page, which will enable you to define your test through the dropdown menu on the left side of the window.
The graphs available on the right side of the window can help guide test development as you choose the statistics you want to study and the thresholds which define acceptable behavior.
You can also create tests by clicking on the + icon button, which appears in several places:
Cells in the summary statistics tables (found on the Run Detail page, Compare pages)
Mini charts (clicking on the title and the column feature button, found on Run Detail page)
Above statistics found on Compare pages
At each of these locations, a test creation drawer will open on the right side of the page with several of the fields pre-populated based on the context of the button.
Assertions are the second half of test creation — defining what statistical values seem appropriate or aberrant
A test consists of statistical quantities that define the behavior of one or more runs and an associated threshold which determines whether those statistics define acceptable behavior. Essentially, a test passes or fails based on whether the quantities satisfy the thresholds. Common assertions include:
equal_to – Used most often to confirm that two distributions are exactly identical
close_to – Can be used to confirm that two quantities are near each other, e.g., that the 90th percentile of a column is close to its value in the baseline run
less_than – Can be used to confirm that nonparametric statistics are small enough to indicate that two distributions are suitably similar
A full list of assertions can be found at the test creation page.
Choosing a threshold can be simple, or it can be complicated. When the statistic under analysis has easily interpreted values, these thresholds can emerge naturally.
Example: The 90th percentile of the app latency must be less than 123ms.
Here, the statistic is the percentile(EXPERIMENT.latency, 90)
, and the threshold is “less than 123”.
Example: No result should have more than a 0.098 difference in probability of fraud prediction against the baseline.
Here, the statistic is max(matched_abs_diff(EXPERIMENT.prob_fraud, BASELINE.prob_fraud))
and the threshold is “less than 0.098”.
In situations where nonparametric quantities, like the scaled Kolmogorov-Smirnov statistic, are used to state whether two distributions are sufficiently similar, it can be difficult to identify the appropriate threshold. This is something that may require some trial and error, as well as some guidance from your dedicated Applied Customer Engineer. Please reach out if you would like our support in developing thresholds for these statistics.
Additionally, we are hard at work developing new strategies for adaptively learning thresholds for our customers based on their preferences. If you would like to be a part of this development and beta testing process, please inform your Applied Customer Engineer.
Tests can be programmatically created using the python SDK. Users must provide a JSON dictionary that adheres to the dbnl Test Spec and instantiates the key elements of a Test. The Testing Strategies section has sample content around creating tests using JSON. Additionally, the test creation page will soon also have the ability to convert a test from dropdowns to JSON, so that many tests of similar structure can be created in JSON after the first one is designed in the web UI.
Production testing focuses on the need to regularly check in on the health of an app as it has behaved in the real world. Users want to detect changes in either their AI app behavior or the environment that it is operated in. At Distributional, we help users to confidently answer these questions.
In the case of production testing, Distributional recommends testing the similarity of distributions of the data related to the AI app between the current Run and a baseline Run. Users can start with the auto-test generation feature to let Distributional generate the necessary production tests for a given Run.
When Distributional automatically generates the production tests, the thresholds are estimated from a single Run’s data. As a consequence, some of these thresholds may not reflect the user’s actual working condition. As the user continues to upload Runs and triggers new Test Sessions, they may want to adjust these thresholds. The recalibration feature offers a simple option to adjust these thresholds.
Recalibration solicits feedback from the user and adjusts the test thresholds. If a user wants a particular test or set of tests to pass in the future, the test thresholds will be relaxed to increase the likelihood of the test passing with similar statistics.
Click the RECALIBRATE ALL TESTS
button at the top right of the Generated Tests
table in the Test Session page.
This will take you to the Test Configuration page, where a modal will prompt you to choose whether all of the tests should pass or fail in the future.
First select the tests you want to recalibrate, then click the RECALIBRATE <#> TESTS
button under the Generated Tests tab. This will prompt the same modal, asking whether these tests should pass or fail in the future.
In order to facilitate production testing, Distributional can automatically generate a suite of Tests based on the user's uploaded data. Distributional studies the Run results data and creates the appropriate thresholds for the Tests.
In the Test Config page, click the AUTO-GENERATE TESTS
button; this will prompt a modal where the user can select the Run which the tests will be based on. Click Select All to generate tests for all columns or optionally sub-select the columns to generate tests. Click GENERATE TESTS
to generate the production tests.
Once the Tests are generated, you can view them under the Generated Tests section.
Test templates are macros for basic test patterns recommended by Distributional. They allow the user to quickly create tests from a builder in the UI. Distributional provides five classes of Test Templates.
Single Run: These are parametric statistics of a column.
Similarity of Statistics: These test if the absolute difference of a statistic of a column between two runs is less than a threshold.
Similarity of Distributions: These test whether a column from two different runs is similarly distributed, using a nonparametric statistic.
Similarity of Results: These are tests on the row-wise absolute difference of results.
Difference of Statistics: These test the signed difference of a statistic of a column between two runs.
From the Test Configuration page, click the Test Template dropdown under My Tests.
Select from one of the five options and click ADD TEST
. A Test Creation drawer will appear and the user can edit the statistic, column, and assertion that they desire. Note that each Test Template has a limited set of statistics that it supports.
Another type of test is testing if a particular statistic is similar for two different distributions. For example, this can be testing if the absolute difference of medians of sentiment score between the experiment run and the baseline run is small — that is, the scores are close to one another.
Like in traditional software testing, it is paramount to come up with a testing strategy that has both breadth and depth. Such a set of tests gives confidence that the AI-powered app is behaving as expected, or it alerts you that the opposite may be true.
To build out a comprehensive testing strategy it is important to come up with a series of assertions and statistics on which to create tests. This section explains several goals when testing and how to create tests to assert desired behavior.
A common type of test is testing whether a single distribution contains some property of interest. Generally, this means determining whether some statistic of the distribution of interest exceeds some threshold. Some examples of this can be testing the toxicity of a given LLM or the latency for the entire AI-powered application.
This is especially common for development testing, where it is important to test if a proposed app reaches the minimum threshold for what is acceptable.
After a Test Session is executed, users might want to introspect on which Tests passed and failed. In addition, they want to understand what caused a Test or a group of Tests to fail; in particular, which subset of results from the Run likely caused the Tests to fail.
Notable results are only presented for Distributional's automatically generated tests.
To review the notable results, first select the desirable set of Generated Tests that you want to study from the Test Session details page. This may include both failed and passed Tests.
Click VIEW TEST ANALYSIS
button to enter the Test Analysis page. On this page, you can review the notable results for the Experiment Run and Baseline Run under the Notable Results tab.
Instead of comparing the newest Run to a fixed Baseline Run, a user might want to dynamically shift the baseline. For example, one might want to compare the new Run to the most recently completed Run. To enable a dynamic baseline, the user needs to create a Run Query from the Test Config page.
Under the Baseline Run dropdown, select Create run query; this will prompt a Run Query modal. In the modal, you can write the name of the Run Query and select the Run offset to be used for the dynamic baseline. For example, an offset of 1 implies each new Run is tested against the previously uploaded Run.
Click SAVE
to save this Run Query. Back in the Test Config page, select the Run Query from the dropdown and click SAVE
to save this setting as the dynamic baseline.
One general approach to test if two columns are similarly distributed is using a nonparametric statistic. DBNL offers two such statistics: scaled_ks_stat
for testing ordinal distributions and scaled_chi2_stat
for testing nominal distributions.
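For example, a test asserting that a column is similarly distributed in the experiment and baseline runs might look like the following (the column name and threshold are illustrative, and the exact argument form of the statistic is an assumption):
assert scaled_ks_stat({EXPERIMENT}.sentiment_score, {BASELINE}.sentiment_score) < 0.2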
After tests are created for an associated project, there are two ways that they can be executed.
Manually via the UI on the Project Details page
Via the SDK using the create_test_session method
Another testing case is testing whether two distributions are not the same. Such a test involves the same statistics as a test of consistency, but a different assertion. One example could be to change the assertion from close_to
to greater_than
and thereby state that a passed test requires a difference bigger than some threshold.
When the results from a run have unique identifiers, one can create a special type of test for testing matching behavior at a per-result level. One example would be testing that the mean of the per-result absolute difference does not exceed a threshold value.
After the tests have been executed, what comes next?
Executing tests creates a Test Session, which summarizes the specified tests associated with the session, specifically highlighting whether each test’s assertion has passed or failed.
The Test History section of the Project Detail pages is a record of all the test sessions created over time. Each row of the Test History table represents a Test Session. You can click anywhere on a Test Session row to navigate to the Test Session Detail page where more detailed information for each test can be viewed.
You can choose to run tests associated with a project by clicking on the Run Tests button on the project details page. This button will open up a modal that allows you to specify the baseline and experiment runs as well as the tags of the tests you would like to include or exclude from the test session.
Alternatively, you can also set a Run as baseline using the function.
Executing the command for a new run in that same project will finalize the data, enabling it for use in Tests.
Execute the command, providing your new run as the "experiment". Tests will then use the previously-defined baseline
for comparisons.
Distributional allows users to apply filters on run data they have uploaded. Applying a filter selects for only the rows that match the filter criteria. The filtered rows can then be visualized or used to create tests.
We will show how filters can be used to explore uploaded run data and to build filtered tests.
Filters can be written at the top of the Compare page, which is accessible from the project detail page. Users write filters to select only the rows they wish to visualize or inspect.
Below is a list of DBNL-defined functions that can be used in filter expressions:
and: Logical AND operation of two or more boolean columns
or: Logical OR operation of two or more boolean columns
not: Logical NOT operation of a boolean column
less_than (alias lt): Computes the element-wise less-than comparison of two columns, input1 < input2
less_than_or_equal_to (alias lte): Computes the element-wise less-than-or-equal-to comparison of two columns, input1 <= input2
greater_than (alias gt): Computes the element-wise greater-than comparison of two columns, input1 > input2
greater_than_or_equal_to (alias gte): Computes the element-wise greater-than-or-equal-to comparison of two columns, input1 >= input2
equal_to (alias eq): Computes the element-wise equality comparison of two columns, input1 == input2
Here is an example of a more complicated filter that selects for rows that have their loc column equal to the string 'NY' and their respective churn_score > 0.9:
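The original expression is not reproduced here; assuming the function-call style of the DBNL-defined functions listed above, such a filter might look like the following sketch (the exact expression syntax in your deployment may differ):

```
and(eq(loc, 'NY'), greater_than(churn_score, 0.9))
```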
Use single quotes (') when filtering on string variables.
The primary mechanism for submitting data to Distributional is through our Python SDK. This section contains information about how to set up the SDK and use it for your AI testing purposes.
Click through to view all the SDK functions.
Filters can also be used to specify a sub-selection of rows in runs you would like to include in the test computation.
For example, our goal could be to create a test that asserts that, for rows where the loc column is ‘NY’, the absolute difference of means of the correct churn predictions is <= 0.2 between baseline and experiment runs.
We will walk through how this can be accomplished:
1. Navigate to the Project Detail page and click on “Configure Tests”.
2. Click Add Test on the Test Configuration page. Don't forget to also set a baseline run for automated test configuration.
3. Create the test with the filter specified on the baseline and experiment run.
Filter for the baseline Run:
Filter for the experiment Run:
4. You can now see the new test on the Test Configuration page. When new run data is uploaded, this test will run automatically, using the defined filters to sub-select the rows where the loc column equals 'NY' and comparing the new run (as experiment) against the selected baseline run.
The full Test Spec in JSON format is shown below.
Functions that interact with a dbnl Project
Authenticate dbnl SDK
Setup dbnl SDK to make authenticated requests. After login is run successfully, the dbnl client will be able to issue secure and authenticated requests against hosted endpoints of the dbnl service.
dbnl.login must be run before any other functions in the DBNL workflow.
api_token
namespace_id
Namespace ID to use for the session; available namespaces can be found with get_my_namespaces().
api_url
The base url of the Distributional API. For SaaS users, set this variable to api.dbnl.com. For other users, please contact your sys admin.
app_url
An optional base url of the Distributional app. If this variable is not set, the app url is inferred from the DBNL_API_URL variable. For on-prem users, please contact your sys admin if you cannot reach the Distributional UI.
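Putting these parameters together, a typical login call might look like the sketch below (the token and namespace values are placeholders):

```python
import dbnl

dbnl.login(
    api_token="<YOUR_DBNL_API_TOKEN>",   # or rely on the DBNL_API_TOKEN variable
    namespace_id="<YOUR_NAMESPACE_ID>",  # optional; see get_my_namespaces()
    api_url="api.dbnl.com",              # SaaS endpoint; on-prem users: ask your sys admin
)
```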
The name for the dbnl Project. Project names must be unique; an error will be raised if a Project with the same name already exists.
The API token used to authenticate your DBNL account. API tokens can be generated in the web UI. If none is provided, the DBNL_API_TOKEN variable will be used by default.
The Project to copy.
The name for the new dbnl Project. Project names must be unique; an error will be raised if a Project with the same name already exists.
The Project to export as JSON.
name
The name for the existing dbnl Project. An error will be raised if there is no Project with the given name.
The dbnl Project with the given name.
Functions related to dbnl RunConfig
Retrieve the specified dbnl Project or create a new one if it does not exist
name
description
An optional description for the dbnl Project, defaults to None. Descriptions are limited to 255 characters.
Description cannot be updated with this function.
A new Project will be created with the specified name if no Project with this name exists already. If a Project with the name already exists, the pre-existing Project will be returned.
The name for the dbnl Project. A new Project will be created with this name if no Project with this name exists already; otherwise, the pre-existing Project will be returned.
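As a sketch, assuming the function is exposed as dbnl.get_or_create_project (check the SDK reference for the exact name in your version):

```python
import dbnl

dbnl.login(api_token="<YOUR_DBNL_API_TOKEN>")

# Returns the existing Project if one with this name exists;
# otherwise creates a new Project with the given description.
project = dbnl.get_or_create_project(
    name="churn-model-monitoring",
    description="Tests for the churn prediction service",  # optional, max 255 characters
)
```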
JSON object representing the Project, generally based on a Project exported as JSON. Example:
The Project this RunConfig is associated with.
The ID of the dbnl RunConfig. RunConfig IDs start with the prefix runcfg_. The RunConfig ID can be found on the Run detail page or Project detail page. An error will be raised if no RunConfig exists with the given run_config_id.
project
The Project the most recent Run is associated with.
The dbnl RunConfig from the most recent run in the Project.
Retrieve results from dbnl
run
pandas.DataFrame
You can only call get_column_results after the run is closed.
Create a new dbnl RunConfig
project
columns
A list of column schema specs for the uploaded data. Required keys: name and type; optional keys: component, description, and greater_is_better. type can be int, float, category, boolean, or string. component is a string that indicates the source of the data, e.g. "component": "sentiment-classifier" or "component": "fraud-predictor"; specified components must be present in the components_dag dictionary. greater_is_better is a boolean that indicates whether larger values are better than smaller ones; False indicates smaller values are better, and None indicates no preference.
Example:
columns=[{"name": "pred_proba", "type": "float", "component": "fraud-predictor"}, {"name": "decision", "type": "boolean", "component": "threshold-decision"}, {"name": "requests", "type": "string", "description": "curl request response msg"}]
scalars
NOTE: scalars is available in SDK v0.0.15 and above.
A list of scalar schema specs for the uploaded data. Required keys: name and type; optional keys: component, description, and greater_is_better. type can be int, float, category, boolean, or string. component is a string that indicates the source of the data, e.g. "component": "sentiment-classifier" or "component": "fraud-predictor"; specified components must be present in the components_dag dictionary. greater_is_better is a boolean that indicates whether larger values are better than smaller ones; False indicates smaller values are better, and None indicates no preference. An example RunConfig scalars: scalars=[{"name": "accuracy", "type": "float", "component": "fraud-predictor"}, {"name": "error_type", "type": "category"}]
Scalar schema is identical to column schema.
description
An optional description of the RunConfig, defaults to None. Descriptions are limited to 255 characters.
display_name
An optional display name of the RunConfig, defaults to None. Display names do not have to be unique.
row_id
An optional list of the column names that can be used as unique identifiers, defaults to None.
components_dag
Column names can only be alphanumeric characters and underscores.
The following types are supported as type in the column schema:
float
int
boolean
string
Any arbitrary string values. Raw string type columns do not produce any histogram or scatterplot on the web UI.
category
list
Currently only supports list of string values. List type columns do not produce any histogram or scatterplot on the web UI.
The optional component key specifies the source of the data column in relation to the AI/ML app subcomponents. Components are used in visualizing the components DAG.
The components_dag dictionary specifies the topological layout of the AI/ML app. For each key-value pair, the key represents the source component, and the value is a list of its downstream components. The following code snippet describes the DAG shown above.
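The snippet itself is not reproduced above; based on the components_dag example given later on this page, it would look like:

```python
# Key = source component, value = list of the components it feeds into.
components_dag = {
    "fraud-predictor": ["threshold-decision"],
    "threshold-decision": [],  # no downstream components
}
```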
A new dbnl RunConfig
RunConfig with scalars
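As a sketch of how these pieces fit together, using only the parameters documented above (project is assumed to be an existing dbnl Project object):

```python
run_config = dbnl.create_run_config(
    project=project,
    columns=[
        {"name": "pred_proba", "type": "float", "component": "fraud-predictor"},
        {"name": "decision", "type": "boolean", "component": "threshold-decision"},
        {"name": "requests", "type": "string", "description": "curl request response msg"},
    ],
    scalars=[
        {"name": "accuracy", "type": "float", "component": "fraud-predictor"},
        {"name": "error_type", "type": "category"},
    ],
    components_dag={
        "fraud-predictor": ["threshold-decision"],
        "threshold-decision": [],
    },
    description="Fraud model outputs with per-run accuracy",  # optional
    display_name="fraud-model-v2",                            # optional
)
```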
Functions related to Column and Scalar data uploaded within a Run.
As a convenience for reporting results and creating a Run, you can also check out
Report all column results to dbnl
The dbnl Run from which to retrieve the results.
A pandas DataFrame of the data for the particular Run.
The Project this RunConfig is associated with.
See the section below for more information.
An optional dictionary representing the directed acyclic graph (DAG) of the specified components, defaults to None. Every component listed in the columns schema must be present in the components_dag. Example: components_dag={"fraud-predictor": ["threshold-decision"], "threshold-decision": []}
See the section below for more information.
Equivalent of a pandas categorical. Currently only supports categories of string values.
run
The dbnl Run from which to retrieve the results.
pandas.DataFrame
A pandas DataFrame of the uploaded scalar data for the particular Run.
run
The dbnl Run that the results will be reported to.
data
A pandas DataFrame with all the results to report to dbnl. The columns of the DataFrame must match the columns described in the RunConfig associated with the Run.
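A minimal sketch, assuming run is an open dbnl Run whose RunConfig matches the DataFrame columns:

```python
import pandas as pd

column_data = pd.DataFrame(
    {
        "pred_proba": [0.12, 0.87, 0.45],
        "decision": [False, True, False],
        "requests": ["ok", "ok", "timeout"],
    }
)

# Column names and types must match the RunConfig associated with the Run.
dbnl.report_column_results(run=run, data=column_data)
```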
Functions interacting with dbnl Run
Functions that interact with dbnl Baseline concept
Report all scalar results to dbnl
run
data
All data should be reported to dbnl at once. Calling dbnl.report_scalar_results more than once will overwrite the previously uploaded data.
Once a Run is closed, you can no longer call report_scalar_results to send data to DBNL.
Report all results to dbnl
run
column_data
scalar_data
report_results is the equivalent of calling both report_column_results and report_scalar_results.
All data should be reported to dbnl at once. Calling dbnl.report_results more than once will overwrite the previously uploaded data.
Once a Run is closed, you can no longer call report_results to send data to DBNL.
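For example, a sketch where run is an open dbnl Run and the shapes match its RunConfig:

```python
dbnl.report_results(
    run=run,
    column_data=column_data,                               # pandas DataFrame of per-row results
    scalar_data={"accuracy": 0.94, "error_type": "none"},  # dict or single-row DataFrame
)
```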
Retrieve results from dbnl
run
ResultData
You can only call get_results after the run is closed.
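For example (a sketch; the columns and scalars fields follow the ResultData description later on this page):

```python
results = dbnl.get_results(run=run)  # only valid after the Run is closed
print(results.columns.head())        # per-row results as a pandas DataFrame
print(results.scalars)               # scalar results for the Run
```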
Create a new Run, report results to it, and close it.
project
column_data
scalar_data
display_name
An optional display name for the Run. Display names do not have to be unique.
row_id
An optional list of the column names that can be used as unique identifiers.
run_config_id
ID of the RunConfig to use for the Run, defaults to None. If provided, the RunConfig is used as is and the results are validated against it. If not provided, a new RunConfig is inferred from the column_data.
metadata
Any additional key-value pairs information the user wants to track.
The closed Run with the uploaded data.
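A sketch of this convenience flow; the exact function name is not shown on this page and is assumed here to be dbnl.report_run_with_results, so check the SDK reference for your version:

```python
# Creates a Run in the Project, reports the results, and closes the Run.
run = dbnl.report_run_with_results(  # assumed function name
    project=project,
    column_data=column_data,         # pandas DataFrame of per-row results
    scalar_data={"accuracy": 0.94},  # optional
    display_name="nightly-eval",     # optional
    row_id=["request_id"],           # optional unique-identifier columns
    metadata={"git_sha": "abc123"},  # optional key-value metadata
)
# The returned Run is already closed, so its data is ready for Tests.
```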
Finalize a Run
Mark the specified dbnl Run status as completed. Once a Run is marked as closed, it can no longer be used for reporting Results.
A Run must be closed for all uploaded results to be shown on the UI.
run
The dbnl Run to be finalized.
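For example, assuming the finalizing function is exposed as dbnl.close_run (check the SDK reference for the exact name):

```python
# After all results have been reported:
dbnl.close_run(run=run)  # assumed name; marks the Run as completed
```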
Functions that interact with dbnl
The dbnl Run that the results will be reported to.
A dict or a single-row pandas DataFrame with all the scalar values to report to dbnl.
The DBNL Run that the results will be reported to.
A pandas DataFrame with all the column results to report to DBNL. The columns of the DataFrame must match the columns described in the RunConfig associated with the Run.
A dict or pandas DataFrame with all the scalar results to report to DBNL. The keys of the dict must match the scalars described in the RunConfig associated with the Run.
The dbnl Run from which to retrieve the results.
A named tuple comprising the columns and scalars fields. These are the pandas DataFrames of the data for the particular Run.
The dbnl Project that this Run will be associated with.
The dbnl RunConfig that this Run will be associated with. Two Runs with the same RunConfig can be compared in the web UI and associated in Tests. The associated RunConfig must be from the same Project.
A new dbnl Run.
The dbnl Project this RunQuery will be associated with.
A new dbnl RunQuery, typically used for finding a Dynamic Baseline for a Test Session.
The ID of the dbnl Run. Run IDs start with the prefix run_. The Run ID can be found on the Run detail page. An error will be raised if no Run exists with the given run_id.
The dbnl Project that this Run will be associated with.
A pandas DataFrame with all the column results to report to dbnl. If run_config_id is provided, the columns of the DataFrame must match the columns described in the RunConfig.
A dict or pandas DataFrame with all the scalar results to report to dbnl. If run_config_id is provided, the keys of the dict must match the scalars described in the RunConfig.
project
name
Name of the Run Query.
The dbnl RunQuery, typically used for finding a Dynamic Baseline for a Test Session.
run
The dbnl Run to be set as the Baseline Run in its Project Test Config.
run_query
The dbnl RunQuery to be set as the Baseline Run in its Project Test Config.
id
str
The ID of the Project. Project ID starts with the prefix proj_
name
str
The name of the Project. Project names have to be unique.
description
str
An optional description of the Project. Descriptions are limited to 255 characters.
Create a TestSession
Start evaluating Tests associated with a Run. Typically, the Run you just completed will be the "Experiment" and you'll compare it to some earlier "Baseline Run".
The Run must already have Results reported and be closed before a Test Session can begin.
A Run must be closed for all uploaded results to be shown on the UI.
experiment_run
baseline
include_tags
An optional list of Test Tags to be included. All Tests with any of the tags in this list will be run after the run is complete.
exclude_tags
An optional list of Test Tags to be excluded. All Tests with any of the tags in this list will be skipped after the run is complete.
require_tags
An optional list of Test Tags that are required. Only Tests with all the tags in this list will be run after the run is complete.
Suppose we have the following Tests with the associated Tags in our Project
Test1 with tags ["A", "B"]
Test2 with tags ["A"]
Test3 with tags ["B"]
dbnl.create_test_session(..., include_tags=["A", "B"]) will trigger Tests 1, 2, and 3 to be executed.
dbnl.create_test_session(..., require_tags=["A", "B"]) will only trigger Test 1.
dbnl.create_test_session(..., exclude_tags=["A"]) will trigger Test 3.
dbnl.create_test_session(..., include_tags=["A"], exclude_tags=["B"]) will trigger Test 2.
id
str
The ID of the RunConfig. RunConfig ID starts with the prefix runcfg_
project_id
str
columns
list[dict[str, str]]
A list of column schema specs for the uploaded data. Required keys: name and type; optional keys: component and description. Example:
columns=[{"name": "pred_proba", "type": "float", "component": "fraud-predictor"}, {"name": "decision", "type": "boolean", "component": "threshold-decision"}, {"name": "requests", "type": "string", "description": "curl request response msg"}]
scalars
list[dict[str, str]]
An optional list of scalar schema specs for the uploaded scalar data. Required keys: name and type; optional keys: component, description, and greater_is_better. type can be int, float, category, boolean, or string. component is a string that indicates the source of the data, e.g. "component": "sentiment-classifier" or "component": "fraud-predictor"; specified components must be present in the components_dag dictionary. greater_is_better is a boolean that indicates whether larger values are better than smaller ones; False indicates smaller values are better, and None indicates no preference. An example RunConfig scalars: scalars=[{"name": "accuracy", "type": "float", "component": "fraud-predictor"}, {"name": "error_type", "type": "category"}]
Scalar schema is identical to column schema.
description
str
An optional description of the RunConfig. Descriptions are limited to 255 characters.
display_name
str
An optional display name of the RunConfig.
row_id
list[str]
An optional list of the column names that are used as unique identifiers.
components_dag
dict[str, list[str]]
An optional dictionary representing the directed acyclic graph (DAG) of the specified components. Every component listed in the columns schema must be present in components_dag.
id
str
The ID of the Run. Run ID starts with the prefix run_
project_id
str
run_config_id
str
display_name
str
An optional display name of the Run.
metadata
Dict[str, str]
Any additional key-value pairs information the user wants to track.
run_config
id
str
The ID of the Run. Run ID starts with the prefix run_
project_id
str
The ID of the Project this RunQuery is associated with.
name
str
The name of the RunQuery.
query
dict[str, Any]
The dbnl RunQuery, typically used for finding a Dynamic Baseline for a Test Session.
The dbnl Project to create the TestSession for.
The dbnl Run or RunQuery to compare the experiment_run against. If None (or omitted), the baseline defined in the TestConfig will be used.
The ID of the Project this RunConfig is associated with.
See the section on the dbnl.create_run_config page for more information.
The ID of the Project this Run is associated with.
The ID of the RunConfig this Run is associated with. Runs with the same RunConfig can be compared in the UI.
The RunConfig associated with this Run object.