Hello World (Sentiment Classifier)

In this tutorial, we demonstrate core Distributional usage, including data submission and test execution, on a tweet sentiment classifier.

The data files required for this tutorial are available in hello-world-2024-11-18-18-01-58.zip (Hello World Tutorial files, 3MB).

Objective

The objective of this tutorial is to demonstrate how to use Distributional, often abbreviated dbnl. You will write and execute tests that provide confidence that the underlying model is consistent with its previous version. At the end of this tutorial, you will have learned about the following elements of successful dbnl usage:

  • Define data schema - Create a run config that explains to dbnl which columns will be sent and how those columns interact with each other.

    • Once this work is done, the same run config will be reused during consistent integration testing.

  • Push data to dbnl - Use the Python SDK to report runs to dbnl.

    • Please contact your Applied Customer Engineer to discuss integrations with your data storage platform.

  • Design consistent integration tests - Review data that exists in dbnl to design tests for consistency of app status.

    • Common testing strategies for consistency include:

      • Confirming that distributions have the same statistics,

      • Confirming that distributions are overall similar,

      • Confirming that specific results have not deviated too severely.

  • Execute and review test sessions - After test sessions are completed, you can inspect individual assertions to learn more about the current app behavior.
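
The rest of this tutorial walks through each of these steps in detail. As a preview, the condensed sketch below strings together the SDK calls used throughout the tutorial (dbnl.login, dbnl.get_or_create_project, dbnl.create_run_config, dbnl.create_run, dbnl.report_results, dbnl.close_run, and the experimental test functions). Treat it as an outline of the flow rather than a runnable script: the project name, file paths, and payloads are placeholders that are built up over the following sections.

import json
import os

import pandas as pd
import dbnl

# Authenticate (API Token available at https://app.dbnl.com/tokens)
dbnl.login(api_token=os.environ["DBNL_API_TOKEN"])

# Define the data schema via a run config and create a project
proj = dbnl.get_or_create_project(name="Sentiment Classifier Demo Project")
run_config = dbnl.create_run_config(project=proj, **json.load(open("run_config.json")))

# Push a baseline run and mark it as the baseline for later comparisons
baseline_run = dbnl.create_run(project=proj, display_name="baseline", run_config=run_config)
dbnl.report_results(run=baseline_run, data=pd.read_parquet("Classic Sentiment Classifier.parquet").reset_index())
dbnl.close_run(run=baseline_run)
dbnl.experimental.set_run_as_baseline(run=baseline_run)

# Register the tests; pushing the new (experiment) run later in the tutorial
# automatically triggers a test session against the baseline.
for payload in json.load(open("test_payloads.json")):
    spec = dbnl.experimental.prepare_incomplete_test_spec_payload(
        test_spec_dict=payload, project_id=proj.id
    )
    dbnl.experimental.create_test(test_spec_dict=spec)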

Setting the scene

As owners of the sentiment classifier, you might want to improve the model's performance, or you might want to change the model to a different one that is more efficient. In both cases, it is important to ensure that the new model is functioning as expected. This tutorial presents the replacement of a rule-based classic sentiment classifier with one powered by LLMs.

To do this, you identify a fixed benchmark dataset of 10,000 tweets to ensure that any changes observed between models are caused by the models and not the data.

Below is an example where you fetch these tweets from Snowflake. The fetch_tweets function will return a list of tweets with the columns tweet_id, tweet_content, and ground_truth_sentiment.

Retrieving benchmark tweets from Snowflake
import snowflake.connector

def get_snowflake_connection():
    """Simulate connecting to Snowflake."""
    return snowflake.connector.connect(
        user='user',
        password='password',
        account='account',
        warehouse='warehouse',
        database='database',
        schema='schema'
    )

def fetch_tweets(tweet_ids):
    """Simulate fetching tweets from Snowflake."""
    with get_snowflake_connection() as conn:
        cursor = conn.cursor()
        cursor.execute(
            f"""
            SELECT tweet_id, tweet_content, ground_truth_sentiment
            FROM tweets
            WHERE tweet_id IN ({', '.join(map(str, tweet_ids))})
            """
        )
        tweets = cursor.fetchall()
    return tweets

Define the data schema (run_config)

You define the schema of the columns present in the test data through your run configuration. This is important to ensure that the data is in the correct format before it is sent to dbnl.

Note that, in this example, the tweets are indexed based on an arbitrary ordering; that ordering is referred to as the tweet_id. You have the ability to match individual results between runs using this tweet_id, which enables tests between individual results.

Sentiment classifier run_config.json
{
    "display_name": "Sentiment Classifier",
    "description": "This run-config is for the sentiment classifier, focused on preserving Sentiment Data while speeding up latency.",
    "row_id": [
        "tweet_id"
    ],
    "columns": [
        {
            "name": "tweet_id",
            "type": "string",
            "description": "Unique identifier for each tweet."
        },
        {
            "name": "tweet_content",
            "type": "string",
            "description": "Tweets related to financial markets."
        },
        {
            "name": "positive_sentiment",
            "type": "float",
            "description": "Measure of positive sentiment, value between 0 and 1."
        },
        {
            "name": "neutral_sentiment",
            "type": "float",
            "description": "Measure of neutral sentiment, value between 0 and 1."
        },
        {
            "name": "negative_sentiment",
            "type": "float",
            "description": "Measure of positive sentiment, value between 0 and 1."
        },
        {
            "name": "ground_truth_sentiment",
            "type": "category",
            "description": "Ground truth sentiment category for the tweet."
        },
        {
            "name": "sentiment_latency_ms",
            "type": "float",
            "description": "Classifier inference time (in milliseconds)."
        }
    ]
}
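
Since the run config is the contract between your dataframes and dbnl, it can be helpful to sanity-check a dataframe locally against run_config.json before uploading. The helper below is a hypothetical local check (not part of the dbnl SDK) that compares the declared column names against the dataframe, assuming the dataframe is indexed by the row_id column as in this tutorial.

import json
import pandas as pd

def check_against_run_config(df: pd.DataFrame, run_config_path: str = "run_config.json") -> None:
    """Hypothetical local sanity check: confirm the dataframe carries every declared column."""
    config = json.load(open(run_config_path))
    declared = {col["name"] for col in config["columns"]}
    # The row_id column lives in the dataframe index in this tutorial, so reset it before comparing.
    present = set(df.reset_index().columns)
    missing = declared - present
    extra = present - declared
    if missing:
        raise ValueError(f"dataframe is missing declared columns: {sorted(missing)}")
    if extra:
        print(f"warning: dataframe has columns not declared in the run config: {sorted(extra)}")

# Example usage with one of the parquet files produced in the next section:
# check_against_run_config(pd.read_parquet("Classic Sentiment Classifier.parquet"))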

Prepare the results for dbnl

The run_model_original and run_model_new functions score individual tweets; below, they are used to build pandas dataframes with the sentiment columns, which are then saved to disk.

Calling the sentiment classifiers and creating the dataframes
import time
import pandas as pd

from model import run_model_original, run_model_new, TWEET_IDS

def get_data(model_fn, tweet_id):
    """Simulate running the data through the model."""
    tweet = fetch_tweets([tweet_id])[0]
    tweet_id, tweet_content, ground_truth_sentiment = tweet
    sentiment = model_fn(tweet_content)
    return {
        'tweet_id': tweet_id,
        'positive_sentiment': sentiment['positive'],
        'neutral_sentiment': sentiment['neutral'],
        'negative_sentiment': sentiment['negative'],
        'ground_truth_sentiment': ground_truth_sentiment
    }

def generate_run_results(model_fn):
    """Generate the run results for the model.
    Desired dataframe columns: 'tweet_id', 'tweet_content', 'sentiment_latency_ms', 'positive_sentiment', 'neutral_sentiment', 'negative_sentiment', 'ground_truth_sentiment'.

    The index of the dataframe should be the tweet_id.

    Note, the ground_truth_sentiment should be of type 'category' as it is listed in the run config.
    """
    run_results = []
    tweets = fetch_tweets(TWEET_IDS)
    for tweet in tweets:
        tweet_id, tweet_content, ground_truth_sentiment = tweet
        start_time = time.time()
        sentiment = model_fn(tweet_content)
        end_time = time.time()
        run_results.append({
            'tweet_id': tweet_id,
            'tweet_content': tweet_content,
            'sentiment_latency_ms': (end_time - start_time) * 1000,
            'positive_sentiment': sentiment['positive'],
            'neutral_sentiment': sentiment['neutral'],
            'negative_sentiment': sentiment['negative'],
            'ground_truth_sentiment': ground_truth_sentiment
        })
    df = pd.DataFrame(run_results)
    df['ground_truth_sentiment'] = df['ground_truth_sentiment'].astype('category')
    return df.set_index('tweet_id')

run_results_original = generate_run_results(run_model_original)
run_results_original.to_parquet('Classic Sentiment Classifier.parquet')

run_results_new = generate_run_results(run_model_new)
run_results_new.to_parquet('LLM Sentiment Classifier.parquet')

Latency values are computed externally, but could also be returned from the model itself, along with data such as the number of tokens used by an LLM tool.
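
If you prefer to have the model report these measurements itself, a thin wrapper around the model call can return them alongside the sentiment scores. The sketch below is a hypothetical wrapper (the token-count field is a placeholder for whatever usage metadata your model or LLM client actually exposes); any extra measurement recorded this way also needs a corresponding column in the run config.

import time

def run_model_with_measurements(model_fn, tweet_content):
    """Hypothetical wrapper: return sentiment scores plus latency measured at the call site."""
    start_time = time.time()
    sentiment = model_fn(tweet_content)
    latency_ms = (time.time() - start_time) * 1000
    return {
        **sentiment,
        "sentiment_latency_ms": latency_ms,
        # Placeholder: substitute the token usage reported by your LLM client, if any.
        "num_tokens": None,
    }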

These two dataframes contain measurements of the sentiment classifier's behavior across all of the columns in the run config. This data provides the necessary information for executing tests to assert similar or aberrant behavior between the two models.

Push the runs to dbnl

After following the Getting Started instructions, you follow the steps below to push data to dbnl:

  1. Authenticate with dbnl (API Token available at https://app.dbnl.com/tokens).

  2. Create a project in dbnl.

  3. Create a run configuration in dbnl.

  4. Create a run in that project using that run config and a dataframe from earlier.

Pushing completed runs to dbnl
import os
import glob
import json
from datetime import datetime

import pandas as pd
import dbnl

PROJECT_NAME = 'Sentiment Classifier Demo Project'

# authenticate
dbnl.login(api_token=os.environ['DBNL_API_TOKEN'])

# create a project 
proj = dbnl.get_or_create_project(
    name=f"{PROJECT_NAME}_{datetime.now().strftime('%Y-%m-%d_%H-%M-%S')}",
    description="Sentiment Classifier Demo Project"
)

# create a run configuration
run_config_dict = json.load(open('run_config.json'))
run_config = dbnl.create_run_config(
    project=proj,
    **run_config_dict
)

# define a function to send the run data to dbnl
def send_run_data_to_dbnl(filename):
    test_data = pd.read_parquet(filename)
    run = dbnl.create_run(
        project=proj,
        display_name=filename.split('.')[0],
        run_config=run_config,
        metadata=test_data.attrs
    )
    dbnl.report_results(run=run, data=test_data.reset_index())
    dbnl.close_run(run=run)
    return run

# load the two run results from disk and send them to dbnl
filenames = iter(sorted(glob.glob('*.parquet')))
baseline_filename = next(filenames)
v2_filename = next(filenames)
run = send_run_data_to_dbnl(baseline_filename)

Define tests of consistent behavior

At this point, you can view and compare your runs within the dbnl website: the output of the above code contains the appropriate URLs. You can also use the website or the SDK to create tests of consistent behavior. We discuss these tests in this section, demonstrating examples of three different types of test configurations below. Each test has an assertion with a constant threshold that may seem arbitrary at a glance. Determining the proper thresholds comes with experience and learning the degree to which statistics change. You can read more about choosing a threshold; our product roadmap includes such features as automatically determining thresholds based on user preferences.

Example 1: Test that the means of the distributions have not changed

This is an example test on the positive_sentiment column. The test asserts that the mean of the positive_sentiment column in the new model is within 0.1 of the mean of the positive_sentiment column in the original model.

The value 0.1 represents a rather large average shift in distribution over 10,000 tweets; any change at this level is certainly worth investigating.

{
    "name": "abs_diff_mean__positive_sentiment",
    "description": "This is a abs_diff_mean test for mean applied to positive_sentiment.",
    "tag_names": [],
    "assertion": {
        "name": "less_than",
        "params": {
            "other": 0.1
        }
    },
    "statistic_name": "abs_diff_mean",
    "statistic_params": {},
    "statistic_inputs": [
        {
            "select_query_template": {
                "select": "{EXPERIMENT}.positive_sentiment"
            }
        },
        {
            "select_query_template": {
                "select": "{BASELINE}.positive_sentiment"
            }
        }
    ]
}
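
For intuition, the statistic this test computes is roughly the following pandas expression. This is only a local sketch of the idea (dbnl evaluates the statistic server-side), using the dataframes built earlier, with run_results_new as the EXPERIMENT run and run_results_original as the BASELINE run.

# Local illustration of the abs_diff_mean statistic with the less_than 0.1 assertion.
abs_diff_mean = abs(
    run_results_new["positive_sentiment"].mean()
    - run_results_original["positive_sentiment"].mean()
)
assert abs_diff_mean < 0.1, f"mean positive sentiment shifted by {abs_diff_mean:.3f}"
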
Example 2: Test that the distributions are overall similar

This is an example test on the neutral_sentiment column. The test asserts that the distribution of the neutral_sentiment column in the new model is similar to the distribution of the neutral_sentiment column in the original model.

Here, we used the Kolmogorov-Smirnov statistic (with some scaling that we have developed internally) to measure the similarity of these distributions. The threshold 0.1 is something that must be learned after some practice and experimentation, as the quantity has no physical interpretation. Please reach out to your Applied Customer Engineer if you would like guidance on this process.

{
    "name": "scaled_ks_stat__neutral_sentiment",
    "description": "This is a discrepancy test for neutral_sentiment.",
    "tag_names": [],
    "assertion": {
        "name": "less_than",
        "params": {
            "other": 0.1
        }
    },
    "statistic_name": "scaled_ks_stat",
    "statistic_params": {},
    "statistic_inputs": [
        {
            "select_query_template": {
                "select": "{EXPERIMENT}.neutral_sentiment"
            }
        },
        {
            "select_query_template": {
                "select": "{BASELINE}.neutral_sentiment"
            }
        }
    ]
}
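
The unscaled version of this statistic is the classic two-sample Kolmogorov-Smirnov statistic. The sketch below uses scipy to show the underlying quantity on the earlier dataframes; dbnl's scaled_ks_stat applies its own internal scaling, so this raw value is illustrative only and is not directly comparable to the 0.1 threshold above.

from scipy.stats import ks_2samp

# Local illustration: the raw two-sample KS statistic between the neutral_sentiment distributions.
ks_statistic, _p_value = ks_2samp(
    run_results_new["neutral_sentiment"], run_results_original["neutral_sentiment"]
)
print(f"two-sample KS statistic: {ks_statistic:.3f}")
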
Example 3: Test that specific results have not deviated too severely

This is an example test on the negative_sentiment column. The test asserts that on a tweet by tweet basis, the negative_sentiment column in the new model is on average within 0.1 of the negative_sentiment column in the original model.

In this situation, if any single result were to vary (either increasing or decreasing negative sentiment) by 0.1, that would be okay. But if the average change were 0.1, that implies a rather consistent deviation in behavior for a large number of results. Because this is a row-wise operation, increases and decreases in sentiment both contribute equally (they do not cancel each other out).

{
    "name": "mean_matched_abs_diff__negative_sentiment",
    "description": "This is a test for mean_matched_abs_diff applied to negative_sentiment.",
    "tag_names": [],
    "assertion": {
        "name": "less_than",
        "params": {
            "other": 0.1
        }
    },
    "statistic_name": "mean",
    "statistic_params": {},
    "statistic_inputs": [
        {
            "select_query_template": {
                "select": "abs({EXPERIMENT}.negative_sentiment - {BASELINE}.negative_sentiment)"
            }
        }
    ]
}
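
Because the runs share the tweet_id row identifier, the statistic pairs results row by row before averaging. Below is a local pandas sketch of the same quantity (again illustrative only, assuming both earlier dataframes are indexed by tweet_id as in this tutorial).

# Local illustration of mean matched absolute difference on negative_sentiment.
# The subtraction aligns rows on the shared tweet_id index before averaging.
per_tweet_shift = (
    run_results_new["negative_sentiment"] - run_results_original["negative_sentiment"]
).abs()
mean_matched_abs_diff = per_tweet_shift.mean()
assert mean_matched_abs_diff < 0.1, f"mean per-tweet shift is {mean_matched_abs_diff:.3f}"
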
Using the Web Application to define tests

The above examples show how to use a test payload to define tests. The same tests can also be defined through the UI; for example, Test 2 above can be configured directly in the web application.

Execute the tests

After defining the tests (the examples above illustrate the types of tests used in this tutorial), you can execute them using the SDK. This process involves defining a baseline run, defining the tests, and submitting the new run, which will automatically trigger the tests.

Defining a baseline

Set the run as the baseline run for the project. This will be used as the baseline for all tests which compare the EXPERIMENT run to the BASELINE run.

dbnl.experimental.set_run_as_baseline(run=run)
Defining the tests

The tests are defined in the test_payloads.json file. The prepare_incomplete_test_spec_payload function is used to prepare the test spec payload for the test. This adds the project id (and tag ids if they exist) to the test spec payload.

test_payloads = json.load(open("test_payloads.json"))
for test_payload in test_payloads:
    test_spec_dict = dbnl.experimental.prepare_incomplete_test_spec_payload(
        test_spec_dict=test_payload, 
        project_id=proj.id
    )
    dbnl.experimental.create_test(test_spec_dict=test_spec_dict)
Submitting the v2 run, automatically triggering a test

The send_run_data_to_dbnl function is used to send the run data to dbnl. This function will automatically trigger the tests defined in the test_payloads.json file. The run submitted is treated as the EXPERIMENT run.

run = send_run_data_to_dbnl(v2_filename)

Reviewing a completed test session

The first completed test session, comparing the original sentiment classifier against the high-F1-score LLM-based sentiment classifier, shows some passing assertions and some failed assertions. In this test session, you see that the means of the distributions have not changed, but that the matched-runs tests and nonparametric tests indicate a significant change in app behavior.

Digging into these failed test assertions, you can see that there is a significant difference in behavior both at the result-by-result level and at the level of the full distribution. For the matched absolute difference, the neutral sentiment of individual results has shifted by 0.23 on average in the change from the original sentiment classifier to the LLM-powered sentiment classifier, well beyond the stated threshold.

When you study the KS test for the positive sentiment, you see a rather massive change in distribution: a striking shift from a unimodal distribution for the original sentiment classifier to a bimodal distribution for the LLM-based sentiment classifier.

Clicking on the Compare button takes you to the page used to study how two runs relate through the various columns. In particular, you can use the scatter plot functionality to study how the sentiment is a function of the tweets themselves. This analysis leads to the realization that the LLM-based sentiment classifier, for some reason, thinks long tweets usually have positive sentiment and short tweets lack positive sentiment.

You and your teammates can have a lively discussion regarding whether this is in fact desired behavior. But this test session alerted you to the change in behavior and provided follow-up analysis on your data to understand how this change manifested and whether it is a problem.
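
If you want to reproduce a similar view locally while discussing the result, a quick scatter plot of positive sentiment against tweet length captures the relationship the Compare page surfaced. This is a hypothetical local sketch using matplotlib, assuming the run dataframes built earlier are still in memory; it is not the Compare page itself.

import matplotlib.pyplot as plt

# Local check of the pattern seen on the Compare page:
# positive sentiment as a function of tweet length for each classifier.
fig, ax = plt.subplots()
for label, df in [("Classic", run_results_original), ("LLM", run_results_new)]:
    ax.scatter(df["tweet_content"].str.len(), df["positive_sentiment"], s=5, alpha=0.4, label=label)
ax.set_xlabel("tweet length (characters)")
ax.set_ylabel("positive sentiment")
ax.legend()
plt.show()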
