What Is a Similarity Index?

Similarity Index is a single number between 0 and 100 that quantifies how much your application’s behavior has changed between two runs: a Baseline run and an Experiment run. It is Distributional’s core signal for measuring application drift, automatically calculated and available in every Test Session.

A lower score indicates a greater behavioral change in your AI application. Each Similarity Index is accompanied by Key Insights that describe the change, helping you understand and act on the behavioral drift Distributional has detected.
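
How the score is computed internally isn’t spelled out on this page, so the following is only a conceptual sketch to build intuition: it scores the overlap of two samples’ histograms on a 0 to 100 scale, so identical distributions land near 100 and heavily shifted ones land much lower. The array names baseline and experiment and the histogram-intersection approach are illustrative assumptions, not Distributional’s actual algorithm.

import numpy as np

def toy_similarity_index(baseline: np.ndarray, experiment: np.ndarray, bins: int = 30) -> float:
    """Illustrative only, NOT Distributional's algorithm: scores histogram overlap on a 0-100 scale."""
    lo = min(baseline.min(), experiment.min())
    hi = max(baseline.max(), experiment.max())
    edges = np.linspace(lo, hi, bins + 1)
    b_hist, _ = np.histogram(baseline, bins=edges)
    e_hist, _ = np.histogram(experiment, bins=edges)
    b_prob = b_hist / b_hist.sum()
    e_prob = e_hist / e_hist.sum()
    overlap = np.minimum(b_prob, e_prob).sum()  # 1.0 means the histograms match exactly
    return round(100 * overlap, 1)

rng = np.random.default_rng(0)
same = toy_similarity_index(rng.normal(0, 1, 5000), rng.normal(0, 1, 5000))      # high: no drift
shifted = toy_similarity_index(rng.normal(0, 1, 5000), rng.normal(1.5, 1, 5000)) # much lower: drift
print(same, shifted)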

Where You’ll See It in the UI

  • Test Session Summary Page — App-level Similarity Index, results of failed tests, and Key Insights

  • Similarity Report Tab — Breakdown of Similarity Indexes by column and metric

  • Column Details View — Histograms and statistical comparison for specific metrics

  • Tests View — History of Similarity Index-based test pass/fail over time

Why It Matters

When model behavior changes, you need:

  1. A clear signal that drift occurred

  2. An explanation of what changed

  3. A workflow to debug, test, and act

Together, Similarity Index and Key Insights provide all three.

Example:

  • An app’s Similarity Index drops from 93 → 46

  • Key Insight: “Answer similarity has decreased sharply”

  • Metric: levenshtein__generated_answer__expected_answer

Result: Investigate the histograms, set test thresholds, and adjust the model as needed.
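
The metric name suggests a Levenshtein (edit-distance) comparison between the generated_answer and expected_answer columns. Whether Distributional reports the raw distance or a normalized similarity isn’t stated here, so this minimal sketch only shows the classic distance such a metric is presumably built on:

def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, and substitutions turning a into b."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution (free if characters match)
            ))
        prev = curr
    return prev[-1]

print(levenshtein("the cat sat", "the cat sat"))  # 0: generated answer matches the expected answer
print(levenshtein("the cat sat", "a dog stood"))  # larger: the answers have diverged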

Hierarchical Structure

Similarity Index operates at three levels:

  • Application Level — Aggregates all lower-level scores

  • Column Level — Individual column-level drift

  • Metric Level — Fine-grained metric change (e.g., readability, latency, BLEU score)

Each level rolls up into the one above it. You can sort by Similarity Index to find the most impacted parts of your app.
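
The exact aggregation behind the rollup isn’t documented here, so the sketch below assumes a simple mean at each level, with made-up column and metric names, purely to illustrate how metric-level scores feed column-level scores and column-level scores feed the application-level score:

from statistics import mean

# Hypothetical metric-level Similarity Indexes, grouped by column.
metric_scores = {
    "generated_answer": {
        "levenshtein__generated_answer__expected_answer": 46,
        "token_count__generated_answer": 88,
    },
    "latency_ms": {"p95__latency_ms": 91},
}

# Assumed rollup: average metric scores per column, then average columns for the app.
column_scores = {col: mean(scores.values()) for col, scores in metric_scores.items()}
app_score = mean(column_scores.values())

# Sorting by Similarity Index surfaces the most impacted parts of the app first.
for col, score in sorted(column_scores.items(), key=lambda kv: kv[1]):
    print(f"{col}: {score:.0f}")
print(f"application: {app_score:.0f}")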

Test Sessions and Thresholds

By default, a new DBNL project comes with an Application-level Similarity Index test:

  • Threshold: ≥ 80

  • Failure: Indicates meaningful application behavior change

In the UI:

  • Passed tests are shown in green

  • Failed tests are shown in red with diagnostic details

All past test runs can be reviewed in the test history.
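
Conceptually, the default test is just a threshold assertion on the application-level score. The check runs inside the Test Session rather than in your own code; this tiny sketch only restates the pass/fail rule:

SIMILARITY_THRESHOLD = 80  # default application-level threshold

def default_similarity_test(app_similarity_index: float) -> str:
    # Pass (green) at or above the threshold; fail (red) below it.
    return "PASS" if app_similarity_index >= SIMILARITY_THRESHOLD else "FAIL"

print(default_similarity_test(93))  # PASS
print(default_similarity_test(46))  # FAIL: meaningful behavior change, review Key Insights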

Key Insights

Key Insights are human-readable interpretations of Similarity Index changes. They answer:

“What changed, and does it matter?”

Each Key Insight includes:

  • A plain-language summary: “Distribution substantially drifted to the right”

  • The associated column/metric

  • The Similarity Index for that metric

  • Option to add a test on the spot

Example:

Distribution substantially drifted to the right.

→ Metric: levenshtein__generated_answer__expected_answer

→ Similarity Index: 46

→ Add Test

Insights are ordered by impact, helping you triage the most significant changes first.
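
The fields above map naturally onto a small record. The dataclass below is only an illustration of that shape, with assumed field names; it is not a class exposed by the SDK:

from dataclasses import dataclass

@dataclass
class KeyInsight:
    """Illustrative shape of a Key Insight, not an actual SDK class."""
    summary: str             # plain-language description of the change
    metric: str              # associated column/metric
    similarity_index: float  # Similarity Index for that metric

insight = KeyInsight(
    summary="Distribution substantially drifted to the right.",
    metric="levenshtein__generated_answer__expected_answer",
    similarity_index=46,
)
print(f"{insight.summary} ({insight.metric}: {insight.similarity_index})")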

Deep Dive: Column Similarity Details

Clicking into a Key Insight opens a detailed view:

  • Histogram overlays for experiment vs. baseline

  • Summary statistics (mean, median, percentiles, standard deviation)

  • Absolute difference of statistics between runs

  • Links to add similarity or statistical tests on specific metrics

This helps pinpoint whether drift was due to longer answers, slower responses, or changes in generation fidelity.
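
If you export the underlying results yourself, you can reproduce the same comparison locally. The snippet below assumes two pandas Series of metric values, one per run, and mirrors the statistics shown in the detail view:

import pandas as pd

def compare_runs(baseline: pd.Series, experiment: pd.Series) -> pd.DataFrame:
    """Summary statistics per run plus their absolute difference."""
    stats = {
        "mean": (baseline.mean(), experiment.mean()),
        "median": (baseline.median(), experiment.median()),
        "p95": (baseline.quantile(0.95), experiment.quantile(0.95)),
        "std": (baseline.std(), experiment.std()),
    }
    table = pd.DataFrame(stats, index=["baseline", "experiment"]).T
    table["abs_diff"] = (table["baseline"] - table["experiment"]).abs()
    return table

baseline = pd.Series([12.0, 14.5, 13.2, 15.1, 12.8])    # e.g. baseline response times (hypothetical data)
experiment = pd.Series([18.3, 17.9, 19.4, 18.8, 20.1])  # e.g. experiment response times (hypothetical data)
print(compare_runs(baseline, experiment))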

Frequently Asked Questions

What’s considered “low” similarity?
  • Below 80 = significant drift (default failure threshold)

  • Below 60 = usually signals substantial regression or change

Can I configure the thresholds?

Yes — Similarity Index thresholds can be adjusted, and custom tests can be created at any level (app, column, metric).

Do I need to set anything up to use Similarity Index?

No. Similarity Index is computed automatically for every numeric column that overlaps between the Baseline and Experiment runs, and for non-numeric columns that have defined metrics.

What columns does Similarity Index apply to?

Only numeric columns and derived metrics (e.g., response time, BLEU, readability). String values are not supported yet.

Example Workflow

  1. Run a test session

  2. Similarity Index < 80 → test fails

  3. Review top-level Key Insights

  4. Click into a metric (e.g., levenshtein__generated_answer__expected_answer)

  5. View distribution shift and statistical breakdown

  6. Add targeted test thresholds to monitor ongoing behavior

  7. Adjust model, prompt, or infrastructure as needed
