What Is a Similarity Index?

Similarity Index is a single number between 0 and 100 that quantifies how much your application’s behavior has changed between two runs: a Baseline run and an Experiment run. It is Distributional’s core signal for measuring application drift, automatically calculated and available in every Test Session.

A lower score indicates a greater behavioral change in your AI application. Each Similarity Index is accompanied by Key Insights that describe the change, helping you understand and act on the behavioral drift Distributional has detected.
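
How the score is computed internally isn’t spelled out on this page, so the following is only a conceptual sketch to build intuition: it scores the overlap of two samples’ histograms on a 0 to 100 scale, so identical distributions land near 100 and heavily shifted ones land much lower. The array names baseline and experiment and the histogram-intersection approach are illustrative assumptions, not Distributional’s actual algorithm.

import numpy as np

def toy_similarity_index(baseline: np.ndarray, experiment: np.ndarray, bins: int = 30) -> float:
    """Illustrative only, NOT Distributional's algorithm: scores histogram overlap on a 0-100 scale."""
    lo = min(baseline.min(), experiment.min())
    hi = max(baseline.max(), experiment.max())
    edges = np.linspace(lo, hi, bins + 1)
    b_hist, _ = np.histogram(baseline, bins=edges)
    e_hist, _ = np.histogram(experiment, bins=edges)
    b_prob = b_hist / b_hist.sum()
    e_prob = e_hist / e_hist.sum()
    overlap = np.minimum(b_prob, e_prob).sum()  # 1.0 means the histograms match exactly
    return round(100 * overlap, 1)

rng = np.random.default_rng(0)
same = toy_similarity_index(rng.normal(0, 1, 5000), rng.normal(0, 1, 5000))      # high: no drift
shifted = toy_similarity_index(rng.normal(0, 1, 5000), rng.normal(1.5, 1, 5000)) # much lower: drift
print(same, shifted)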

Where You’ll See It in the UI

  • Test Session Summary Page — App-level Similarity Index, results of failed tests, and Key Insights

  • Similarity Report Tab — Breakdown of Similarity Indexes by column and metric

  • Column Details View — Histograms and statistical comparison for specific metrics

  • Tests View — History of Similarity Index-based test pass/fail over time

Why It Matters

When model behavior changes, you need:

  1. A clear signal that drift occurred

  2. An explanation of what changed

  3. A workflow to debug, test, and act

Together, Similarity Index and Key Insights provide all three.

Example:

  • An app’s Similarity Index drops from 93 → 46

  • Key Insight: “Answer similarity has decreased sharply”

  • Metric: levenshtein__generated_answer__expected_answer

Result: Investigate the histograms, set test thresholds, and adjust the model as needed.
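
The metric name suggests a Levenshtein (edit-distance) comparison between the generated_answer and expected_answer columns. Whether Distributional reports the raw distance or a normalized similarity isn’t stated here, so this minimal sketch only shows the classic distance such a metric is presumably built on:

def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, and substitutions turning a into b."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution (free if characters match)
            ))
        prev = curr
    return prev[-1]

print(levenshtein("the cat sat", "the cat sat"))  # 0: generated answer matches the expected answer
print(levenshtein("the cat sat", "a dog stood"))  # larger: the answers have diverged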

Hierarchical Structure

Similarity Index operates at three levels:

  • Application Level — Aggregates all lower-level scores

  • Column Level — Individual column-level drift

  • Metric Level — Fine-grained metric change (e.g., readability, latency, BLEU score)

Each level rolls up into the one above it. You can sort by Similarity Index to find the most impacted parts of your app.
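
The exact aggregation behind the rollup isn’t documented here, so the sketch below assumes a simple mean at each level, with made-up column and metric names, purely to illustrate how metric-level scores feed column-level scores and column-level scores feed the application-level score:

from statistics import mean

# Hypothetical metric-level Similarity Indexes, grouped by column.
metric_scores = {
    "generated_answer": {
        "levenshtein__generated_answer__expected_answer": 46,
        "token_count__generated_answer": 88,
    },
    "latency_ms": {"p95__latency_ms": 91},
}

# Assumed rollup: average metric scores per column, then average columns for the app.
column_scores = {col: mean(scores.values()) for col, scores in metric_scores.items()}
app_score = mean(column_scores.values())

# Sorting by Similarity Index surfaces the most impacted parts of the app first.
for col, score in sorted(column_scores.items(), key=lambda kv: kv[1]):
    print(f"{col}: {score:.0f}")
print(f"application: {app_score:.0f}")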

Test Sessions and Thresholds

By default, a new DBNL project comes with an Application-level Similarity Index test:

  • Threshold: ≥ 80

  • Failure: Indicates meaningful application behavior change

In the UI:

  • Passed tests are shown in green

  • Failed tests are shown in red with diagnostic details

All past test runs can be reviewed in the test history.
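
Conceptually, the default test is just a threshold assertion on the application-level score. The check runs inside the Test Session rather than in your own code; this tiny sketch only restates the pass/fail rule:

SIMILARITY_THRESHOLD = 80  # default application-level threshold

def default_similarity_test(app_similarity_index: float) -> str:
    # Pass (green) at or above the threshold; fail (red) below it.
    return "PASS" if app_similarity_index >= SIMILARITY_THRESHOLD else "FAIL"

print(default_similarity_test(93))  # PASS
print(default_similarity_test(46))  # FAIL: meaningful behavior change, review Key Insights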

Key Insights

Key Insights are human-readable interpretations of Similarity Index changes. They answer:

“What changed, and does it matter?”

Each Key Insight includes:

  • A plain-language summary: “Distribution substantially drifted to the right”

  • The associated column/metric

  • The Similarity Index for that metric

  • Option to add a test on the spot

Example:

Distribution substantially drifted to the right.

→ Metric: levenshtein__generated_answer__expected_answer

→ Similarity Index: 46

→ Add Test

Insights are ordered by impact, helping you triage the most significant changes first.
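
The fields above map naturally onto a small record. The dataclass below is only an illustration of that shape, with assumed field names; it is not a class exposed by the SDK:

from dataclasses import dataclass

@dataclass
class KeyInsight:
    """Illustrative shape of a Key Insight, not an actual SDK class."""
    summary: str             # plain-language description of the change
    metric: str              # associated column/metric
    similarity_index: float  # Similarity Index for that metric

insight = KeyInsight(
    summary="Distribution substantially drifted to the right.",
    metric="levenshtein__generated_answer__expected_answer",
    similarity_index=46,
)
print(f"{insight.summary} ({insight.metric}: {insight.similarity_index})")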

Deep Dive: Column Similarity Details

Clicking into a Key Insight opens a detailed view:

  • Histogram overlays for experiment vs. baseline

  • Summary statistics (mean, median, percentiles, standard deviation)

  • Absolute difference of statistics between runs

  • Links to add similarity or statistical tests on specific metrics

This helps pinpoint whether drift was due to longer answers, slower responses, or changes in generation fidelity.
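
If you export the underlying results yourself, you can reproduce the same comparison locally. The snippet below assumes two pandas Series of metric values, one per run, and mirrors the statistics shown in the detail view:

import pandas as pd

def compare_runs(baseline: pd.Series, experiment: pd.Series) -> pd.DataFrame:
    """Summary statistics per run plus their absolute difference."""
    stats = {
        "mean": (baseline.mean(), experiment.mean()),
        "median": (baseline.median(), experiment.median()),
        "p95": (baseline.quantile(0.95), experiment.quantile(0.95)),
        "std": (baseline.std(), experiment.std()),
    }
    table = pd.DataFrame(stats, index=["baseline", "experiment"]).T
    table["abs_diff"] = (table["baseline"] - table["experiment"]).abs()
    return table

baseline = pd.Series([12.0, 14.5, 13.2, 15.1, 12.8])    # e.g. baseline response times (hypothetical data)
experiment = pd.Series([18.3, 17.9, 19.4, 18.8, 20.1])  # e.g. experiment response times (hypothetical data)
print(compare_runs(baseline, experiment))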

Frequently Asked Questions

What’s considered “low” similarity?
  • Below 80 = significant drift (default failure threshold)

  • Below 60 = usually signals substantial regression or change

Can I configure the thresholds?

Yes — Similarity Index thresholds can be adjusted, and custom tests can be created at any level (app, column, metric).

Do I need to set anything up to use Similarity Index?

No. Similarity Index is computed automatically for every numeric column that overlaps between the Baseline and Experiment runs, and for non-numeric columns that have defined metrics.

What columns does Similarity Index apply to?

Only numeric columns and derived metrics (e.g., response time, BLEU, readability). String values are not supported yet.

Example Workflow

  1. Run a test session

  2. Similarity Index < 80 → test fails

  3. Review top-level Key Insights

  4. Click into a metric (e.g., levenshtein__generated_answer__expected_answer)

  5. View distribution shift and statistical breakdown

  6. Add targeted test thresholds to monitor ongoing behavior

  7. Adjust model, prompt, or infrastructure as needed
