Integration testing for text summarization

After our summarization app is deployed, we conduct nightly integration tests to confirm that its behavior remains acceptable.


Integration testing for consistency and viability

Continually testing the deployed app for consistency helps ensure that its behavior continues to match the expectations set at deployment. Here, we are working with OpenAI - Basic Summary, the winner of the constrained optimization from the prompt engineering stage. In this new, integration-testing-focused project, we create a run recording its behavior on the day of deployment and title it Day 01 - Best LLM. We set this run as the baseline of our project so that incoming runs, submitted daily, immediately trigger the integration tests and create a new test session.
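
This setup can be scripted with the dbnl Python SDK. The sketch below is illustrative only: the function names (get_or_create_project, create_run_config, create_run, report_results, close_run, set_run_as_baseline) are taken from the SDK reference, but the exact signatures, the column schema, and the placeholder results are assumptions.

import dbnl
import pandas as pd

dbnl.login(api_token="<DBNL_API_TOKEN>")  # assumed token-based login
project = dbnl.get_or_create_project(name="LLM Text Summarization - Integration")

# Declare the columns reported on each run: the 6 summary quality metrics and
# the 4 text quality metrics (schema abbreviated here for illustration).
run_config = dbnl.create_run_config(
    project=project,
    columns=[
        {"name": "essential_information_capture_score", "type": "float"},
        # ... the remaining summary quality and text quality columns ...
    ],
)

# Placeholder results table; in practice this holds the scored summaries from
# the 400 prompts on deployment day.
day01_results_df = pd.DataFrame({"essential_information_capture_score": [0.91, 0.87, 0.95]})

# Record the deployed app's behavior on the day of deployment.
run = dbnl.create_run(project=project, run_config=run_config, display_name="Day 01 - Best LLM")
dbnl.report_results(run=run, data=day01_results_df)
dbnl.close_run(run=run)

# Make this run the project baseline so that each incoming nightly run
# immediately triggers the integration tests and creates a new test session.
dbnl.set_run_as_baseline(run=run)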

The tests under consideration are tagged into 3 groups:

  • Parametric tests of summary quality, which confirm that the median of each of the 6 summary quality distributions matches the baseline;

  • Parametric tests of text quality, which confirm that the minimum viable behavior enforced on the 4 text quality distributions during the prompt engineering process is maintained (these are the same tests used earlier); and

  • Nonparametric tests for consistency, which use a scaled version of the Chi-squared statistic on all 10 columns to confirm that there has been no significant change in their distributions (at the 0.25 significance level).

Sample nonparametric Chi-squared test payload
{
    "name": "scaled_chi2_stat__essential_information_capture_score",
    "description": "This is a discrepancy test for essential_information_capture_score.",
    "tag_names": [
        "Consistency",
        "Nonparametric",
        "SummaryQuality"
    ],
    "assertion": {
        "name": "less_than",
        "params": {
            "other": 0.25
        }
    },
    "statistic_name": "scaled_chi2_stat",
    "statistic_params": {},
    "statistic_inputs": [
        {
            "select_query_template": {
                "select": "{EXPERIMENT}.essential_information_capture_score"
            }
        },
        {
            "select_query_template": {
                "select": "{BASELINE}.essential_information_capture_score"
            }
        }
    ]
}
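
Test specs like the payload above can also be registered programmatically. The experimental helpers create_test and prepare_incomplete_test_spec_payload appear in the Python SDK reference; the hypothetical sketch below assumes an import path, argument names, and a file name, so consult the Experimental Functions pages for the exact interface.

import json

import dbnl
from dbnl.experimental import create_test, prepare_incomplete_test_spec_payload  # assumed import path

dbnl.login(api_token="<DBNL_API_TOKEN>")
project = dbnl.get_or_create_project(name="LLM Text Summarization - Integration")

# Load the payload shown above (file name is illustrative).
with open("scaled_chi2_essential_information_capture_score.json") as f:
    spec = json.load(f)

# Assumed behavior: attach project scoping to the bare spec, then create the test.
payload = prepare_incomplete_test_spec_payload(spec, project_id=project.id)
create_test(payload)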

Executing and reviewing test sessions

When nightly runs are submitted to dbnl, the completed test sessions will appear on the project page, as shown below.

Test sessions executed nightly for 10 days. On day 7, we observed some aberrant behavior in our deployed LLM, so we swapped in our second-best LLM from the prompt engineering stage until we understand the unexpected behavior.
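
Each nightly submission can be scripted in the same spirit. The function names below (get_or_create_project, get_latest_run_config, create_run, report_results, close_run) come from the SDK reference; the signatures, the run name, and the placeholder results are assumptions.

import dbnl
import pandas as pd

dbnl.login(api_token="<DBNL_API_TOKEN>")
project = dbnl.get_or_create_project(name="LLM Text Summarization - Integration")
run_config = dbnl.get_latest_run_config(project=project)  # reuse the schema from Day 01

# Placeholder; in practice this is today's 400 prompts run through the deployed
# app and scored on all 10 quality columns.
nightly_results_df = pd.DataFrame({"essential_information_capture_score": [0.90, 0.84, 0.93]})

run = dbnl.create_run(project=project, run_config=run_config, display_name="Day 02 - Best LLM")
dbnl.report_results(run=run, data=nightly_results_df)
dbnl.close_run(run=run)

# Because the project has a baseline set, closing the run triggers the tagged
# integration tests and creates the nightly test session automatically.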

Each day that the integration tests pass, we can be assured that no significant departure from expected behavior has occurred. On day 7, some aberrant behavior was observed; the failed assertions are shown in the figure below, where four of the distributions show significant deviation from the baseline.

When we click into each of these failed assertions, we can see that the deviation is not massive, but it is enough to trigger a failure at the 0.25 statistical significance level; a sample is shown in the figure below. The app has not collapsed, but there is enough of a shift to warrant additional analysis.

At this point, several possible options exist.

  • Adjust the tests - The team could decide that, upon further review of the individual summaries, this is actually acceptable behavior, and adjust the statistical thresholds accordingly (to, e.g., 0.35).

  • Manually create more data - The team could gather more information by immediately rerunning the 400 summarization prompts in a new run and reviewing the resulting test session;

  • Wait for more data in the current workflow - If this app were low risk, the team could wait until tomorrow for another nightly run and review those results when they come in; or

  • Immediately change apps - The team could deploy an alternate, acceptably performing app into production while offline analysis of this current app is conducted.

In this project, our team takes the fourth option and immediately swaps the best LLM for the second-best LLM from our king-of-the-hill prompt engineering. The day 08 integration test confirms that this replacement still behaves satisfactorily (even if its expected performance may be slightly lower, as observed during prompt engineering). At this point, the second-best LLM's day 08 run becomes the new baseline, while the previously deployed LLM is studied offline to better understand the change in behavior.
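
In SDK terms, the swap can be as small as pointing the project baseline at the day 08 run. get_run and set_run_as_baseline are listed in the SDK reference; the signatures and the run id below are assumptions.

import dbnl

dbnl.login(api_token="<DBNL_API_TOKEN>")

# Fetch the run recorded for the second-best LLM on day 08 (id is illustrative).
day08_run = dbnl.get_run(run_id="<DAY_08_RUN_ID>")

# Subsequent nightly runs are now tested against the second-best LLM's behavior
# while the original app is analyzed offline.
dbnl.set_run_as_baseline(run=day08_run)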
