Integration testing for text summarization

After our summarization app is deployed, we conduct nightly integration tests to confirm that its behavior remains acceptable.

Integration testing for consistency and viability

Continually testing the deployed app for consistency helps ensure that its behavior continues to match the expectations set at deployment. In this scenario, we consider our OpenAI - Basic Summary app, the winner of the constrained optimization. In a new project focused on integration testing, we create a run recording the app's behavior on the day of deployment and title it Day 01 - Best LLM. We set this run as the project baseline so that each incoming run, submitted daily, immediately triggers the integration tests and creates a new test session.
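The exact setup mechanics depend on the dbnl SDK, which is not shown in this section. The sketch below is a self-contained stand-in (the Project class and log_run function are hypothetical helpers, not dbnl API calls) meant only to make the baseline-plus-nightly-runs workflow concrete.

# Workflow sketch only: Project and log_run are hypothetical stand-ins, not the
# dbnl SDK. They illustrate one baseline run pinned at deployment and a new run
# each night that immediately triggers a test session against that baseline.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Project:
    name: str
    baseline: Optional[str] = None
    runs: List[str] = field(default_factory=list)

def log_run(project: Project, run_name: str) -> str:
    """Record a run; once a baseline is set, each new run is tested against it."""
    project.runs.append(run_name)
    if project.baseline is not None:
        print(f"Test session created: {run_name} vs. baseline {project.baseline}")
    return run_name

# Day of deployment: record the winner of the constrained optimization and pin it.
project = Project("Summarization - Integration Testing")
project.baseline = log_run(project, "Day 01 - Best LLM")

# Nightly: every incoming run is immediately compared against the Day 01 baseline.
for day in range(2, 11):
    log_run(project, f"Day {day:02d} - Best LLM")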

The tests under consideration are tagged into 3 groups:

  • Parametric tests of summary quality, which confirm that the median of each of the 6 summary quality distributions matches the baseline;

  • Parametric tests of text quality, which confirm that the minimum viable behavior enforced on the 4 text quality distributions during the prompt engineering process is maintained (these are the exact same tests from earlier); and

  • Nonparametric tests for consistency, which use a scaled version of the Chi-squared statistic on all 10 columns to confirm that no significant change in the distributions has occurred (at the 0.25 significance level).
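Before looking at a full dbnl test payload, here is a minimal local sketch of what the first group of parametric checks asserts. Since only the nonparametric payload is shown in this section, the tolerance and the simple median comparison below are assumptions, not the dbnl statistic itself.

# Sketch of a parametric median check on one summary quality column, assuming
# the test just requires the experiment median to sit within a tolerance of
# the baseline median; both the tolerance and the comparison are assumptions.
import numpy as np

def median_matches_baseline(experiment, baseline, tol=0.25):
    """Pass if the experiment median stays within `tol` of the baseline median."""
    return abs(np.median(experiment) - np.median(baseline)) <= tol

# e.g. applied to one of the 6 summary quality columns, with synthetic scores
rng = np.random.default_rng(7)
baseline_scores = rng.normal(4.0, 0.5, size=400)  # stand-in for Day 01 - Best LLM
todays_scores = rng.normal(4.0, 0.5, size=400)    # stand-in for today's run
print(median_matches_baseline(todays_scores, baseline_scores))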

Sample nonparametric Chi-squared test payload
{
    "name": "scaled_chi2_stat__essential_information_capture_score",
    "description": "This is a discrepancy test for essential_information_capture_score.",
    "tag_names": [
        "Consistency",
        "Nonparametric",
        "SummaryQuality"
    ],
    "assertion": {
        "name": "less_than",
        "params": {
            "other": 0.25
        }
    },
    "statistic_name": "scaled_chi2_stat",
    "statistic_params": {},
    "statistic_inputs": [
        {
            "select_query_template": {
                "select": "{EXPERIMENT}.essential_information_capture_score"
            }
        },
        {
            "select_query_template": {
                "select": "{BASELINE}.essential_information_capture_score"
            }
        }
    ]
}
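This section does not define scaled_chi2_stat precisely, but a reasonable local analogue is a chi-squared discrepancy between the binned experiment and baseline columns, normalized to roughly the [0, 1] range (i.e., Cramér's V). The sketch below implements that assumed reading; it is not dbnl's internal implementation.

# Illustrative analogue of scaled_chi2_stat: a chi-squared discrepancy between
# two samples of the same column, scaled like Cramér's V so it is comparable
# across sample sizes. This is an assumption about the statistic, not dbnl code.
import numpy as np
from scipy import stats

def scaled_chi2_stat(experiment, baseline, bins=10):
    """Chi-squared discrepancy between two samples, scaled to roughly [0, 1]."""
    pooled = np.concatenate([experiment, baseline])
    edges = np.histogram_bin_edges(pooled, bins=bins)
    table = np.vstack([
        np.histogram(experiment, bins=edges)[0],
        np.histogram(baseline, bins=edges)[0],
    ])
    table = table[:, table.sum(axis=0) > 0]     # drop bins empty in both samples
    chi2 = stats.chi2_contingency(table)[0]
    return float(np.sqrt(chi2 / table.sum()))   # 2-row table => Cramér's V

# The assertion in the payload above is "less_than 0.25" on this statistic.
rng = np.random.default_rng(0)
baseline_col = rng.normal(4.0, 0.5, size=400)  # e.g. essential_information_capture_score, Day 01
todays_col = rng.normal(4.0, 0.5, size=400)    # same column from today's run
value = scaled_chi2_stat(todays_col, baseline_col)
print(value, value < 0.25)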

Executing and reviewing test sessions

When nightly runs are submitted to dbnl, the completed test sessions will appear on the project page, as shown below.

Test sessions executed nightly for 10 days. On day 7, we observed some aberrant behavior in our deployed LLM, so we swapped in our second-best LLM from the prompt engineering phase until we understand the unexpected behavior.

Each day the integration tests pass, we can be assured that no significant departure from expected behavior has occurred. On day 7, however, some aberrant behavior was observed; the failed assertions are shown in the figure below.

Four of the distributions have shown significant deviation from the baseline.

Clicking into each of these failed assertions shows that the deviation is not massive, but it is enough to trigger a failure at the 0.25 significance level. A sample is shown in the figure below.

A deviation in behavior was observed. The app has not collapsed, but there is enough shift to warrant additional analysis.

At this point, several possible options exist.

  • Adjust the tests - The team could decide that, upon further review of the individual summaries, this is actually acceptable behavior and adjust the statistical thresholds accordingly (e.g., to 0.35; see the snippet after this list).

  • Manually create more data - The team could gather more information by immediately rerunning the 400 summarization prompts manually, producing a new run and a resulting test session;

  • Wait for more data in the current workflow - If this app were low risk, the team could wait until tomorrow for another nightly run and review those results when they come in; or

  • Immediately change apps - The team could deploy an alternate, acceptably performing app into production while offline analysis of this current app is conducted.
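For the first option, the only change to each consistency test payload would be the assertion threshold; mirrored here as a Python dict, the relaxed assertion would look something like this:

# Loosened assertion for the consistency tests (first option above): the only
# field that changes from the sample payload is the threshold, 0.25 -> 0.35.
relaxed_assertion = {
    "name": "less_than",
    "params": {"other": 0.35},
}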

In this project, our team takes the fourth option and immediately swaps out the best LLM for the second-best LLM from our king-of-the-hill prompt engineering. The day 08 integration test confirms that this app still behaves satisfactorily (even if its expected performance is slightly lower, as observed during prompt engineering). At this point, the second-best LLM's day 08 run becomes the new baseline, while the previously best LLM is studied offline to better understand the change in behavior.
