Integration testing for text summarization
After our summarization app is deployed, we run nightly integration tests to confirm that its behavior remains acceptable.
Continually testing the deployed app for consistency helps ensure that its behavior continues to match the expectations set at deployment. Here we consider our OpenAI - Basic Summary app, the winner of the constrained optimization. In this new project, focused on integration testing, we create a run recording the app's behavior on the day of deployment and title it Day 01 - Best LLM. We set this run as the project baseline so that incoming runs, conducted daily, immediately trigger the integration tests and create a new test session.
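Conceptually, each nightly submission looks like the sketch below. The `client`, `create_run`, and `close_run` names are hypothetical placeholders rather than the exact dbnl SDK calls configured in the project; only the overall shape of the workflow, computing results for the fixed prompt set and uploading them as a run so the baseline comparison triggers a test session, is taken from this setup.

```python
import pandas as pd


def submit_nightly_run(client, project_name: str, day: int, results: pd.DataFrame) -> None:
    """Record today's behavior on the 400 summarization prompts as a new run.

    `client`, `create_run`, and `close_run` are hypothetical stand-ins for the
    dbnl SDK calls used in the project; `results` holds the 10 metric columns
    (6 summary quality + 4 text quality) computed for today's outputs.
    """
    run = client.create_run(
        project=project_name,
        name=f"Day {day:02d} - Nightly summarization run",
        data=results,
    )
    # Because "Day 01 - Best LLM" is the project baseline, completing this run
    # triggers the integration tests and creates a new test session.
    client.close_run(run)
```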
The tests under consideration are tagged into 3 groups:
Parametric tests of summary quality, which confirm that the median of each of the 6 summary quality distributions matches the baseline;
Parametric tests of text quality, which confirm that the minimum viable behavior enforced on the 4 text quality distributions during prompt engineering is maintained (these are the same tests used earlier); and
Nonparametric tests for consistency on all 10 columns, using a scaled version of the chi-squared statistic, confirming no significant change in the distributions (at the 0.25 significance level). A sketch of how such checks might be computed appears below.
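The following sketch illustrates the three kinds of checks with scipy. The column-level floors, the use of Mood's median test, and the particular scaling of the chi-squared statistic are illustrative assumptions, not the exact assertions configured in dbnl.

```python
import numpy as np
from scipy import stats

ALPHA = 0.25  # significance level used for the consistency tests in this project


def median_matches_baseline(baseline: np.ndarray, candidate: np.ndarray) -> bool:
    """Check that a summary quality metric's median is statistically
    indistinguishable from the baseline median. Mood's median test is used
    here as an illustrative stand-in for the dbnl assertion."""
    _, p_value, _, _ = stats.median_test(baseline, candidate)
    return p_value > ALPHA


def minimum_viable_behavior(candidate: np.ndarray, floor: float) -> bool:
    """Check that a text quality metric never drops below the floor enforced
    during prompt engineering."""
    return float(np.min(candidate)) >= floor


def distributions_consistent(baseline: np.ndarray, candidate: np.ndarray, bins: int = 10) -> bool:
    """Nonparametric consistency check: bin both samples on a shared grid and
    compare with a chi-squared statistic, scaling the baseline counts to the
    candidate sample size so the expected counts are comparable."""
    edges = np.histogram_bin_edges(np.concatenate([baseline, candidate]), bins=bins)
    baseline_counts, _ = np.histogram(baseline, bins=edges)
    candidate_counts, _ = np.histogram(candidate, bins=edges)
    expected = baseline_counts * (candidate_counts.sum() / baseline_counts.sum())
    mask = expected > 0
    chi2_stat = np.sum((candidate_counts[mask] - expected[mask]) ** 2 / expected[mask])
    p_value = stats.chi2.sf(chi2_stat, df=mask.sum() - 1)
    return p_value > ALPHA
```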
When nightly runs are submitted to dbnl, the completed test sessions will appear on the project page, as shown below.
Each day that an integration test passes, we can be assured that no significant departure from expected behavior has occurred. On day 7, some aberrant behavior was observed; the failed assertions are shown in the figure below.
When we click into each of these failed assertions, we can see that the deviation is not massive, but it is enough to trigger a failure at the 0.25 significance level. A sample is shown in the figure below.
At this point, the team has several options:
Adjust the tests - The team could decide that, upon further review of the individual summaries, this is actually acceptable behavior, and adjust the statistical thresholds accordingly (e.g., to 0.35);
Manually create more data - The team could gather more information by immediately rerunning the 400 summarization prompts as a new run and resulting test session;
Wait for more data in the current workflow - If this app were low risk, the team could wait until tomorrow for another nightly run and review those results when they come in; or
Immediately change apps - The team could deploy an alternate, acceptably performing app into production while offline analysis of this current app is conducted.
In this project, our team takes the 4th option and immediately swaps the best LLM for the second-best LLM from our king-of-the-hill prompt engineering. The day 08 integration test confirms that this app still exhibits satisfactory behavior (even if its expected performance is slightly lower, as observed during prompt engineering). At this point, the second-best LLM from day 08 becomes the new baseline, while the previously best LLM is studied offline to better understand its change in behavior.
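The swap itself amounts to re-pointing the project baseline at the day 08 run. The helper below is a hypothetical sketch of that step; `get_run` and `set_baseline` are placeholder names, not the actual dbnl API.

```python
def promote_new_baseline(client, project_name: str, run_name: str) -> None:
    """Hypothetical sketch: make the day 08 run (second-best LLM) the new
    baseline, so subsequent nightly test sessions are compared against it
    while the original best LLM is studied offline."""
    run = client.get_run(project=project_name, name=run_name)
    client.set_baseline(project=project_name, run=run)
```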