1 of 6

Testing Strategies

Like in traditional software testing, it is paramount to come up with a testing strategy that has both breadth and depth. Such a set of tests gives confidence that the AI-powered app is behaving as expected, or it alerts you that the opposite may be true.

To build out a comprehensive testing strategy it is important to come up with a series of assertions and statistics on which to create tests. This section explains several goals when testing and how to create tests to assert desired behavior.

Test That a Given Distribution Has Certain Properties

A common type of test is testing whether a single distribution contains some property of interest. Generally, this means determining whether some statistics for the distribution of interest exceeds some threshold. Some examples of this can be testing the toxicity of a given LLM or the latency for the entire AI-powered application.

This is especially common for development testing, where it is important to test if a proposed app reaches the minimum threshold for what is acceptable.

Example Test Spec

{
    "name": "p95_app_latency_ms",
    "description": "Test the 95th percentile of latency in miliseconds",
    "statistic_name": "percentile",
    "statistic_params": {"percentage": 0.95},
    "assertion": {
        "name": "less_than_or_equal_to",
        "params": {
            "other": 180.0,
        },
    },
    "statistic_inputs": [
        {
            "select_query_template": {
                "select": "{EXPERIMENT}.app_latency_ms"
            }
        },
    ],
}

Test That Distributions Have the Same Statistics

Another type of test is testing if a particular statistic is similar for two different distributions. For example, this can be testing if the absolute difference of medians of sentiment score between the experiment run and the baseline run is small — that is, the scores are close to one another.

Example Test Spec

{
    "name": "median_sentiment_similar",
    "description": "Test the absolute difference of median on sentiment",
    "statistic_name": "abs_diff_median",
    "statistic_params": {},
    "assertion": {
        "name": "close_to",
        "params": {
            "other": 0.0,
            "tolerance": 0.01,
        },
    },
    "statistic_inputs": [
        {
            "select_query_template": {
                "select": "{EXPERIMENT}.positive_sentiment_score"
            }
        },
        {
            "select_query_template": {
                "select": "{BASELINE}.positive_sentiment_score"
            }
        },
    ],
}

Test That Columns Are Similarly Distributed

One general approach to test if two columns are similarly distributed is using a nonparametric statistic. DBNL offers two such statistics: scaled_ks_stat for testing ordinal distributions and scaled_chi2_stat for testing nominal distributions.

Example Test Spec

{
    "name": "discrepancy_of_text_coherence_score",
    "description": "Test the nonparametric discrepancy of the coherence score distributions",
    "statistic_name": "scaled_ks_stat",
    "statistic_params": {},
    "assertion": {
        "name": "less_than_or_equal_to",
        "params": {
            "other": 0.25,
        },
    },
    "statistic_inputs": [
        {
            "select_query_template": {
                "select": "{EXPERIMENT}.coherence_score"
            }
        },
        {
            "select_query_template": {
                "select": "{BASELINE}.coherence_score"
            }
        },
    ],
}

Test That Specific Results Have Matching Behavior

When the results from a run have unique identifiers, one can create a special type of tests for testing matching behavior at a per-result level. One example would be testing the mean of per-result absolute difference does not exceed a threshold value.

Example Test Spec

{
    "name": "test_mean_abs_diff_sentiment",
    "description": "Test mean absolute difference of negative sentiment per result",
    "statistic_name": "mean",
    "statistic_params": {},
    "assertion": {
        "name": "less_than_or_equal_to",
        "params": {
            "other": 0.05,
        },
    },
    "statistic_inputs": [
        {
            "select_query_template": {
                "select": "abs({EXPERIMENT}.toxicity_score - {BASELINE}.toxicity_score)"
            }
        },
    ],
}

Test That Distributions Are Not the Same

Another testing case is testing whether two distributions are not the same. Such a test involves the same statistics as a test of consistency, but a different assertion. One example could be to change the assertion from close_to to greater_than and thereby state that a passed test requires a difference bigger than some threshold.

Example Test Spec

{
    "name": "test_lower_toxicity_score",
    "description": "Test mean of the toxicity score is lower than baseline",
    "statistic_name": "diff_mean",
    "statistic_params": {},
    "assertion": {
        "name": "less_than",
        "params": {
            "other": -0.1,
        },
    },
    "statistic_inputs": [
        {
            "select_query_template": {
                "select": "{EXPERIMENT}.toxicity_score"
            }
        },
        {
            "select_query_template": {
                "select": "{BASELINE}.toxicity_score"
            }
        },
    ],
}