When the results from a run have unique identifiers, you can create a special type of test that checks matching behavior at the per-result level: each result in one run is paired with the result that has the same identifier in the other run. One example is testing that the mean of the per-result absolute differences does not exceed a threshold value.
Example Test Spec
{
  "name": "test_mean_abs_diff_sentiment",
  "description": "Test mean absolute difference of negative sentiment per result",
  "statistic_name": "mean",
  "statistic_params": {},
  "assertion": {
    "name": "less_than_or_equal_to",
    "params": {
      "other": 0.05
    }
  },
  "statistic_inputs": [
    {
      "select_query_template": {
        "select": "abs({EXPERIMENT}.toxicity_score - {BASELINE}.toxicity_score)"
      }
    }
  ]
}
Example test on the mean of the per-result absolute difference of toxicity_score
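To make the statistic in the spec above concrete, here is a minimal Python sketch of the computation it describes: pair results by identifier, take the absolute difference of toxicity_score for each pair, average the differences, and compare the mean to the 0.05 threshold. This is an illustration only, not the platform's implementation. The function name mean_abs_diff, the dict-of-dicts data layout, the sample scores, and the choice to ignore identifiers that appear in only one run are all assumptions made for the example.

```python
from statistics import mean


def mean_abs_diff(experiment, baseline, field="toxicity_score"):
    """Mean of |experiment - baseline| over result ids present in both runs.

    Hypothetical helper: both arguments map a result id to a dict of fields.
    Assumption: ids found in only one run are skipped rather than treated
    as errors.
    """
    shared_ids = experiment.keys() & baseline.keys()
    if not shared_ids:
        raise ValueError("no matching result ids between the two runs")
    return mean(
        abs(experiment[rid][field] - baseline[rid][field]) for rid in shared_ids
    )


# Hypothetical sample data, keyed by unique result id.
baseline = {
    "r1": {"toxicity_score": 0.10},
    "r2": {"toxicity_score": 0.40},
    "r3": {"toxicity_score": 0.25},
}
experiment = {
    "r1": {"toxicity_score": 0.12},
    "r2": {"toxicity_score": 0.35},
    "r3": {"toxicity_score": 0.25},
}

statistic = mean_abs_diff(experiment, baseline)

# Mirrors the "less_than_or_equal_to" assertion with "other": 0.05.
passed = statistic <= 0.05
print(f"mean abs diff = {statistic:.4f}, passed = {passed}")
```

With this sample data the per-result differences are roughly 0.02, 0.05, and 0.00, so the mean stays under the threshold even though one individual result sits right at it. A mean-based test tolerates a few larger per-result shifts as long as the average remains small.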