Prompt Engineering

We execute a king-of-the-hill constrained optimization process to identify a high-performing, suitably coherent summarization app.

Our goal is to build a summarization app “from scratch,” which will require all of the structure defined in the earlier Setting the Scene discussion, as well as the prompt engineering conducted in this section. In particular, our app development team has been asked to conduct prompt engineering iteratively and to consider both Mistral-7b and OpenAI’s ChatGPT-3.5 as possible LLM engines.

Designing the constrained optimization problem

As can be seen in the run config for these runs, there is a column called summary_total_score, which is the sum of the six summary quality columns below (a short sketch of this aggregation follows the list):

  • summary_coherence_score,

  • content_relevance_score,

  • bias_assessment_score,

  • logical_sequencing_score,

  • essential_information_capture_score, and

  • bias_evaluation_score.
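For concreteness, the aggregation might look like the following pandas sketch. The DataFrame name and the example values are assumptions for illustration only; in practice these columns come from the app's evaluation pipeline.

import pandas as pd

# Hypothetical per-response results; in practice these come from the eval pipeline.
results = pd.DataFrame({
    "summary_coherence_score": [4, 5, 3],
    "content_relevance_score": [5, 4, 4],
    "bias_assessment_score": [5, 5, 4],
    "logical_sequencing_score": [4, 4, 5],
    "essential_information_capture_score": [3, 5, 4],
    "bias_evaluation_score": [5, 4, 5],
})

# summary_total_score is the row-wise sum of the six summary quality columns.
results["summary_total_score"] = results.sum(axis=1)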

This column is the quantity by which we define the performance of our text summarization app; in particular, the king-of-the-hill winner of each prompt engineering comparison is the prompt+LLM combination with the higher average summary_total_score.

Test payload for asserting superior performance
{
    "name": "diff_mean__summary_total_score",
    "description": "This is a diff_mean test for summary_total_score.",
    "tag_names": [],
    "assertion": {
        "name": "greater_than",
        "params": {
            "other": 0
        }
    },
    "statistic_name": "diff_mean",
    "statistic_params": {},
    "statistic_inputs": [
        {
            "select_query_template": {
                "select": "{EXPERIMENT}.summary_total_score"
            }
        },
        {
            "select_query_template": {
                "select": "{BASELINE}.summary_total_score"
            }
        }
    ]
}
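For intuition, the check this payload encodes is roughly the following. This is an illustrative re-implementation, not the platform's evaluation logic, and the example arrays are made up.

import numpy as np

def diff_mean_greater_than(experiment_scores, baseline_scores, other=0.0):
    # diff_mean statistic: mean({EXPERIMENT}) - mean({BASELINE});
    # greater_than assertion: the test passes when the difference exceeds `other`.
    return (np.mean(experiment_scores) - np.mean(baseline_scores)) > other

# Hypothetical summary_total_score values for the challenger and the incumbent baseline.
experiment = np.array([24, 27, 25, 26])
baseline = np.array([23, 24, 22, 25])
print(diff_mean_greater_than(experiment, baseline))  # True: the challenger wins on average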

In addition to this performance test, we also impose four checks on text viability (not summary quality) that all candidate apps must satisfy before being considered. The 10th percentile of each of the following columns must be greater than or equal to 2 (recall these are evaluated on a 1-5 scale):

  • text_fluency_score,

  • text_toxicity_score,

  • sentiment_assessment_score, and

  • reading_complexity_score.

This assures us that none of our candidate apps violates expectations around, e.g., toxicity, even if it delivers high summarization performance.

Example test payload for minimal text viability
{
    "name": "percentile__text_fluency_score",
    "description": "This is a percentile test for text_fluency_score.",
    "tag_names": [],
    "assertion": {
        "name": "greater_than_or_equal_to",
        "params": {
            "other": 2
        }
    },
    "statistic_name": "percentile",
    "statistic_params": {
        "percentage": 0.1
    },
    "statistic_inputs": [
        {
            "select_query_template": {
                "select": "{EXPERIMENT}.text_fluency_score"
            }
        }
    ]
}
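Again for intuition only, the percentile assertion above amounts to a check like the one below; the scores are made-up examples.

import numpy as np

def percentile_at_least(scores, percentage=0.1, other=2):
    # percentile statistic with percentage=0.1 (the 10th percentile);
    # greater_than_or_equal_to assertion: passes when that percentile is at least `other`.
    return np.percentile(scores, percentage * 100) >= other

# Hypothetical text_fluency_score values on the 1-5 scale.
print(percentile_at_least(np.array([3, 4, 2, 5, 4, 3, 2, 4, 5, 3])))  # True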

Executing the prompt engineering

To start the process, the team is asked to create a “Critical Summary”. After some deliberation, they develop the following prompt:

Offer a summary that evaluates the strengths and weaknesses of the arguments in this document: Raw Document: {raw_document}.
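As a concrete illustration of how such a prompt template might be applied, the snippet below fills the {raw_document} placeholder for each source document. The document list and the downstream LLM call are hypothetical; the tutorial does not prescribe this code.

# The Critical prompt template from above, with {raw_document} as its placeholder.
CRITICAL_PROMPT = (
    "Offer a summary that evaluates the strengths and weaknesses of the "
    "arguments in this document: Raw Document: {raw_document}."
)

def build_prompts(raw_documents):
    # Fill the template once per raw document before sending it to Mistral-7b or OpenAI.
    return [CRITICAL_PROMPT.format(raw_document=doc) for doc in raw_documents]

# Hypothetical usage: raw_documents would be the 400 source documents.
prompts = build_prompts(["<document text 1>", "<document text 2>"])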

After creating 400 Critical summaries with Mistral-7b, the associated run is pushed to dbnl; it is then set as the baseline, either in code or as described in Executing Tests Via SDK. When the 400 Critical summaries for OpenAI are created and uploaded as a run, the test session comparing the two is conducted automatically. It should look like the following:

[Figure: A completed test session showing that the Critical prompt used in OpenAI outperformed the Critical prompt used in Mistral.]

As we can see, this test session shows all assertions passing, which tells us both that the Critical prompt run through OpenAI outperforms the Critical prompt run through Mistral and that none of the minimum quality standards are violated. At this point, OpenAI - Critical can be made the new baseline in our testing protocol.
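The loop described in the last two paragraphs might look roughly like the following with the Python SDK. The function names appear in the SDK reference in this documentation, but the parameter names, file names, and project name shown here are assumptions for illustration; consult the SDK pages (for example create_run, report_results, and set_run_as_baseline) for exact signatures.

import dbnl
import pandas as pd

# NOTE: parameter names and values below are illustrative assumptions.
dbnl.login(api_token="...")  # authenticate against your Distributional deployment

project = dbnl.get_or_create_project(name="llm-text-summarization")  # hypothetical name
run_config = dbnl.get_latest_run_config(project=project)  # config containing summary_total_score et al.

# Push the 400 Mistral-7b Critical summaries as a run and promote it to baseline.
mistral_df = pd.read_parquet("mistral_critical_results.parquet")  # hypothetical file
mistral_run = dbnl.create_run(project=project, run_config=run_config)
dbnl.report_results(run=mistral_run, column_data=mistral_df)
dbnl.close_run(run=mistral_run)
dbnl.set_run_as_baseline(run=mistral_run)

# Uploading the 400 OpenAI Critical summaries then triggers the comparison test session.
openai_df = pd.read_parquet("openai_critical_results.parquet")  # hypothetical file
openai_run = dbnl.create_run(project=project, run_config=run_config)
dbnl.report_results(run=openai_run, column_data=openai_df)
dbnl.close_run(run=openai_run)

# If OpenAI - Critical wins and passes the viability checks, it becomes the new baseline.
dbnl.set_run_as_baseline(run=openai_run)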

After this Critical summary, the team takes turns generating an Analytical summary, a Narrative summary, and a Basic summary, testing each with both OpenAI and Mistral. This process is captured in the list of completed test sessions below.

[Figure: Each test session represents a pairwise comparison of performance between the incumbent best performer and the newest alternative.]

When a test session fails, we know that either the performance is not superior or one of the minimum text quality standards is violated. In particular, it can be seen below that the OpenAI - Analytical Summary does have superior performance relative to the baseline, but that it does not meet the minimum toxicity standard.

[Figure: Because the toxicity level for this OpenAI - Analytical Summary is too high, the test session is a failure, even though the summarization quality is superior.]

By the end of this iterative prompt engineering process, the OpenAI - Basic Summary is the winner and is the candidate considered for production. Mistral - Basic Summary is the second-best candidate, which will be relevant during our integration testing discussion.

