Prompt Engineering

We execute a king-of-the-hill constrained optimization process to identify a high-performing, suitably coherent summarization app.

Our goal is to build a summarization app “from scratch,” which will require all of the structure defined in the earlier Setting the Scene discussion, as well as the prompt engineering conducted in this section. In particular, our app development team has been asked to conduct prompt engineering iteratively and to consider both Mistral-7b and OpenAI’s ChatGPT-3.5 as possible LLM engines.

Designing the constrained optimization problem

As can be seen in the run config for these runs, there is a column called summary_total_score, which is the sum of the six summary quality columns below (a short sketch of this aggregation follows the list):

  • summary_coherence_score,

  • content_relevance_score,

  • bias_assessment_score,

  • logical_sequencing_score,

  • essential_information_capture_score, and

  • bias_evaluation_score.
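For concreteness, the aggregation might look like the following pandas sketch. The DataFrame name and the example values are assumptions for illustration only; in practice these columns come from the app's evaluation pipeline.

import pandas as pd

# Hypothetical per-response results; in practice these come from the eval pipeline.
results = pd.DataFrame({
    "summary_coherence_score": [4, 5, 3],
    "content_relevance_score": [5, 4, 4],
    "bias_assessment_score": [5, 5, 4],
    "logical_sequencing_score": [4, 4, 5],
    "essential_information_capture_score": [3, 5, 4],
    "bias_evaluation_score": [5, 4, 5],
})

# summary_total_score is the row-wise sum of the six summary quality columns.
results["summary_total_score"] = results.sum(axis=1)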

This column is the quantity by which we define the performance of our text summarization app; in particular, the king-of-the-hill winner of each prompt engineering comparison is the prompt+LLM combination with the higher average summary_total_score.

Test payload for asserting superior performance
{
    "name": "diff_mean__summary_total_score",
    "description": "This is a diff_mean test for summary_total_score.",
    "tag_names": [],
    "assertion": {
        "name": "greater_than",
        "params": {
            "other": 0
        }
    },
    "statistic_name": "diff_mean",
    "statistic_params": {},
    "statistic_inputs": [
        {
            "select_query_template": {
                "select": "{EXPERIMENT}.summary_total_score"
            }
        },
        {
            "select_query_template": {
                "select": "{BASELINE}.summary_total_score"
            }
        }
    ]
}
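For intuition, the check this payload encodes is roughly the following. This is an illustrative re-implementation, not the platform's evaluation logic, and the example arrays are made up.

import numpy as np

def diff_mean_greater_than(experiment_scores, baseline_scores, other=0.0):
    # diff_mean statistic: mean({EXPERIMENT}) - mean({BASELINE});
    # greater_than assertion: the test passes when the difference exceeds `other`.
    return (np.mean(experiment_scores) - np.mean(baseline_scores)) > other

# Hypothetical summary_total_score values for the challenger and the incumbent baseline.
experiment = np.array([24, 27, 25, 26])
baseline = np.array([23, 24, 22, 25])
print(diff_mean_greater_than(experiment, baseline))  # True: the challenger wins on average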

In addition to this performance test, we also impose four checks on text viability (not summary quality) that all candidate apps must satisfy before being considered. The 10th percentile of each of the following columns must be greater than or equal to 2 (recall these are evaluated on a 1-5 scale):

  • text_fluency_score,

  • text_toxicity_score,

  • sentiment_assessment_score, and

  • reading_complexity_score.

This assures us that none of our candidate apps violates expectations around, e.g., toxicity, even if it delivers high summarization performance.

Example test payload for minimal text viability
{
    "name": "percentile__text_fluency_score",
    "description": "This is a percentile test for text_fluency_score.",
    "tag_names": [],
    "assertion": {
        "name": "greater_than_or_equal_to",
        "params": {
            "other": 2
        }
    },
    "statistic_name": "percentile",
    "statistic_params": {
        "percentage": 0.1
    },
    "statistic_inputs": [
        {
            "select_query_template": {
                "select": "{EXPERIMENT}.text_fluency_score"
            }
        }
    ]
}
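Again for intuition only, the percentile assertion above amounts to a check like the one below; the scores are made-up examples.

import numpy as np

def percentile_at_least(scores, percentage=0.1, other=2):
    # percentile statistic with percentage=0.1 (the 10th percentile);
    # greater_than_or_equal_to assertion: passes when that percentile is at least `other`.
    return np.percentile(scores, percentage * 100) >= other

# Hypothetical text_fluency_score values on the 1-5 scale.
print(percentile_at_least(np.array([3, 4, 2, 5, 4, 3, 2, 4, 5, 3])))  # True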

Executing the prompt engineering

To start the process, the team is asked to create a “Critical Summary”. After some deliberation, they develop the following prompt:

Offer a summary that evaluates the strengths and weaknesses of the arguments in this document: Raw Document: {raw_document}.
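As a concrete illustration of how such a prompt template might be applied, the snippet below fills the {raw_document} placeholder for each source document. The document list and the downstream LLM call are hypothetical; the tutorial does not prescribe this code.

# The Critical prompt template from above, with {raw_document} as its placeholder.
CRITICAL_PROMPT = (
    "Offer a summary that evaluates the strengths and weaknesses of the "
    "arguments in this document: Raw Document: {raw_document}."
)

def build_prompts(raw_documents):
    # Fill the template once per raw document before sending it to Mistral-7b or OpenAI.
    return [CRITICAL_PROMPT.format(raw_document=doc) for doc in raw_documents]

# Hypothetical usage: raw_documents would be the 400 source documents.
prompts = build_prompts(["<document text 1>", "<document text 2>"])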

After creating 400 Critical summaries with Mistral-7b, the associated run is pushed to dbnl; it is then set as the baseline, either in code or as described in Executing Tests Via SDK. When the 400 Critical summaries for OpenAI are created and uploaded as a run, the test session comparing the two is conducted automatically. It should look like the following:

[Figure: A completed test session showing that the Critical prompt used in OpenAI outperformed the Critical prompt used in Mistral.]

As we can see, this test session shows all assertions passing, which tells us both that the Critical prompt run through OpenAI outperforms the Critical prompt run through Mistral and that none of the minimum quality standards are violated. At this point, OpenAI - Critical can be made the new baseline in our testing protocol.
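The loop described in the last two paragraphs might look roughly like the following with the Python SDK. The function names appear in the SDK reference in this documentation, but the parameter names, file names, and project name shown here are assumptions for illustration; consult the SDK pages (for example create_run, report_results, and set_run_as_baseline) for exact signatures.

import dbnl
import pandas as pd

# NOTE: parameter names and values below are illustrative assumptions.
dbnl.login(api_token="...")  # authenticate against your Distributional deployment

project = dbnl.get_or_create_project(name="llm-text-summarization")  # hypothetical name
run_config = dbnl.get_latest_run_config(project=project)  # config containing summary_total_score et al.

# Push the 400 Mistral-7b Critical summaries as a run and promote it to baseline.
mistral_df = pd.read_parquet("mistral_critical_results.parquet")  # hypothetical file
mistral_run = dbnl.create_run(project=project, run_config=run_config)
dbnl.report_results(run=mistral_run, column_data=mistral_df)
dbnl.close_run(run=mistral_run)
dbnl.set_run_as_baseline(run=mistral_run)

# Uploading the 400 OpenAI Critical summaries then triggers the comparison test session.
openai_df = pd.read_parquet("openai_critical_results.parquet")  # hypothetical file
openai_run = dbnl.create_run(project=project, run_config=run_config)
dbnl.report_results(run=openai_run, column_data=openai_df)
dbnl.close_run(run=openai_run)

# If OpenAI - Critical wins and passes the viability checks, it becomes the new baseline.
dbnl.set_run_as_baseline(run=openai_run)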

After this Critical summary, the team takes turns generating an Analytical summary, a Narrative summary, and a Basic summary, testing each with both OpenAI and Mistral. This process is captured in the list of completed test sessions below.

[Figure: Each test session represents a pairwise comparison of performance between the incumbent best performer and the newest alternative.]

When a test session fails, we know that either the performance is not superior or one of the minimum text quality standards is violated. In particular, it can be seen below that the OpenAI - Analytical Summary does have superior performance relative to the baseline, but that it does not meet the minimum toxicity standard.

[Figure: Because the toxicity level for this OpenAI - Analytical Summary is too high, the test session is a failure, even though the summarization quality is superior.]

By the end of this iterative prompt engineering process, the OpenAI - Basic Summary is the winner and is the candidate considered for production. Mistral - Basic Summary is the second-best candidate, which will be relevant during our integration testing discussion.

