Our tutorial focuses on the minimum factors required to facilitate testing, but here we discuss the complexity of an actual process.
Remember that the code snippets and API requests provided above are illustrative, offering a high-level overview. The actual implementation will depend on the specific requirements of your project. The prompts used here are for demonstration purposes; effective prompt generation often necessitates domain expertise and careful analysis.
Evaluating the consistency of generated text involves significant complexity. The combined use of various metrics in a single API request, as demonstrated, might not always be ideal. Depending on your specific use case, it may be more effective to evaluate each metric separately to gain a more detailed understanding of the performance and quality of the summaries.
Distributional's framework is designed to handle these complexities by providing a systematic approach to measure and analyze the stochastic nature of LLMs. This allows for the detection of non-stationary shifts in third-party applications, ensuring that changes or degradations in performance are identified promptly and accurately. By setting appropriate tests and thresholds for assertions, users can monitor and validate the consistency and quality of LLM outputs, supporting the ongoing maintenance and improvement of these models.
In this advanced tutorial, we demonstrate how to use dbnl to automatically evaluate the consistency of summarization output on a fixed set of documents.
The data required for this tutorial is available in the following files.
While summarization is the focus of this tutorial, the same principles can be applied to any task involving text generation. The goal is to evaluate the consistency of the generated text with the input text. Other tasks involving text generation include entity recognition, question answering, and machine translation.
This tutorial assumes that you have already completed the following tutorials: Hello World (Sentiment Classifier) and, ideally, Trading Strategy.
This tutorial requires a good deal of preparation, so it has been divided into the following four sections:
Defining the text summarization problem of interest, including the data source and the metrics,
Creating a constrained optimization problem to govern the development of a text summarization app in dbnl,
Managing the integration testing process for consistent testing after such an app has been created, and
Discussing practical considerations that would arise when actually building an LLM summarization tool.
We execute a king-of-the-hill constrained optimization process to identify a high performing, suitably coherent summarization app.
Our goal is to build a summarization app “from scratch,” which will require all of the structure defined in the earlier Setting the Scene discussion. It will also require Prompt Engineering, which is conducted in this section. In particular, our app development team has been told to conduct prompt engineering iteratively. The team is also told to consider both Mistral-7b and OpenAI’s GPT-3.5 as possible LLM engines.
As can be seen in the run config for these runs, there is a column called summary_total_score, which is the sum of the 6 summary quality columns: summary_coherence_score, content_relevance_score, bias_assessment_score, logical_sequencing_score, essential_information_capture_score, and bias_evaluation_score. This column is the quantity by which we define the performance of our text summarization app; in particular, the king-of-the-hill winner during each prompt engineering comparison will be the prompt+LLM with the higher average summary_total_score.
In addition to this summary score, we also impose 4 checks on text viability (not summary quality) that all possible apps must satisfy before being considered: the 10th percentile of each of the following columns must be greater than 1 (recall these are evaluated on a 1-5 scale): text_fluency_score, text_toxicity_score, sentiment_assessment_score, and reading_complexity_score. This assures us that none of our possible apps violates expectations around, e.g., toxicity, even if it has high summarization performance.
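As a concrete illustration of these two quantities, the sketch below computes them with pandas from a run's test data; the DataFrame df is a hypothetical stand-in for that data, but the column names are the ones listed above.

```python
import pandas as pd

# The 6 summary quality columns that sum to summary_total_score.
SUMMARY_QUALITY_COLS = [
    "summary_coherence_score",
    "content_relevance_score",
    "bias_assessment_score",
    "logical_sequencing_score",
    "essential_information_capture_score",
    "bias_evaluation_score",
]

# The 4 text viability columns with a 10th-percentile floor.
TEXT_VIABILITY_COLS = [
    "text_fluency_score",
    "text_toxicity_score",
    "sentiment_assessment_score",
    "reading_complexity_score",
]


def score_candidate(df: pd.DataFrame) -> dict:
    """Compute the king-of-the-hill objective and the viability constraint."""
    total = df[SUMMARY_QUALITY_COLS].sum(axis=1)        # summary_total_score per row
    viability = df[TEXT_VIABILITY_COLS].quantile(0.10)  # 10th percentile of each column
    return {
        "avg_summary_total_score": total.mean(),        # higher average wins
        "passes_viability_checks": bool((viability > 1).all()),
    }
```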
To start the process, the team is told to create a “Critical Summary.” After some deliberation, they develop the following prompt:
Offer a summary that evaluates the strengths and weaknesses of the arguments in this document: Raw Document: {raw_document}.
After creating 400 Critical summaries for Mistral-7b, the associated run is pushed to dbnl; it is then set as a baseline, either in code or as described in Executing Tests Via SDK. Then, when the 400 Critical summaries for OpenAI are created and uploaded as a run, the appropriate test session comparing the two is automatically conducted. It should look like the following:
As we can see, this test session shows all assertions passing, which informs us that the Critical prompt in OpenAI outperforms the Critical prompt in Mistral and that none of the minimum quality standards are violated. At this point, OpenAI - Critical can be made the new baseline in our testing protocols.
After this Critical summary, the team takes turns generating an Analytical summary, a Narrative summary, and a Basic summary, testing each in both OpenAI and Mistral. This process is encapsulated in the list of completed test sessions below.
When a test session fails, we know that either the performance is not superior to the baseline or one of the minimum text quality standards is violated. In particular, it can be seen below that the OpenAI - Analytical Summary does have superior performance to the baseline, but it does not meet the minimum toxicity standards.
By the end of this iterative prompt engineering process, the OpenAI - Basic Summary is the winner, and is the candidate considered for production. Mistral - Basic Summary was considered the second-best candidate, which will be relevant during our integration testing discussion.
These are the objects and compute resources used to design this summarization app.
In this tutorial, we will demonstrate how to use dbnl to automatically evaluate the consistency of summarization output on a fixed set of documents. We will start by answering the question of why one would be concerned about the consistency of generated text. We will cover generating summaries from self-hosted and third-party language model providers. We will review the trade-offs of using automatic metrics and then proceed to generate and store them. We will then demonstrate how to use dbnl in an Integration Test setting to regularly evaluate the consistency of the generated summaries. Finally, we will discuss some limitations of the approach and how to mitigate them.
The primary motivation for evaluating the consistency of generated text is that Large Language Models (LLMs) are inherently stochastic: different prompts and different LLMs yield different results, and even the same prompt with the same LLM will yield different results across calls. This is especially apparent with third-party LLM providers (e.g., OpenAI's GPT-4), where the underlying model can change and the only way to evaluate the output is through the generated text.
Managing this complexity across a dozen or more metrics is a difficult task. dbnl was designed especially to help manage this complexity and to provide a stateful system of record for the granular, distributional evaluation of stochastic systems.
We are going to walk through the main components of the system with some example code snippets. That is, we will cover the full dataflow, from raw documents to summary generation, summary evaluation, submission of evaluation metrics to dbnl, and integration testing, but the code snippets will be illustrative rather than exhaustive.
Suppose that there is a business use case to summarize financial documents from the SEC. For illustrative purposes, we constructed an example of raw documents from the SEC Financial Statements and Notes Data Sets website.
We will use the ChatCompletion API standard to generate summaries daily. The ChatCompletion API is a simple API that can be used to generate summaries from a given prompt. Roughly speaking, four things are necessary to generate a summary:
the system prompt
the summary prompt
the raw document to be summarized
the LLM endpoint
The prompt used to generate the summary is a combination of the first three items above, and the endpoint used can be a self-hosted LLM or one available via OpenAI, Azure OpenAI, Amazon Bedrock, or any number of other LLM providers (e.g., Together AI). Bear in mind privacy and security concerns when using third-party LLMs.
Then we can use the following code snippet to generate a summary using GPT-3.5 Turbo 16k (the 16k refers to the context length) hosted on Microsoft Azure. Note that this example uses the ChatCompletion API standard with function calling and is applied to other endpoints later in this tutorial.
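The original snippet is not reproduced here; below is a minimal sketch of such a call using the openai Python client against an Azure OpenAI deployment. The endpoint and key environment variables, the deployment name, and the generate_summary helper are placeholders, and the function-calling details mentioned above are omitted for brevity.

```python
import os
from openai import AzureOpenAI

# Placeholder Azure OpenAI configuration; substitute your own resource values.
client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-02-01",
)


def generate_summary(system_prompt: str, summary_prompt: str, raw_document: str) -> str:
    """Combine the system prompt, summary prompt, and raw document into one ChatCompletion call."""
    response = client.chat.completions.create(
        model="gpt-35-turbo-16k",  # name of your Azure deployment
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": f"{summary_prompt}\n\nRaw Document: {raw_document}"},
        ],
        temperature=0.2,
    )
    return response.choices[0].message.content
```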
Examples of other models that offer the ChatCompletion API include OpenAI's GPT-4, Anthropic's Claude models, several of Mistral AI's models, and other open-source models.
The prompt is the easiest part of the summary generation pipeline to alter. Selection of prompts should be more than a simple "vibe check" and should be based on an analysis that is omitted from this tutorial but will appear in future tutorials.
Below are examples for system prompts and summary prompts:
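The original examples are not reproduced here; the placeholders below only illustrate their shape. The summary prompt reuses the Critical prompt from the prompt engineering section, while the system prompt text is an assumption.

```python
# Illustrative placeholders. The system prompt text is an assumption; the
# summary prompt reuses the Critical prompt from the prompt engineering section.
SYSTEM_PROMPTS = {
    "financial_analyst": (
        "You are a careful financial analyst. Summarize SEC filings accurately "
        "and avoid speculation beyond what the document states."
    ),
}

SUMMARY_PROMPTS = {
    "critical": (
        "Offer a summary that evaluates the strengths and weaknesses of the "
        "arguments in this document: Raw Document: {raw_document}."
    ),
}
```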
There are over a dozen frameworks and libraries out there which offer automatic metrics and dataflow assistance for generating summaries and processing their evaluations. In this tutorial we generate some custom code, but dbnl can be used in conjunction with any of these frameworks.
The next step is to evaluate the generated summary and we will use LLMs to do so. In fact, we will be asking an LLM to evaluate the summary generated by another LLM and use that as a proxy for human evaluation. This is a common practice in the field of summarization and is a good way to get a sense of the quality of the generated summary.
In our tutorial, we use both generic text metrics (which don't require any knowledge of the raw document and can be applied to any text) and summary metrics which assess the quality of the summary and require the raw document. All metrics are on a Likert scale, requesting a number between 1 and 5, where higher is "better".
We enforce that the API response falls into this range by leveraging the pydantic and instructor libraries. Behind the scenes, it's just using the requests library to make the API call, and an example API request is shown below.
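As a rough sketch of that pattern (the evaluator model and the subset of metric fields shown are illustrative), a pydantic schema with 1-5 bounds can be passed as the response_model of an instructor-wrapped client:

```python
import instructor
from openai import OpenAI
from pydantic import BaseModel, Field


class SummaryEvaluation(BaseModel):
    # Likert-scale metrics; the ge/le bounds force responses into the 1-5 range.
    summary_coherence_score: int = Field(ge=1, le=5)
    content_relevance_score: int = Field(ge=1, le=5)
    text_fluency_score: int = Field(ge=1, le=5)
    text_toxicity_score: int = Field(ge=1, le=5)


# instructor wraps the OpenAI client so the response is parsed and validated
# against the pydantic schema.
client = instructor.from_openai(OpenAI())


def evaluate_summary(raw_document: str, summary: str) -> SummaryEvaluation:
    return client.chat.completions.create(
        model="gpt-3.5-turbo",  # illustrative evaluator model
        response_model=SummaryEvaluation,
        messages=[
            {"role": "system", "content": "You grade summaries on a 1-5 Likert scale."},
            {"role": "user", "content": f"Document:\n{raw_document}\n\nSummary:\n{summary}"},
        ],
    )
```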
You define the schema of the columns present in the test data through your run configuration. This is important to ensure that the data is in the correct format before it is sent to dbnl.
Notice we are also designing this to capture latency, in addition to all the metrics described above. This is important for understanding the performance of the system and can be used to detect performance changes or degradation over time.
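For instance, latency can be captured by timing the generation call (generate_summary is the hypothetical helper sketched earlier):

```python
import time

start = time.perf_counter()
summary = generate_summary(system_prompt, summary_prompt, raw_document)
latency_ms = (time.perf_counter() - start) * 1000.0  # stored alongside the metric columns
```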
Similar to the Hello World and the Trading Strategy tutorials, we will push the runs to dbnl. The code snippet below demonstrates how to push the runs to dbnl once the test data is prepared.
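The snippet itself is not reproduced here, but the sketch below follows the same push pattern as those tutorials. The dbnl function names and arguments shown (create_run_config, create_run, report_results, close_run, and the project name) are assumptions for illustration; consult the dbnl SDK reference for the exact API.

```python
import os

import dbnl
import pandas as pd

# One row per summarized document: the generated summary, the 1-5 metric
# columns described above, and the measured latency.
results_df = pd.DataFrame(
    {
        "generated_summary": ["..."],
        "summary_coherence_score": [4],
        "content_relevance_score": [5],
        "latency_ms": [812.0],
        # ... remaining metric columns ...
    }
)

# NOTE: the dbnl function names and arguments below are assumptions for
# illustration; check the dbnl SDK reference for the exact API.
dbnl.login(api_token=os.environ["DBNL_API_TOKEN"])

run_config = dbnl.create_run_config(
    columns=[
        {"name": "generated_summary", "type": "string"},
        {"name": "summary_coherence_score", "type": "int"},
        {"name": "content_relevance_score", "type": "int"},
        {"name": "latency_ms", "type": "float"},
    ],
)

run = dbnl.create_run(project_name="llm-text-summarization", run_config=run_config)
dbnl.report_results(run=run, data=results_df)
dbnl.close_run(run=run)
```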
After our summarization app is deployed, we conduct nightly integration tests to confirm its continued acceptable behavior.
Continually testing the deployed app for consistency can help ensure that its behavior continues to match the expectation set at deployment. In this situation, we are considering our OpenAI - Basic Summary, which was the winner of the constrained optimization; in this new, integration-testing-focused project we create a run recording its behavior on the day of the deployment and title it Day 01 - Best LLM. We set this as the baseline of our project so that incoming runs, conducted on a daily basis, immediately trigger the integration tests and create a new test session.
The tests under consideration are tagged into 3 groups:
Parametric tests of summary quality, which confirm that the median of each of the 6 summary quality distributions matches the baseline;
Parametric tests of text quality, which confirm that the minimum viable behavior of the 4 text quality distributions enforced during the prompt engineering process is maintained (these are the exact same tests from earlier); and
Nonparametric tests for consistency, using a scaled version of the Chi-squared statistic, on all 10 columns, confirming no significant change in distributions (at the 0.25 significance level).
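These tests are computed by dbnl itself; to build intuition for the nonparametric check, here is a rough sketch of a plain two-sample chi-squared comparison between the baseline run and a new run on a single Likert-scored column (the scaled statistic dbnl uses may differ).

```python
import numpy as np
from scipy.stats import chi2_contingency


def scores_consistent(baseline: np.ndarray, candidate: np.ndarray, alpha: float = 0.25) -> bool:
    """Return True if the candidate's 1-5 score distribution is consistent with the baseline."""
    bins = np.arange(1, 7)  # bin edges covering the Likert values 1..5
    base_counts, _ = np.histogram(baseline, bins=bins)
    cand_counts, _ = np.histogram(candidate, bins=bins)

    # Two-sample chi-squared test on the 2x5 contingency table of counts.
    table = np.vstack([base_counts, cand_counts])
    table = table[:, table.sum(axis=0) > 0]  # drop Likert values unseen in both runs
    _, p_value, _, _ = chi2_contingency(table)
    return p_value >= alpha  # the assertion fails when p < 0.25
```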
When nightly runs are submitted to dbnl, the completed test sessions will appear on the project page, as shown below.
Each day that an integration test passes, we can be assured that no significant departure from expected behavior has occurred. On day 7, however, some aberrant behavior was observed; the failed assertions are shown in the figure below.
When we click into each of these failed assertions, we can see that the deviation is not massive, but it is enough to trigger a failure at the 0.25 statistical significance level. A sample is in the figure below.
At this point, several possible options exist.
Adjust the tests - The team could decide that, upon further review of the individual summaries, this is actually acceptable behavior and adjust the statistical thresholds accordingly (to, e.g., 0.35).
Manually create more data - The team could gather more information by immediately rerunning the 400 summarization prompts manually, producing a new run and a resulting test session;
Wait for more data in the current workflow - If this app were low risk, the team could wait until tomorrow for another nightly run and review those results when they come in; or
Immediately change apps - The team could deploy an alternate, acceptably performing app into production while offline analysis of this current app is conducted.
In this project, our team takes the 4th option and immediately swaps out the best LLM for the second best LLM from our king-of-the-hill prompt engineering. The day 08 integration test shows that this app still exhibits satisfactory behavior (even if the expected performance may be slightly lower, as observed during prompt engineering). At this point, the second best LLM from day 08 becomes the new baseline, while the previously best LLM is studied offline to better understand this change in behavior.