Setting the Scene
These are the objects and compute resources used to design this summarization app.
Overview
In this tutorial, we will demonstrate how to use dbnl to automatically evaluate the consistency of summarization output on a fixed set of documents. We will start by answering the question of why one should be concerned about the consistency of generated text. We will cover generating summaries from self-hosted and third-party language model providers. We will review the trade-offs of using automatic metrics and then proceed to generate and store them. We will then demonstrate how to use dbnl in an Integration Test setting to regularly evaluate the consistency of the generated summaries. Finally, we will discuss some limitations of the approach and how to mitigate them.
Motivation
The primary motivation for evaluating the consistency of generated text is that Large Language Models (LLMs) are inherently stochastic: different prompts and different LLMs yield different results, and even the same prompt sent to the same LLM can yield different results from one call to the next. This is especially apparent with third-party LLM providers (e.g. OpenAI's GPT-4), where the underlying model can change over time and the only way to evaluate the output is through the generated text.
Managing this complexity across a dozen or more metrics is a difficult task. dbnl was designed specifically to help manage this complexity and to provide a stateful system of record for the granular, distributional evaluation of stochastic systems.
Implementation Details
We are going to walk through the main components of the system with example code snippets. We will cover the full dataflow, from raw documents to summary generation, summary evaluation, submitting evaluation metrics to dbnl, and integration testing, but the code snippets will be illustrative rather than exhaustive.
Raw Documents
Suppose that there is a business use case to summarize financial documents from the SEC. For illustrative purposes, we constructed an example set of raw documents from the SEC Financial Statements and Notes Data Sets website.
Summary Generation
We will use the ChatCompletion API standard to generate summaries daily. The ChatCompletion API is a simple API that can be used to generate summaries from a given prompt. Roughly speaking, generating a summary requires four things:
the system prompt
the summary prompt
the raw document to be summarized
the LLM endpoint
The prompt used to generate the summary is a combination of the first three items above, and the endpoint can be a self-hosted LLM or one available via OpenAI, Azure OpenAI, Amazon Bedrock, or any number of other LLM providers (such as Together AI). Bear in mind privacy and security concerns when using third-party LLMs.
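As a concrete illustration, the sketch below shows one way these pieces might be combined into the messages payload expected by the ChatCompletion API. The build_messages helper and its argument names are illustrative, not part of the dbnl SDK or the tutorial's actual code.

# Illustrative helper: combines the system prompt, summary prompt, and raw
# document into the messages list expected by the ChatCompletion API.
def build_messages(system_prompt: str, summary_prompt: str, raw_document: str) -> list[dict]:
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": f"{summary_prompt}\n\n{raw_document}"},
    ]

# system_prompt, summary_prompt, and raw_document are placeholders here;
# example prompts appear in the Prompt Engineering section below.
messages = build_messages(system_prompt, summary_prompt, raw_document)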
Then we can use the following code snippet to generate a summary using GPT-3.5 Turbo 16k (where 16k refers to the length of the context window) hosted on Microsoft Azure. Note that this example uses the ChatCompletion API standard with function calling and is applied to other endpoints later in this tutorial.
Examples of other models that offer the ChatCompletion API are OpenAI's GPT-4, Anthropic's Claude models, several of Mistral AI's models, and other open-source models.
# messages is a list of dictionaries (example above)
response, latency_ms = await generate_summary(messages, llm_key, sem)
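For completeness, here is a minimal sketch of what a generate_summary helper might look like, assuming the openai Python SDK's AsyncAzureOpenAI client and an asyncio.Semaphore to limit concurrency; the endpoint, API version, and deployment name are placeholders, and the function-calling details mentioned above are omitted for brevity.

import asyncio
import time
from openai import AsyncAzureOpenAI

# Placeholder Azure OpenAI client configuration.
client = AsyncAzureOpenAI(
    azure_endpoint="https://<your-resource>.openai.azure.com",
    api_key="<your-api-key>",
    api_version="2024-02-01",
)

async def generate_summary(messages: list[dict], llm_key: str, sem: asyncio.Semaphore) -> tuple[str, float]:
    # llm_key is the Azure deployment name, e.g. the GPT-3.5 Turbo 16k deployment.
    async with sem:
        start = time.perf_counter()
        completion = await client.chat.completions.create(model=llm_key, messages=messages)
        latency_ms = (time.perf_counter() - start) * 1000
    return completion.choices[0].message.content, latency_ms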
Prompt Engineering
The prompt is the easiest part of the summary generation pipeline to alter. Selection of prompts should be more than a simple "vibe check": it should be based on analysis, which is omitted from this tutorial but will be covered in future tutorials.
Below are examples for system prompts and summary prompts:
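For instance, a system prompt and summary prompt for this use case might look like the following; the wording here is illustrative only and is not the exact prompt text used in this tutorial.

# Illustrative prompts only; the exact wording used in this tutorial may differ.
system_prompt = (
    "You are an expert financial analyst who writes faithful, concise summaries "
    "of SEC filings for a business audience."
)
summary_prompt = (
    "Summarize the key financial results, risks, and notable disclosures in the "
    "following document in no more than five sentences. Do not include any "
    "information that is not present in the document."
)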
Frameworks
There are over a dozen frameworks and libraries that offer automatic metrics and dataflow assistance for generating summaries and processing their evaluations. In this tutorial we use some custom code, but dbnl can be used in conjunction with any of these frameworks.
Summary Evaluation
The next step is to evaluate the generated summary, and we will use LLMs to do so. In fact, we will ask an LLM to evaluate the summary generated by another LLM and use that as a proxy for human evaluation. This is a common practice in the field of summarization and is a good way to get a sense of the quality of the generated summary.
In our tutorial, we use both generic text metrics (which don't require any knowledge of the raw document and can be applied to any text) and summary metrics which assess the quality of the summary and require the raw document. All metrics are on a Likert scale, requesting a number between 1 and 5, where higher is "better".
We enforce that the API response falls within this range by leveraging the pydantic and instructor libraries. Behind the scenes, the API call is just made with the requests library; an example API request is shown below.
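A minimal sketch of this pattern, assuming a recent version of the instructor library (instructor.from_openai; older versions use instructor.patch) and a pydantic model that constrains the score to the 1-to-5 Likert range; the metric and prompt wording here are illustrative rather than the tutorial's exact metrics.

import instructor
from openai import OpenAI
from pydantic import BaseModel, Field

class LikertScore(BaseModel):
    # Constrain the judge's answer to the 1-5 Likert range.
    score: int = Field(ge=1, le=5, description="1 is worst, 5 is best")

# instructor wraps the OpenAI client so responses are parsed and validated
# against the pydantic response_model.
judge = instructor.from_openai(OpenAI())

def score_summary_faithfulness(raw_document: str, summary: str) -> int:
    result = judge.chat.completions.create(
        model="gpt-4",
        response_model=LikertScore,
        messages=[
            {"role": "system", "content": "You grade summaries of financial documents."},
            {
                "role": "user",
                "content": (
                    "On a scale of 1 to 5, how faithfully does the summary reflect "
                    f"the document?\n\nDocument:\n{raw_document}\n\nSummary:\n{summary}"
                ),
            },
        ],
    )
    return result.score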
Submitting Evaluation Metrics to dbnl
Run Configuration
You define the schema of the columns present in the test data through your run configuration. This is important to ensure that the data is in the correct format before it is sent to dbnl.
Notice that we are also designing this to capture latency, in addition to all the metrics described above. This is important for understanding the performance of the system and can be used to detect performance changes or degradation over time.
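As a rough sketch, assuming the dbnl SDK interface used in the Hello World tutorial (dbnl.login, dbnl.get_or_create_project, dbnl.create_run_config), the run configuration might declare one column per evaluation metric plus the latency column; the column names are illustrative and the exact signatures and supported column types should be checked against the SDK reference.

import dbnl

dbnl.login(api_token="<your-dbnl-api-token>")
project = dbnl.get_or_create_project(name="summarization_consistency")

# One column per evaluation metric, plus latency, for each generated summary.
run_config = dbnl.create_run_config(
    project=project,
    columns=[
        {"name": "text_fluency", "type": "int"},
        {"name": "summary_coherence", "type": "int"},
        {"name": "summary_faithfulness", "type": "int"},
        {"name": "latency_ms", "type": "float"},
    ],
)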
Push the runs to dbnl
Similar to the Hello World and the Trading Strategy tutorials, we will push the runs to dbnl. The code snippet below demonstrates how to push the runs to dbnl once the test data is prepared.
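Under the same assumptions about the dbnl SDK as in the sketch above, pushing a run might look like the following, where test_df is a pandas DataFrame with one row per document and one column for each item declared in the run configuration.

# test_df: pandas DataFrame whose columns match the run configuration above.
run = dbnl.create_run(project=project, run_config=run_config)
dbnl.report_column_results(run=run, data=test_df)
dbnl.close_run(run=run)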