Setting the Scene

These are the objects and compute resources used to design this summarization app.

Overview

In this tutorial, we will demonstrate how to use dbnl to automatically evaluate the consistency of summarization output on a fixed set of documents. We will start by answering the question of why one would be concerned about the consistency of generated text. We will cover generating summaries from self-hosted and third-party language model providers. We will review the trade-offs of using automatic metrics and then proceed to generate and store them. We will then demonstrate how to use dbnl in an Integration Test setting to regularly evaluate the consistency of the generated summaries. Finally, we will discuss some limitations of the approach and how to mitigate them.

Motivation

The primary motivation for evaluating the consistency of generated text is that Large Language Models (LLMs) are inherently stochastic: different prompts and different LLMs yield different results, and even the same prompt sent to the same LLM can yield different results across calls. This is especially apparent with third-party LLM providers (e.g. OpenAI's GPT-4), where the underlying model can change and the only way to evaluate the output is through the generated text.

Managing this complexity across a dozen or more metrics is a difficult task. dbnl was designed specifically to help manage this complexity and to provide a stateful system of record for the granular, distributional evaluation of stochastic systems.

Implementation Details

We are going to walk through the main components of the system with example code snippets. The full dataflow (raw documents, summary generation, summary evaluation, submitting evaluation metrics to dbnl, and integration testing) will be covered, but the code snippets are illustrative rather than exhaustive.

Raw Documents

Suppose that there is a business use case to summarize financial documents from the SEC. For illustrative purposes, we constructed an example of raw documents from the SEC Financial Statements and Notes Data Sets website.

Excerpt from a Raw Document
Swap Risks The Fund and the Underlying Funds may enter into various swap agreements and, other than total return swap agreements (as discussed herein), such agreements are expected to be utilized by the Fund, if at all, for hedging purposes. All of these agreements are considered derivatives. Swap agreements are two-party contracts under which the fund and a counterparty, such as a broker or dealer, agree to exchange the returns (or differentials in rates of return) earned or realized on an agreed-upon underlying asset or investment over the term of the swap. The use of swap transactions is a highly specialized activity which involves strategies and risks different from those associated with ordinary portfolio security transactions. If the Adviser, Subadviser or an Underlying Funds investment adviser is incorrect in its forecasts of default risks, market spreads, liquidity or other applicable factors or events...

Summary Generation

We will use the ChatCompletion API standard to generate summaries daily. The ChatCompletion API is a simple API that can be used to generate a summary from a given prompt. Roughly speaking, four things are necessary to generate a summary:

  • the system prompt

  • the summary prompt

  • the raw document to be summarized

  • the LLM endpoint

The prompt used to generate the summary is a combination of the first three items above, and the endpoint can be a self-hosted LLM or a model available via OpenAI, Azure OpenAI, Amazon Bedrock, or any number of other LLM providers (e.g. Together AI). Bear in mind privacy and security concerns when using third-party LLMs.

Example of ChatCompletion Messages Payload
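
As a rough illustration, such a messages payload might look like the sketch below; the prompt wording and the `raw_document` variable are placeholders, not the tutorial's actual prompts.

```python
# Illustrative ChatCompletion messages payload: system prompt, summary prompt,
# and the raw document combined into a single user message.
raw_document = "Swap Risks The Fund and the Underlying Funds may enter into ..."  # excerpt placeholder

messages = [
    {
        "role": "system",
        "content": "You are a financial analyst assistant that writes faithful, concise summaries.",
    },
    {
        "role": "user",
        "content": (
            "Summarize the following document, preserving the key risks and figures.\n\n"
            f"DOCUMENT:\n{raw_document}"
        ),
    },
]
```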

Then we can use the following code snippet to generate a summary using GPT-3.5 Turbo 16k (where 16k refers to the context window length) hosted on Microsoft Azure. Note that this example uses the ChatCompletion API standard with function calling; the same pattern is applied to other endpoints later in this tutorial.

Examples of other models that offer the ChatCompletion API (or an OpenAI-compatible one) include OpenAI's GPT-4, Anthropic's Claude models, several of Mistral AI's models, and other open-source models.

Implementation of `generate_summary`
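
As a rough sketch, `generate_summary` might look like the following, assuming the `openai` Python client and an Azure OpenAI deployment; the deployment name, API version, and environment variable names are assumptions, and the function-calling pieces mentioned above are omitted for brevity.

```python
import os
from openai import AzureOpenAI

# Assumed client setup for an Azure OpenAI deployment of GPT-3.5 Turbo 16k;
# the endpoint, key variables, and API version are illustrative.
client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-02-01",
)


def generate_summary(system_prompt: str, summary_prompt: str, raw_document: str) -> str:
    """Generate a summary for a single raw document via the ChatCompletion API."""
    response = client.chat.completions.create(
        model="gpt-35-turbo-16k",  # Azure deployment name (illustrative)
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": f"{summary_prompt}\n\n{raw_document}"},
        ],
        temperature=0.0,  # reduce (but not eliminate) variation across calls
    )
    return response.choices[0].message.content
```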

Prompt Engineering

The prompt is the easiest part of the summary generation pipeline to alter. Selecting a prompt should involve more than a simple "vibe check": it should be based on analysis, which is omitted from this tutorial but will be covered in future tutorials.

Below are examples for system prompts and summary prompts:

Examples of a System Prompt
Examples of a Summary Prompt
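
For a sense of what these look like, a hypothetical pair is sketched below; the wording is illustrative only and is not the prompts used in this tutorial.

```python
# Hypothetical prompt text for illustration only.
SYSTEM_PROMPT = (
    "You are a careful financial analyst. You summarize SEC filings accurately, "
    "without adding information that is not in the source document."
)

SUMMARY_PROMPT = (
    "Summarize the document below in 3-5 sentences. Preserve the key risks, "
    "figures, and named entities. Do not speculate."
)
```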

Frameworks

There are over a dozen frameworks and libraries that offer automatic metrics and dataflow assistance for generating summaries and processing their evaluations. In this tutorial we write some custom code, but dbnl can be used in conjunction with any of these frameworks.

Summary Evaluation

The next step is to evaluate the generated summary and we will use LLMs to do so. In fact, we will be asking an LLM to evaluate the summary generated by another LLM and use that as a proxy for human evaluation. This is a common practice in the field of summarization and is a good way to get a sense of the quality of the generated summary.

In our tutorial, we use both generic text metrics (which don't require any knowledge of the raw document and can be applied to any text) and summary metrics (which assess the quality of the summary and require the raw document). All metrics are on a Likert scale: the evaluator is asked for a number between 1 and 5, where higher is "better".

We enforce that the API response conforms to this scale by leveraging the pydantic and instructor libraries. Behind the scenes, this is just using the requests library to make the API call; an example API request is shown below.
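
A minimal sketch of that enforcement pattern is shown below, assuming the `instructor` and `pydantic` libraries with an OpenAI-compatible client; the evaluator model name and metric prompt are illustrative, and the exact instructor calls may differ by version.

```python
import instructor
from openai import OpenAI
from pydantic import BaseModel, Field


class LikertScore(BaseModel):
    """Structured 1-5 rating returned by the evaluator LLM."""

    score: int = Field(..., ge=1, le=5, description="1 (worst) to 5 (best)")


# instructor patches the OpenAI client so responses are parsed and validated
# against the pydantic model above, retrying if the output does not conform.
client = instructor.from_openai(OpenAI())


def evaluate(metric_prompt: str, text: str) -> int:
    """Ask the evaluator LLM for a 1-5 score on one metric."""
    result = client.chat.completions.create(
        model="gpt-4",  # evaluator model; illustrative choice
        response_model=LikertScore,
        messages=[
            {"role": "system", "content": metric_prompt},
            {"role": "user", "content": text},
        ],
    )
    return result.score
```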

Generic Text Metric Evaluation Prompts
Summary Metric Evaluation Prompts
Example Payload to LLM for Summary Evaluation
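
For reference, a raw request along those lines might be assembled as in the sketch below; the endpoint path, deployment name, API version, prompt text, and the document/summary placeholders are assumptions for an Azure OpenAI-style chat completions endpoint.

```python
import os
import requests

# Placeholders standing in for a real document/summary pair.
raw_document = "Swap Risks The Fund and the Underlying Funds may enter into ..."
summary = "The Fund may use swap agreements, chiefly for hedging, which carry counterparty and forecasting risks."

url = (
    f"{os.environ['AZURE_OPENAI_ENDPOINT']}"
    "/openai/deployments/gpt-4/chat/completions?api-version=2024-02-01"
)
payload = {
    "messages": [
        {
            "role": "system",
            "content": "Rate how faithful the summary is to the document on a 1-5 scale.",
        },
        {
            "role": "user",
            "content": f"DOCUMENT:\n{raw_document}\n\nSUMMARY:\n{summary}",
        },
    ],
    # Function calling constrains the response to a 1-5 integer score.
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "record_score",
                "description": "Record the 1-5 Likert rating.",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "score": {"type": "integer", "minimum": 1, "maximum": 5}
                    },
                    "required": ["score"],
                },
            },
        }
    ],
    "tool_choice": {"type": "function", "function": {"name": "record_score"}},
    "temperature": 0,
}
response = requests.post(
    url, headers={"api-key": os.environ["AZURE_OPENAI_API_KEY"]}, json=payload
)
```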

Submitting Evaluation Metrics to dbnl

Run Configuration

You define the schema of the columns present in the test data through your run configuration. This is important to ensure that the data is in the correct format before it is sent to dbnl.

Notice that we are also capturing latency, in addition to all the metrics described above. This is important for understanding the performance of the system and can be used to detect performance changes or degradation over time.

Run Configuration (run_config.json)
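
As an illustration of the kind of schema such a configuration captures, the columns might resemble the sketch below; the column names and types are assumptions meant to mirror the latency measurement and Likert metrics described above.

```python
# Hypothetical run configuration columns; names and types are assumptions.
run_config_columns = [
    {"name": "document_id", "type": "string"},
    {"name": "generated_summary", "type": "string"},
    {"name": "fluency", "type": "int"},          # generic text metric, 1-5
    {"name": "coherence", "type": "int"},        # generic text metric, 1-5
    {"name": "faithfulness", "type": "int"},     # summary metric, 1-5
    {"name": "relevance", "type": "int"},        # summary metric, 1-5
    {"name": "generation_latency_s", "type": "float"},  # wall-clock seconds
]
```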

Push the runs to dbnl

As in the Hello World and the Trading Strategy tutorials, we will push the runs to dbnl. The code snippet below demonstrates how to do so once the test data is prepared.

Pushing completed runs to dbnl
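
The overall shape of that snippet is sketched below. The exact dbnl SDK function and argument names vary by version, so treat this as an assumed outline and refer to the Hello World tutorial for the canonical calls; the project name, environment variable, and placeholder result row are illustrative.

```python
import os

import dbnl
import pandas as pd

# Authenticate with dbnl (token location is an assumption).
dbnl.login(api_token=os.environ["DBNL_API_TOKEN"])

# One placeholder row of prepared test data; in practice this frame holds one
# row per summarized document, with the columns declared in the run config.
results_df = pd.DataFrame(
    [
        {
            "document_id": "doc-001",
            "generated_summary": "The Fund may use swap agreements, chiefly for hedging ...",
            "fluency": 5,
            "coherence": 4,
            "faithfulness": 4,
            "relevance": 5,
            "generation_latency_s": 2.3,
        }
    ]
)

# Create a run against the project and run configuration, report the prepared
# results, and close the run so dbnl can evaluate it.
project = dbnl.get_or_create_project(name="summarization-consistency")
run_config = dbnl.create_run_config(columns=run_config_columns)  # see the Run Configuration sketch above
run = dbnl.create_run(project=project, run_config=run_config)
dbnl.report_results(run=run, data=results_df)
dbnl.close_run(run=run)
```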
