Our tutorial focuses on the minimum factors required to facilitate testing, but here we discuss the complexity of an actual process.
Remember that the code snippets and API requests provided above are illustrative, offering a high-level overview. The actual implementation will depend on the specific requirements of your project. The prompts used here are for demonstration purposes; effective prompt generation often necessitates domain expertise and careful analysis.
Evaluating the consistency of generated text involves significant complexity. The combined use of various metrics in a single API request, as demonstrated, might not always be ideal. Depending on your specific use case, it may be more effective to evaluate each metric separately to gain a more detailed understanding of the performance and quality of the summaries.
Distributional's framework is designed to handle these complexities by providing a systematic approach to measure and analyze the stochastic nature of LLMs. This allows for the detection of non-stationary shifts in third-party applications, ensuring that changes or degradations in performance are identified promptly and accurately. By setting appropriate tests and thresholds for assertions, users can monitor and validate the consistency and quality of LLM outputs, supporting the ongoing maintenance and improvement of these models.
In this advanced tutorial, we demonstrate how to use dbnl to automatically evaluate the consistency of summarization output on a fixed set of documents.
The data required for this tutorial is available in the following files.
While summarization is the focus of this tutorial, the same principles can be applied to any task involving text generation. The goal is to evaluate the consistency of the generated text with the input text. Other tasks involving text generation include entity recognition, question answering, and machine translation.
This tutorial assumes that you have already completed the following tutorials: Hello World (Sentiment Classifier) and, ideally, Trading Strategy.
This tutorial requires a good deal of preparation, so it has been divided into the following four sections:
Defining the text summarization problem of interest, including the data source and the metrics,
Creating a constrained optimization problem to govern the development of a text summarization app in dbnl,
Managing the integration testing process for consistent testing after such an app has been created, and
Discussing practical considerations that would arise when actually building an LLM summarization tool.
We execute a king-of-the-hill constrained optimization process to identify a high performing, suitably coherent summarization app.
Our goal is to build a summarization app “from scratch,” which will require all of the structure defined in the earlier Setting the Scene discussion. It will also require Prompt Engineering, which is conducted in this section. In particular, our app development team has been told to conduct prompt engineering iteratively. The team is also told to consider both Mistral-7b and OpenAI’s GPT-3.5 as possible LLM engines.
As can be seen in the run config for these runs, there is a column called summary_total_score, which is the sum of the 6 summary quality columns: summary_coherence_score, content_relevance_score, bias_assessment_score, logical_sequencing_score, essential_information_capture_score, and bias_evaluation_score. This column is the quantity by which we define the performance of our text summarization app; in particular, the king-of-the-hill winner during each prompt engineering comparison will be the prompt+LLM with the higher average summary_total_score.
In addition to this summary score, we also impose 4 checks on text viability (not summary quality) that all possible apps must satisfy before being considered: the 10th percentile of each of the following columns must be greater than 1 (recall these are evaluated on a 1-5 scale): text_fluency_score, text_toxicity_score, sentiment_assessment_score, and reading_complexity_score. This assures us that none of our possible apps violates expectations around, e.g., toxicity, even if it has high summarization performance.
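As a concrete illustration of these two quantities, the sketch below computes them with pandas from a run's test data; the DataFrame df is a hypothetical stand-in for that data, but the column names are the ones listed above.

```python
import pandas as pd

# The 6 summary quality columns that sum to summary_total_score.
SUMMARY_QUALITY_COLS = [
    "summary_coherence_score",
    "content_relevance_score",
    "bias_assessment_score",
    "logical_sequencing_score",
    "essential_information_capture_score",
    "bias_evaluation_score",
]

# The 4 text viability columns with a 10th-percentile floor.
TEXT_VIABILITY_COLS = [
    "text_fluency_score",
    "text_toxicity_score",
    "sentiment_assessment_score",
    "reading_complexity_score",
]


def score_candidate(df: pd.DataFrame) -> dict:
    """Compute the king-of-the-hill objective and the viability constraint."""
    total = df[SUMMARY_QUALITY_COLS].sum(axis=1)        # summary_total_score per row
    viability = df[TEXT_VIABILITY_COLS].quantile(0.10)  # 10th percentile of each column
    return {
        "avg_summary_total_score": total.mean(),        # higher average wins
        "passes_viability_checks": bool((viability > 1).all()),
    }
```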
To start the process, the team is told to create a “Critical Summary.” After some deliberation, they develop the following prompt:
Offer a summary that evaluates the strengths and weaknesses of the arguments in this document: Raw Document: {raw_document}.
After creating 400 Critical summaries for Mistral-7b, the associated run is pushed to dbnl; it is then set as a baseline, either in code or as described in Executing Tests Via SDK. Then, when the 400 Critical summaries for OpenAI are created and uploaded as a run, the appropriate test session comparing the two is automatically conducted. It should look like the following:
As we can see, this test session shows all assertions passing, which informs us that the Critical prompt in OpenAI outperforms the Critical prompt in Mistral and that none of the minimum quality standards are violated. At this point, OpenAI - Critical can be made the new baseline in our testing protocols.
After this Critical summary, the team takes turns generating an Analytical summary, a Narrative summary, and a Basic summary, testing each in both OpenAI and Mistral. This process is encapsulated in the list of completed test sessions below.
When a test session fails, we know that either the performance is not superior to the baseline or one of the minimum text quality standards is violated. In particular, it can be seen below that the OpenAI - Analytical Summary does have superior performance to the baseline, but it does not meet the minimum toxicity standards.
By the end of this iterative prompt engineering process, the OpenAI - Basic Summary is the winner, and is the candidate considered for production. Mistral - Basic Summary was considered the second-best candidate, which will be relevant during our integration testing discussion.
These are the objects and compute resources used to design this summarization app.
In this tutorial, we will demonstrate how to use dbnl to automatically evaluate the consistency of summarization output on a fixed set of documents. We will start by answering the question of why one would be concerned about the consistency of generated text. We will cover generating summaries from self-hosted and third-party language model providers. We will review the trade-offs of using automatic metrics and then proceed to generate and store them. We will then demonstrate how to use dbnl in an Integration Test setting to regularly evaluate the consistency of the generated summaries. Finally, we will discuss some limitations of the approach and how to mitigate them.
The primary motivation for evaluating the consistency of generated text is that Large Language Models (LLMs) are inherently stochastic: different prompts and different LLMs yield different results, and even the same prompt with the same LLM will yield different results across calls. This is especially apparent with third-party LLM providers (e.g., OpenAI's GPT-4), where the underlying model can change and the only way to evaluate the output is through the generated text.
Managing this complexity across a dozen or more metrics is a difficult task. dbnl was designed especially to help manage this complexity and to provide a stateful system of record for the granular, distributional evaluation of stochastic systems.
We are going to walk through the main components of the system with some example code snippets. That is, we will cover the full dataflow, from raw documents to summary generation, summary evaluation, submission of evaluation metrics to dbnl, and integration testing, but the code snippets will be illustrative rather than exhaustive.
Suppose that there is a business use case to summarize financial documents from the SEC. For illustrative purposes, we constructed an example of raw documents from the SEC Financial Statements and Notes Data Sets website.
We will use the ChatCompletion API standard to generate summaries daily. The ChatCompletion API is a simple API that can be used to generate summaries from a given prompt. Roughly speaking, four things are necessary to generate a summary:
the system prompt
the summary prompt
the raw document to be summarized
the LLM endpoint
The prompt used to generate the summary is a combination of the first three items above, and the endpoint used can be a self-hosted LLM or one available via OpenAI, Azure OpenAI, Amazon Bedrock, or any number of other LLM providers (e.g., Together AI). Bear in mind privacy and security concerns when using third-party LLMs.
Then we can use the following code snippet to generate a summary using GPT-3.5 Turbo 16k (the 16k refers to the context length) hosted on Microsoft Azure. Note that this example uses the ChatCompletion API standard with function calling and is applied to other endpoints later in this tutorial.
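The original snippet is not reproduced here; below is a minimal sketch of such a call using the openai Python client against an Azure OpenAI deployment. The endpoint and key environment variables, the deployment name, and the generate_summary helper are placeholders, and the function-calling details mentioned above are omitted for brevity.

```python
import os
from openai import AzureOpenAI

# Placeholder Azure OpenAI configuration; substitute your own resource values.
client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-02-01",
)


def generate_summary(system_prompt: str, summary_prompt: str, raw_document: str) -> str:
    """Combine the system prompt, summary prompt, and raw document into one ChatCompletion call."""
    response = client.chat.completions.create(
        model="gpt-35-turbo-16k",  # name of your Azure deployment
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": f"{summary_prompt}\n\nRaw Document: {raw_document}"},
        ],
        temperature=0.2,
    )
    return response.choices[0].message.content
```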
Examples of other models that offer the ChatCompletion API include OpenAI's GPT-4, Anthropic's Claude models, several of Mistral AI's models, and other open-source models.
The prompt is the easiest part of the summary generation pipeline to alter. Selection of prompts should be more than a simple "vibe check" and should be based on an analysis that is omitted from this tutorial but will appear in future tutorials.
Below are examples for system prompts and summary prompts:
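The original examples are not reproduced here; the placeholders below only illustrate their shape. The summary prompt reuses the Critical prompt from the prompt engineering section, while the system prompt text is an assumption.

```python
# Illustrative placeholders. The system prompt text is an assumption; the
# summary prompt reuses the Critical prompt from the prompt engineering section.
SYSTEM_PROMPTS = {
    "financial_analyst": (
        "You are a careful financial analyst. Summarize SEC filings accurately "
        "and avoid speculation beyond what the document states."
    ),
}

SUMMARY_PROMPTS = {
    "critical": (
        "Offer a summary that evaluates the strengths and weaknesses of the "
        "arguments in this document: Raw Document: {raw_document}."
    ),
}
```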
There are over a dozen frameworks and libraries out there which offer automatic metrics and dataflow assistance for generating summaries and processing their evaluations. In this tutorial we generate some custom code, but dbnl can be used in conjunction with any of these frameworks.
The next step is to evaluate the generated summary and we will use LLMs to do so. In fact, we will be asking an LLM to evaluate the summary generated by another LLM and use that as a proxy for human evaluation. This is a common practice in the field of summarization and is a good way to get a sense of the quality of the generated summary.
In our tutorial, we use both generic text metrics (which don't require any knowledge of the raw document and can be applied to any text) and summary metrics which assess the quality of the summary and require the raw document. All metrics are on a Likert scale, requesting a number between 1 and 5, where higher is "better".
We enforce that the API response falls into this range by leveraging the pydantic and instructor libraries. Behind the scenes, it's just using the requests library to make the API call, and an example API request is shown below.
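As a rough sketch of that pattern (the evaluator model and the subset of metric fields shown are illustrative), a pydantic schema with 1-5 bounds can be passed as the response_model of an instructor-wrapped client:

```python
import instructor
from openai import OpenAI
from pydantic import BaseModel, Field


class SummaryEvaluation(BaseModel):
    # Likert-scale metrics; the ge/le bounds force responses into the 1-5 range.
    summary_coherence_score: int = Field(ge=1, le=5)
    content_relevance_score: int = Field(ge=1, le=5)
    text_fluency_score: int = Field(ge=1, le=5)
    text_toxicity_score: int = Field(ge=1, le=5)


# instructor wraps the OpenAI client so the response is parsed and validated
# against the pydantic schema.
client = instructor.from_openai(OpenAI())


def evaluate_summary(raw_document: str, summary: str) -> SummaryEvaluation:
    return client.chat.completions.create(
        model="gpt-3.5-turbo",  # illustrative evaluator model
        response_model=SummaryEvaluation,
        messages=[
            {"role": "system", "content": "You grade summaries on a 1-5 Likert scale."},
            {"role": "user", "content": f"Document:\n{raw_document}\n\nSummary:\n{summary}"},
        ],
    )
```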
You define the schema of the columns present in the test data through your run configuration. This is important to ensure that the data is in the correct format before it is sent to dbnl.
Notice we are also designing this to capture latency, in addition to all the metrics described above. This is important for understanding the performance of the system and can be used to detect performance changes or degradation over time.
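For instance, latency can be captured by timing the generation call (generate_summary is the hypothetical helper sketched earlier):

```python
import time

start = time.perf_counter()
summary = generate_summary(system_prompt, summary_prompt, raw_document)
latency_ms = (time.perf_counter() - start) * 1000.0  # stored alongside the metric columns
```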
Similar to the Hello World and the Trading Strategy tutorials, we will push the runs to dbnl. The code snippet below demonstrates how to push the runs to dbnl once the test data is prepared.
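The snippet itself is not reproduced here, but the sketch below follows the same push pattern as those tutorials. The dbnl function names and arguments shown (create_run_config, create_run, report_results, close_run, and the project name) are assumptions for illustration; consult the dbnl SDK reference for the exact API.

```python
import os

import dbnl
import pandas as pd

# One row per summarized document: the generated summary, the 1-5 metric
# columns described above, and the measured latency.
results_df = pd.DataFrame(
    {
        "generated_summary": ["..."],
        "summary_coherence_score": [4],
        "content_relevance_score": [5],
        "latency_ms": [812.0],
        # ... remaining metric columns ...
    }
)

# NOTE: the dbnl function names and arguments below are assumptions for
# illustration; check the dbnl SDK reference for the exact API.
dbnl.login(api_token=os.environ["DBNL_API_TOKEN"])

run_config = dbnl.create_run_config(
    columns=[
        {"name": "generated_summary", "type": "string"},
        {"name": "summary_coherence_score", "type": "int"},
        {"name": "content_relevance_score", "type": "int"},
        {"name": "latency_ms", "type": "float"},
    ],
)

run = dbnl.create_run(project_name="llm-text-summarization", run_config=run_config)
dbnl.report_results(run=run, data=results_df)
dbnl.close_run(run=run)
```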
After our summarization app is deployed, we conduct nightly integration tests to confirm its continued acceptable behavior.
Continually testing the deployed app for consistency can help ensure that its behavior continues to match the expectation set at deployment. In this situation, we are considering our OpenAI - Basic Summary, which was the winner of the constrained optimization; in this new, integration-testing-focused project we create a run recording its behavior on the day of the deployment and title it Day 01 - Best LLM. We set this as the baseline of our project so that incoming runs, conducted on a daily basis, immediately trigger the integration tests and create a new test session.
The tests under consideration are tagged into 3 groups:
Parametric tests of summary quality, which confirm that the median of each of the 6 summary quality distributions matches the baseline;
Parametric tests of text quality, which confirm that the minimum viable behavior of the 4 text quality distributions enforced during the prompt engineering process is maintained (these are the exact same tests from earlier); and
Nonparametric tests for consistency, using a scaled version of the Chi-squared statistic, on all 10 columns, confirming no significant change in distributions (at the 0.25 significance level).
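These tests are computed by dbnl itself; to build intuition for the nonparametric check, here is a rough sketch of a plain two-sample chi-squared comparison between the baseline run and a new run on a single Likert-scored column (the scaled statistic dbnl uses may differ).

```python
import numpy as np
from scipy.stats import chi2_contingency


def scores_consistent(baseline: np.ndarray, candidate: np.ndarray, alpha: float = 0.25) -> bool:
    """Return True if the candidate's 1-5 score distribution is consistent with the baseline."""
    bins = np.arange(1, 7)  # bin edges covering the Likert values 1..5
    base_counts, _ = np.histogram(baseline, bins=bins)
    cand_counts, _ = np.histogram(candidate, bins=bins)

    # Two-sample chi-squared test on the 2x5 contingency table of counts.
    table = np.vstack([base_counts, cand_counts])
    table = table[:, table.sum(axis=0) > 0]  # drop Likert values unseen in both runs
    _, p_value, _, _ = chi2_contingency(table)
    return p_value >= alpha  # the assertion fails when p < 0.25
```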
When nightly runs are submitted to dbnl, the completed test sessions will appear on the project page, as shown below.
Each day that an integration test passes, we can be assured that no significant departure from expected behavior has occurred. On day 7, however, some aberrant behavior was observed; the failed assertions are shown in the figure below.
When we click into each of these failed assertions, we can see that the deviation is not massive, but it is enough to trigger a failure at the 0.25 statistical significance level. A sample is in the figure below.
At this point, several possible options exist.
Adjust the tests - The team could decide that, upon further review of the individual summaries, this is actually acceptable behavior and adjust the statistical thresholds accordingly (to, e.g., 0.35).
Manually create more data - The team could gather more information by immediately rerunning the 400 summarization prompts manually, producing a new run and a resulting test session;
Wait for more data in the current workflow - If this app were low risk, the team could wait until tomorrow for another nightly run and review those results when they come in; or
Immediately change apps - The team could deploy an alternate, acceptably performing app into production while offline analysis of this current app is conducted.
In this project, our team takes the 4th option and immediately swaps out the best LLM for the second best LLM from our king-of-the-hill prompt engineering. The day 08 integration test shows that this app still exhibits satisfactory behavior (even if the expected performance may be slightly lower, as observed during prompt engineering). At this point, the second best LLM from day 08 becomes the new baseline, while the previously best LLM is studied offline to better understand this change in behavior.