Setting the Scene

These are the objects and compute resources used to design this summarization app.

Overview

In this tutorial, we will demonstrate how to use dbnl to automatically evaluate the consistency of summarization output on a fixed set of documents. We will start by explaining why one should care about the consistency of generated text. We will cover generating summaries from self-hosted and third-party language model providers. We will review the trade-offs of using automatic metrics and then proceed to generate and store them. We will then demonstrate how to use dbnl in an Integration Test setting to regularly evaluate the consistency of the generated summaries. Finally, we will discuss some limitations of the approach and how to mitigate them.

Motivation

The primary motivation for evaluating the consistency of generated text is that Large Language Models (LLMs) are inherently stochastic: different prompts and different LLMs yield different results, and even the same prompt sent to the same LLM can yield different results across calls. This is especially apparent with third-party LLM providers (e.g. OpenAI's GPT-4), where the underlying model can change and the only way to evaluate the output is through the generated text.

Managing this complexity across a dozen or more metrics is a difficult task. dbnl was designed specifically to help manage this complexity and to provide a stateful system of record for the granular, distributional evaluation of stochastic systems.

Implementation Details

We are going to walk through the main components of the system with some example code snippets. We will cover the full dataflow, from raw documents to summary generation, summary evaluation, submitting evaluation metrics to dbnl, and finally integration testing, but the code snippets will be illustrative rather than exhaustive.

Raw Documents

Suppose that there is a business use case to summarize financial documents from the SEC. For illustrative purposes, we constructed an example set of raw documents from the SEC Financial Statements and Notes Data Sets website.

Excerpt from a Raw Document
Swap Risks The Fund and the Underlying Funds may enter into various swap agreements and, other than total return swap agreements (as discussed herein), such agreements are expected to be utilized by the Fund, if at all, for hedging purposes. All of these agreements are considered derivatives. Swap agreements are two-party contracts under which the fund and a counterparty, such as a broker or dealer, agree to exchange the returns (or differentials in rates of return) earned or realized on an agreed-upon underlying asset or investment over the term of the swap. The use of swap transactions is a highly specialized activity which involves strategies and risks different from those associated with ordinary portfolio security transactions. If the Adviser, Subadviser or an Underlying Funds investment adviser is incorrect in its forecasts of default risks, market spreads, liquidity or other applicable factors or events...

Summary Generation

We will use the ChatCompletion API standard to generate summaries daily. The ChatCompletion API is a simple API that can be used to generate summaries from a given prompt. Roughly speaking, four things are necessary to generate a summary:

  • the system prompt

  • the summary prompt

  • the raw document to be summarized

  • the LLM endpoint

The prompt used to generate the summary is a combination of the first three items above, and the endpoint can be a self-hosted LLM or one available via OpenAI, Azure OpenAI, Amazon Bedrock, or any number of other LLM providers (e.g. Together AI). Bear in mind privacy and security concerns when using third-party LLMs.

Example of ChatCompletion Messages Payload
[
  {
    "role": "system",
    "content": "Synthesize information from various domains into cohesive, understandable narratives, highlighting key insights and actionable takeaways. Ensure that your synthesis not only informs but also engages users, making the acquisition of knowledge an enjoyable experience."
  },
  {
    "role": "user",
    "content": "Create a summary that weaves the information of this document into a narrative: \n\nRaw Document: \n\nSwap Risks The Fund and the Underlying Funds may enter into various swap agreements and, other than total return swap agreements (as discussed herein), such agreements are expected to be utilized by the Fund, if at all, for hedging purposes. All of these agreements are considered derivatives. Swap agreements are two-party contracts under which the fund and a counterparty, such as a broker or dealer, agree to exchange the returns (or differentials in rates of return) earned or realized on an agreed-upon underlying asset or investment over the term of the swap. The use of swap transactions is a highly specialized activity which involves strategies and risks different from those associated with ordinary portfolio security transactions. If the Adviser, Subadviser or an Underlying Funds investment adviser is incorrect in its forecasts of default risks, market spreads, liquidity or other applicable factors or events..."
  }
]
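
For reference, such a payload can be assembled programmatically from the first three ingredients. Below is a minimal sketch; the `build_messages` helper is illustrative and not part of any library, and the template uses the `{raw_document}` placeholder convention shown in the Prompt Engineering section further down.

Assembling a Messages Payload (illustrative sketch)
from typing import List, Dict

def build_messages(system_prompt: str, summary_prompt_template: str, raw_document: str) -> List[Dict[str, str]]:
    """Combine the system prompt, summary prompt template, and raw document
    into a ChatCompletion-style messages payload."""
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": summary_prompt_template.format(raw_document=raw_document)},
    ]

# Example usage with the "Narrative Summary" template from the Prompt Engineering section
messages = build_messages(
    system_prompt="Synthesize information from various domains into cohesive, understandable narratives, ...",
    summary_prompt_template="Create a summary that weaves the information of this document into a narrative: \n\nRaw Document: \n\n{raw_document}.",
    raw_document="Swap Risks The Fund and the Underlying Funds may enter into various swap agreements ...",
)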

Then we can use the following code snippet to generate a summary using GPT-3.5 Turbo 16k (16k refers to the context length) hosted on Microsoft Azure. Note that this example uses the ChatCompletion API standard with function calling; the same pattern is applied to other endpoints later in this tutorial.

Other models that offer the ChatCompletion API include OpenAI's GPT-4, Anthropic's Claude models, several of Mistral AI's models, and various open-source models.

# messages is a list of dictionaries (example above)
response, latency_ms = await generate_summary(messages, llm_key, sem)
Implementation of `generate_summary`
import asyncio
import time
from typing import List, Dict
from asyncio import Semaphore

from pydantic import BaseModel, Field
from tenacity import retry, stop_after_attempt, wait_random_exponential
import openai
import instructor

class Summary(BaseModel):
    text: str = Field(..., description="Summary Text")

# Retry with exponential backoff in case of transient API errors or rate limits
@retry(wait=wait_random_exponential(multiplier=1, max=60), stop=stop_after_attempt(5))
async def generate_summary(
    messages: List[Dict[str, str]],
    llm_key: str,
    sem: Semaphore,
):
    # The API key is picked up from the environment (e.g. AZURE_OPENAI_API_KEY)
    client = openai.AsyncAzureOpenAI(
        api_version="2023-12-01-preview",
        azure_endpoint='https://<subdomain>.openai.azure.com',
    )
    # Patch the client so responses are parsed into the Summary pydantic model
    client = instructor.patch(client, mode=instructor.Mode.TOOLS)
    async with sem:
        # Brief pause to stay under the endpoint's requests-per-minute limit
        await asyncio.sleep(60 / 500)
        start = time.time()
        response = await client.chat.completions.create(
            model='gpt-35-turbo-16k',
            temperature=0.8,
            messages=messages,
            response_model=Summary,
        )
        end = time.time()
        latency_ms = (end - start) * 1000
    return response, latency_ms
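
The `sem` argument above is an `asyncio.Semaphore` that, together with the short sleep, throttles concurrent requests so we stay under the endpoint's rate limit. A minimal sketch of fanning `generate_summary` out over a batch of documents might look like the following; the concurrency limit of 10 and the `summarize_all` helper are illustrative choices, not part of the tutorial code.

Running Summary Generation Concurrently (illustrative sketch)
import asyncio

async def summarize_all(list_of_messages, llm_key):
    # Cap the number of in-flight requests (illustrative limit)
    sem = asyncio.Semaphore(10)
    tasks = [generate_summary(messages, llm_key, sem) for messages in list_of_messages]
    # Each element of the result is a (Summary, latency_ms) tuple
    return await asyncio.gather(*tasks)

# results = asyncio.run(summarize_all(all_messages, llm_key))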

Prompt Engineering

The prompt is the easiest part of the summary generation pipeline to alter. Selecting prompts should involve more than a simple "vibe check"; it should be based on an analysis that is omitted from this tutorial but will be covered in future tutorials.

Below are examples for system prompts and summary prompts:

Examples of a System Prompt
[
    {
        "name": "Strategic Planner",
        "system_message_prompt": "Deliver strategic insights and action plans that resonate with specific goals, demonstrating a keen understanding of both the big picture and the critical details. Your advice should be actionable, guiding users toward achieving their objectives with clear, step-by-step strategies."
    },
    {
        "name": "Information Synthesizer",
        "system_message_prompt": "Synthesize information from various domains into cohesive, understandable narratives, highlighting key insights and actionable takeaways. Ensure that your synthesis not only informs but also engages users, making the acquisition of knowledge an enjoyable experience.",
    }
]
Examples of a Summary Prompt
[
    {
        "name": "Basic Summary",
        "description": "Generates a straightforward summary capturing the key points and essential information of the document without additional analysis or interpretation.",
        "summary_prompt_template": "Generate a summary that captures the key points and essential information of this document: \n\nRaw Document: \n\n{raw_document}."
    },
    {
        "name": "Analytical Summary",
        "description": "Produces a summary that not only recounts the main points but also analyzes the implications, significance, or context of the information, adding depth to the understanding of the document.",
        "summary_prompt_template": "Produce a summary that analyzes the implications, significance, or context of the information in this document: \n\nRaw Document: \n\n{raw_document}."
    },
    {
        "name": "Critical Summary",
        "description": "Offers a summary that evaluates the strengths and weaknesses of the document's arguments, providing a critical perspective on the content.",
        "summary_prompt_template": "Offer a summary that evaluates the strengths and weaknesses of the arguments in this document: \n\nRaw Document: \n\n{raw_document}."
    },
    {
        "name": "Narrative Summary",
        "description": "Creates a summary that weaves the document's information into a narrative, making the summary more engaging and story-like, suitable for capturing the essence of narrative texts or storytelling.",
        "summary_prompt_template": "Create a summary that weaves the information of this document into a narrative: \n\nRaw Document: \n\n{raw_document}."
    }
]
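
Because both prompt types are plain templates, it is straightforward to enumerate every system prompt / summary prompt pairing for a given document and compare the resulting summaries. Below is a minimal sketch, assuming the two JSON lists above have been loaded into `system_prompts` and `summary_prompts`; the `iter_prompt_combinations` helper is illustrative.

Enumerating Prompt Combinations (illustrative sketch)
import itertools
from typing import Dict, Iterator, List, Tuple

def iter_prompt_combinations(
    system_prompts: List[Dict[str, str]],
    summary_prompts: List[Dict[str, str]],
    raw_document: str,
) -> Iterator[Tuple[str, List[Dict[str, str]]]]:
    """Yield (combination_name, messages) pairs for every system/summary prompt pairing."""
    for sys_p, sum_p in itertools.product(system_prompts, summary_prompts):
        messages = [
            {"role": "system", "content": sys_p["system_message_prompt"]},
            {"role": "user", "content": sum_p["summary_prompt_template"].format(raw_document=raw_document)},
        ]
        yield f"{sys_p['name']} / {sum_p['name']}", messages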

Frameworks

There are over a dozen frameworks and libraries that offer automatic metrics and dataflow assistance for generating summaries and processing their evaluations. In this tutorial we write some custom code, but dbnl can be used in conjunction with any of these frameworks.

Summary Evaluation

The next step is to evaluate the generated summary, and we will use LLMs to do so. In fact, we will ask an LLM to evaluate the summary generated by another LLM and use that as a proxy for human evaluation. This is a common practice in the field of summarization and is a good way to get a sense of the quality of the generated summary.

In our tutorial, we use both generic text metrics, which don't require any knowledge of the raw document and can be applied to any text, and summary metrics, which assess the quality of the summary and require the raw document. All metrics are on a Likert scale, requesting a number between 1 and 5, where higher is "better".

We enforce that the API response falls within this 1 to 5 scale by leveraging the pydantic and instructor libraries; behind the scenes, this is still an ordinary ChatCompletion API call to the LLM endpoint. The evaluation prompts are listed below, and a minimal sketch of a metric evaluation call follows them.

Generic Text Metric Evaluation Prompts
[
    {
        "name": "text_fluency",
        "prompt": "Evaluate the text's fluency by considering its readability and ease of understanding. Rate the fluency on a scale from 1 to 5, where 1 indicates challenging to read with frequent disruptions in flow, and 5 represents exceptionally fluid with seamless transitions and clear expression."
    },
    {
        "name": "text_toxicity",
        "prompt": "Assess the text for any toxic content that might alienate or offend readers. Rate the text's toxicity from 1 to 5, where 1 denotes contains highly offensive or harmful material, and 5 signifies completely free of toxic content, fostering a respectful and inclusive environment."
    },
    {
        "name": "sentiment_assessment",
        "prompt": "Determine the overall sentiment conveyed by the text. Provide a rating from 1 to 5, where 1 indicates overwhelmingly negative sentiment, and 5 represents unequivocally positive sentiment, based on your analysis."
    },
    {
        "name": "reading_complexity",
        "prompt": "Analyze the text for its level of reading complexity. Rate the complexity from 1 to 5, where 1 indicates highly complex, requiring advanced understanding or specialized knowledge, and 5 denotes easily accessible and understandable by a wide audience."
    }
]
Summary Metric Evaluation Prompts
[
    {
        "name": "summary_coherence",
        "prompt": "Examine the summary's coherence by evaluating its logical flow and consistency. Rate the summary's coherence from 1 to 5, where 1 lacks logical flow and consistency, and 5 maintains exceptional logical flow and consistency throughout."
    },
    {
        "name": "content_relevance",
        "prompt": "Assess the summary's relevance, focusing on how well it encapsulates the key points. Rate from 1 to 5, where 1 fails to capture or misrepresents key points, resulting in a not relevant summary, and 5 is highly relevant, accurately reflecting the core insights."
    },
    {
        "name": "bias_assessment",
        "prompt": "Rate the summary's neutrality by assessing the presence of bias. A score of 5 indicates no bias, and 1 indicates high bias, where the summary unjustly skews or distorts the information from the raw document."
    },
    {
        "name": "logical_sequencing",
        "prompt": "Examine the summary's ability to preserve the logical order and progression of ideas. A rating of 1 indicates a disjointed or illogically structured summary, while a rating of 5 signifies a summary that adeptly mirrors the logical flow of the source material."
    },
    {
        "name": "essential_information_capture",
        "prompt": "Evaluate the summary’s comprehensiveness in encapsulating the crucial information. Rate from 1 to 5, where 1 indicates significant omissions or inclusion of irrelevant details, and 5 represents a concise yet complete encapsulation of essential information."
    },
    {
        "name": "bias_evaluation",
        "prompt": "Critically assess the summary for any indications of partiality or bias. A score of 1 indicates a summary heavily influenced by bias, while a score of 5 reflects a balanced and faithful representation of the source material’s neutrality."
    }
]
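
Each metric above is requested through the same ChatCompletion pattern used for summary generation, with pydantic and instructor constraining the response to an integer from 1 to 5. Below is a minimal sketch; the `MetricScore` model and `evaluate_metric` helper are illustrative, and for the summary metrics the raw document would be included in the user message as well.

Evaluating a Metric (illustrative sketch)
from asyncio import Semaphore
from typing import Dict, List

import instructor
import openai
from pydantic import BaseModel, Field

class MetricScore(BaseModel):
    # Constrain the response to the 1-5 Likert scale
    score: int = Field(..., ge=1, le=5, description="Metric score from 1 (worst) to 5 (best)")

async def evaluate_metric(metric_prompt: str, summary_text: str, sem: Semaphore) -> int:
    client = instructor.patch(
        openai.AsyncAzureOpenAI(
            api_version="2023-12-01-preview",
            azure_endpoint="https://<subdomain>.openai.azure.com",
        ),
        mode=instructor.Mode.TOOLS,
    )
    # For the summary metrics, the raw document would be appended to the user message as well
    messages: List[Dict[str, str]] = [
        {"role": "system", "content": metric_prompt},
        {"role": "user", "content": f"Summary:\n\n{summary_text}"},
    ]
    async with sem:
        response = await client.chat.completions.create(
            model="gpt-35-turbo-16k",
            temperature=0,
            messages=messages,
            response_model=MetricScore,
        )
    return response.score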

Submitting Evaluation Metrics to dbnl

Run Configuration

You define the schema of the columns present in the test data through your run configuration. This is important to ensure that the data is in the correct format before it is sent to dbnl.

Notice that we are also capturing latency, in addition to all the metrics described above. This is important for understanding the performance of the system and can be used to detect performance changes or degradation over time.

Run Configuration (run_config.json)
{
    "columns": [
        {
            "component": "Application",
            "name": "summary_completion_tokens",
            "type": "int",
            "description": "Number of tokens in the summary"
        },
        {
            "component": "Application",
            "name": "summary_prompt_tokens",
            "type": "int",
            "description": "Number of tokens in the summary prompt"
        },
        {
            "component": "Application",
            "name": "summary_total_tokens",
            "type": "int",
            "description": "Total number of tokens in the summary"
        },
        {
            "component": "Application",
            "name": "metrics_completion_tokens",
            "type": "int",
            "description": "Number of tokens in the metrics"
        },
        {
            "component": "Application",
            "name": "metrics_prompt_tokens",
            "type": "int",
            "description": "Number of tokens in the metrics prompt"
        },
        {
            "component": "Application",
            "name": "metrics_total_tokens",
            "type": "int",
            "description": "Total number of tokens in the metrics"
        },
        {
            "component": "Application",
            "name": "summary_latency_ms",
            "type": "float",
            "description": "Latency in milliseconds for generating the summary"
        },
        {
            "component": "Application",
            "name": "metrics_latency_ms",
            "type": "float",
            "description": "Latency in milliseconds for generating the metrics"
        },
        {
            "component": "SummaryGeneration",
            "name": "raw_document_id",
            "type": "string",
            "description": "Unique identifier for the raw document"
        },
        {
            "component": "SummaryGeneration",
            "name": "raw_document",
            "type": "string",
            "description": "Raw document text that the summary was generated from"
        },
        {
            "component": "SummaryGeneration",
            "name": "summary_text",
            "type": "string",
            "description": "Generated summary text"
        },
        {
            "component": "MetricEvaluation",
            "name": "text_fluency_score",
            "type": "int",
            "description": "Score for text fluency"
        },
        {
            "component": "MetricEvaluation",
            "name": "text_toxicity_score",
            "type": "int",
            "description": "Score for text toxicity"
        },
        {
            "component": "MetricEvaluation",
            "name": "sentiment_assessment_score",
            "type": "int",
            "description": "Score for sentiment assessment"
        },
        {
            "component": "MetricEvaluation",
            "name": "reading_complexity_score",
            "type": "int",
            "description": "Score for reading complexity"
        },
        {
            "component": "MetricEvaluation",
            "name": "summary_coherence_score",
            "type": "int",
            "description": "Score for summary coherence"
        },
        {
            "component": "MetricEvaluation",
            "name": "content_relevance_score",
            "type": "int",
            "description": "Score for content relevance"
        },
        {
            "component": "MetricEvaluation",
            "name": "bias_assessment_score",
            "type": "int",
            "description": "Score for bias assessment"
        },
        {
            "component": "MetricEvaluation",
            "name": "logical_sequencing_score",
            "type": "int",
            "description": "Score for logical sequencing"
        },
        {
            "component": "MetricEvaluation",
            "name": "essential_information_capture_score",
            "type": "int",
            "description": "Score for essential information capture"
        },
        {
            "component": "MetricEvaluation",
            "name": "bias_evaluation_score",
            "type": "int",
            "description": "Score for bias evaluation"
        },
        {
            "component": "MetricEvaluation",
            "name": "summary_total_score",
            "type": "int",
            "description": "Total score for the summary metrics"
        }
    ],
    "description": "This is a summary evaluation run config.",
    "display_name": "Summary Evaluation Demo Run Config",
    "row_id": [
        "raw_document_id"
    ],
    "components_dag": {
        "Application": [],
        "SummaryGeneration": [],
        "MetricEvaluation": []
    }
}
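
With the schema above in place, each run's test data is a table whose columns match the run configuration, keyed by `raw_document_id`, written out as a parquet file that the push snippet below picks up. Below is a minimal sketch of assembling one run's results; the record values and filename are purely illustrative.

Assembling Test Data for a Run (illustrative sketch)
import pandas as pd

# One record per raw document; keys must match the column names in run_config.json.
# Values here are illustrative only, and the remaining token counts and metric
# scores follow the same pattern.
records = [
    {
        "raw_document_id": "doc-0001",
        "raw_document": "Swap Risks The Fund and the Underlying Funds may enter into ...",
        "summary_text": "The fund uses swap agreements primarily for hedging ...",
        "summary_latency_ms": 812.4,
        "metrics_latency_ms": 341.7,
        "summary_completion_tokens": 187,
        "text_fluency_score": 5,
        "summary_coherence_score": 4,
        # ... and so on for the rest of the columns in the run configuration
    },
]

# raw_document_id is the row id declared in the run configuration; the push snippet
# below calls reset_index() to turn it back into a regular column before reporting results.
test_data = pd.DataFrame.from_records(records).set_index("raw_document_id")
test_data.to_parquet("summary_run_2024_01_01.parquet")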

Push the runs to dbnl

Similar to the Hello World and the Trading Strategy tutorials, we will push the runs to dbnl. The code snippet below demonstrates how to push the runs to dbnl once the test data is prepared.

Pushing completed runs to dbnl
import json
import os
from datetime import datetime
import glob

import pandas as pd
import dbnl

def read_json_file(file_path):
    with open(file_path, "r") as file:
        return json.load(file)

def create_datetime_now_suffix(project_name):
    return f"{project_name}_{datetime.now().strftime('%Y-%m-%d_%H-%M-%S')}"

dbnl.login(api_token=os.environ["DBNL_API_TOKEN"],
           api_url='https://api.dbnl.com')

PROJECT_NAME = 'LLM Summarization Demo Project'

proj = dbnl.get_or_create_project(
    name=create_datetime_now_suffix(PROJECT_NAME), 
    description="LLM Summarization Integration Testing"
)

run_config = dbnl.create_run_config(
    project=proj, 
    **read_json_file("run_config.json")
)

# Each parquet file holds the test data for one run
for i, filename in enumerate(sorted(glob.glob("*.parquet"))):
    test_data = pd.read_parquet(filename)
    run_name = filename.replace(".parquet", "")
    run = dbnl.create_run(
        project=proj,
        display_name=run_name,
        run_config=run_config,
        metadata=test_data.attrs,
    )
    dbnl.report_results(run=run, column_data=test_data.reset_index())
    dbnl.close_run(run=run)
    if i > 0:
        # Every run after the first kicks off a test session against the current baseline
        dbnl.create_test_session(experiment_run=run)
    if i == 0:
        # The first run becomes the baseline, and the tests are registered once
        dbnl.set_run_as_baseline(run=run)
        test_payloads = read_json_file("test_payloads.json")
        for test_payload in test_payloads:
            test_spec_dict = dbnl.experimental.prepare_incomplete_test_spec_payload(
                test_spec_dict=test_payload,
                project_id=proj.id,
            )
            dbnl.experimental.create_test(test_spec_dict=test_spec_dict)
    if i == 7:
        # Later on, promote a newer run to be the baseline for subsequent test sessions
        dbnl.set_run_as_baseline(run=run)

Was this helpful?