
How-To / FAQ

What if I do not have an LLM service to run LLM-as-judge metrics?

No problem: simply omit the eval_llm_client and eval_embedding_client arguments from your call(s) to the evaluation helpers. The helpers will automatically exclude any metrics that depend on them.

# BEFORE: default text metrics, including those requiring target (ground_truth) and LLM-as-judge
text_metrics = dbnl.eval.metrics.text_metrics(
    prediction="prediction", target="ground_truth", eval_llm_client=oai_client
)

# AFTER: remove eval_llm_client to exclude LLM-as-judge metrics
text_metrics = dbnl.eval.metrics.text_metrics(
    prediction="prediction", target="ground_truth"
)

aug_eval_df = evaluate(eval_df, text_metrics)

What if I do not have ground-truth available?

No problem. Simply remove the target argument from the helper call; the metric set helper will automatically exclude any metrics that depend on the target column.

# BEFORE: default text metrics, including those requiring target (ground_truth) and LLM-as-judge
text_metrics = dbnl.eval.metrics.text_metrics(
    prediction="prediction", target="ground_truth", eval_llm_client=oai_client
)

# AFTER: remove target to exclude metrics that depend on it being specified
text_metrics = dbnl.eval.metrics.text_metrics(
    prediction="prediction", eval_llm_client=oai_client
)

aug_eval_df = evaluate(eval_df, text_metrics)
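
If neither ground truth nor an LLM-as-judge client is available, the two adjustments above can be combined. Based on the behavior described in both answers, the helper should then keep only the metrics that need nothing beyond the prediction column. A minimal sketch:

# ASSUMPTION: combining both omissions above; only prediction-based metrics should remain
text_metrics = dbnl.eval.metrics.text_metrics(prediction="prediction")

aug_eval_df = evaluate(eval_df, text_metrics)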

There is an additional helper, text_monitor_metrics(), that generates a list of generic metrics appropriate for “monitoring” unstructured text columns. Simply provide a list of text column names and, optionally, an eval_llm_client for LLM-as-judge metrics.

# get text metrics for each column in list
monitor_metrics = dbnl.eval.metrics.text_monitor_metrics(
  ["prediction", "input"], eval_llm_client=oai_client
)

aug_eval_df = evaluate(eval_df, monitor_metrics)
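
Because eval_llm_client is optional, text_monitor_metrics() can also be called without it; as with text_metrics(), the LLM-as-judge metrics should then be excluded. A minimal sketch:

# ASSUMPTION: omitting the optional eval_llm_client excludes LLM-as-judge monitor metrics
monitor_metrics = dbnl.eval.metrics.text_monitor_metrics(["prediction", "input"])

aug_eval_df = evaluate(eval_df, monitor_metrics)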

How do I create a custom LLM-as-judge metric?

You can write your own LLM-as-judge metric that uses a custom prompt. The example below defines a custom LLM-as-judge metric and runs it on an example dataframe.

import dbnl
import os
import pandas as pd
from openai import OpenAI
from dbnl.eval.llm import OpenAILLMClient
from dbnl.eval import evaluate
from dbnl.eval.metrics.mlflow import MLFlowGenAIFromPromptEvaluationMetric
from dbnl.eval.metrics.metric import Metric
from dbnl.eval.llm.client import LLMClient

# 1. create client to power LLM-as-judge metrics
base_oai_client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
oai_client = OpenAILLMClient.from_existing_client(base_oai_client, llm_model="gpt-3.5-turbo-0125")

eval_df = pd.DataFrame(
    [
        { "prediction":"France has no capital",
          "ground_truth": "The capital of France is Paris",},
        { "prediction":"The capital of France is Toronto",
          "ground_truth": "The capital of France is Paris",},
        { "prediction":"Paris is the capital",
          "ground_truth": "The capital of France is Paris",},
    ] * 4
)

# 2. define a custom LLM-as-judge metric
def custom_text_similarity(prediction: str, target: str, eval_llm_client: LLMClient) -> Metric:
    custom_prompt_v0 = """
      Given the generated text: {prediction}, score the semantic similarity to the reference text: {target}.

      Rate the semantic similarity from 1 (completely different meaning and facts between the generated and reference texts) to 5 (nearly the exact same semantic meaning and facts present in the generated and reference texts).

      Example output, make certain that 'score:' and 'justification:' text is present in output:
      score: 4
      justification: XYZ
    """
    
    return MLFlowGenAIFromPromptEvaluationMetric(
        name="custom_text_similarity",
        judge_prompt=custom_prompt_v0,
        prediction=prediction,
        target=target,
        eval_llm_client=eval_llm_client,
        version="v0",
    )

# 3. instantiate the custom LLM-as-judge metric
c_metric = custom_text_similarity(
  prediction='prediction', target='ground_truth', eval_llm_client=oai_client
)
# 4. run only the custom LLM-as-judge metric
aug_eval_df = evaluate(eval_df, [c_metric])
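
Because evaluate() accepts a list of metrics, a custom metric can be run alongside the built-in metric sets. A sketch, assuming text_metrics() returns a plain list that can be concatenated:

# ASSUMPTION: text_metrics() returns a list, so the custom metric can be appended to it
text_metrics = dbnl.eval.metrics.text_metrics(
    prediction="prediction", target="ground_truth", eval_llm_client=oai_client
)
aug_eval_df = evaluate(eval_df, text_metrics + [c_metric])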

You can also write a metric that uses only the prediction column; in that case, reference only {prediction} in the custom prompt. An example is below:

def custom_text_simplicity(prediction: str, target: str, eval_llm_client: LLMClient) -> Metric:
    custom_prompt_v0 = """
      Given the generated text: {prediction}, score the text from 1 to 5 based on whether it is written in simple, easy-to-understand English.

      Rate the generated text from 5 (completely simple english, very commonly used words, easy to explain vocabulary) to 1 (complex english, uncommon words, difficult to explain vocabulary).

      Example output, make certain that 'score:' and 'justification:' text is present in output:
      score: 4
      justification: XYZ
    """
    
    return MLFlowGenAIFromPromptEvaluationMetric(
        name="custom_text_simplicity",
        judge_prompt=custom_prompt_v0,
        prediction=prediction,
        target=target,
        eval_llm_client=eval_llm_client,
        version="v0",
    )
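
Following the same pattern as the earlier example, the metric can then be instantiated and run with evaluate(). The target argument is still passed here even though the custom prompt references only {prediction}.

# instantiate and run the prediction-only custom metric (mirrors the earlier example)
s_metric = custom_text_simplicity(
  prediction='prediction', target='ground_truth', eval_llm_client=oai_client
)
aug_eval_df = evaluate(eval_df, [s_metric])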
