LLM-as-judge and Embedding Metrics


A common strategy for evaluating unstructured text applications is to use other LLMs and text embedding models to drive metrics of interest.

Supported LLM and model services

The LLM-as-judge metrics in dbnl.eval support OpenAI, Azure OpenAI, and any other third-party LLM / embedding model provider that is compatible with the OpenAI Python client. Specifically, third-party endpoints should (mostly) adhere to the schema of:

  • the v1/chat/completions endpoint for LLMs

  • the v1/embeddings endpoint for embedding models

The following examples show how to initialize an eval LLM client (eval_llm_client) and an eval embedding client (embd_client) for different providers.

OpenAI

import os

from openai import OpenAI
from dbnl.eval.llm import OpenAILLMClient
from dbnl.eval.embedding_clients import OpenAIEmbeddingClient

# create client for LLM-as-judge metrics
base_oai_client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
eval_llm_client = OpenAILLMClient.from_existing_client(
    base_oai_client, llm_model="gpt-3.5-turbo-0125"
)

# create client for embedding-based metrics
embd_client = OpenAIEmbeddingClient.from_existing_client(
    base_oai_client, embedding_model="text-embedding-ada-002"
)

Azure OpenAI

import os

from openai import AzureOpenAI
from dbnl.eval.llm import AzureOpenAILLMClient
from dbnl.eval.embedding_clients import AzureOpenAIEmbeddingClient

base_azure_oai_client = AzureOpenAI(
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version=os.environ["OPENAI_API_VERSION"],  # e.g. 2023-12-01-preview
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],  # e.g. https://resource-name.openai.azure.com
)

# create client for LLM-as-judge metrics
eval_llm_client = AzureOpenAILLMClient.from_existing_client(
    base_azure_oai_client, llm_model="gpt-35-turbo-16k"
)

# create client for embedding-based metrics
embd_client = AzureOpenAIEmbeddingClient.from_existing_client(
    base_azure_oai_client, embedding_model="text-embedding-ada-002"
)

TogetherAI (or other OpenAI compatible service / endpoints)

import os

from openai import OpenAI
from dbnl.eval.llm import OpenAILLMClient

# point the OpenAI client at any OpenAI-compatible endpoint
base_oai_client = OpenAI(
    api_key=os.environ["TOGETHERAI_API_KEY"],
    base_url="https://api.together.xyz/v1",
)

# create client for LLM-as-judge metrics
eval_llm_client = OpenAILLMClient.from_existing_client(
    base_oai_client, llm_model="meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo"
)

Missing Metric Values

Some LLM-as-judge metrics may occasionally return values that cannot be parsed. These metric values will surface as None.

Distributional accepts dataframes that include None values. The platform will intelligently filter them when applicable.
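
As an illustration, a column of LLM-as-judge scores with an unparseable response simply carries None. The minimal sketch below (the dataframe and column names are hypothetical) shows how to inspect missing metric values with pandas before reporting results:

import pandas as pd

# hypothetical LLM-as-judge scores; the second response could not be parsed
eval_results_df = pd.DataFrame({
    "prediction": ["summary A", "summary B", "summary C"],
    "answer_quality": [4, None, 5],  # unparseable judge output surfaces as None
})

# count missing metric values before uploading the dataframe
print(eval_results_df["answer_quality"].isna().sum())  # 1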

Throughput and Rate Limits

LLM service providers often impose request rate limits and token throughput caps. Some example errors that one might encounter are shown below:

{'code': '429', 'message': 'Requests to the Embeddings_Create Operation under 
  Azure OpenAI API version XXXX have exceeded call rate limit of your current 
  OpenAI pricing tier. Please retry after 86400 seconds. 
  Please go here: https://aka.ms/oai/quotaincrease if you would 
  like to further increase the default rate limit.'}
{'message': 'You have been rate limited. Your rate limit is YYY queries per
minute. Please navigate to https://www.together.ai/forms/rate-limit-increase 
to request a rate limit increase.', 'type': 'credit_limit', 
'param': None, 'code': None}
{'message': 'Rate limit reached for gpt-4 in organization XXXX on 
tokens per min (TPM): Limit WWWWW, Used YYYY, Requested ZZZZ. 
Please try again in 1.866s. Visit https://platform.openai.com/account/rate-limits 
to learn more.', 'type': 'tokens', 'param': None, 'code': 'rate_limit_exceeded'}

If you experience these errors, please work with your LLM service provider to adjust your limits. You can also reach out to Distributional support with the issue you are seeing.
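
If a quota increase is not immediately available, one common mitigation is to retry failed requests with exponential backoff. The sketch below is an illustration only (not a dbnl.eval feature) and assumes the OpenAI Python client v1.x with OPENAI_API_KEY set in the environment:

import random
import time

import openai
from openai import OpenAI

client = OpenAI()

def embed_with_backoff(texts, model="text-embedding-ada-002", max_retries=5):
    # retry an embeddings request, backing off exponentially on 429 errors
    for attempt in range(max_retries):
        try:
            return client.embeddings.create(model=model, input=texts)
        except openai.RateLimitError:
            # wait 2^attempt seconds plus jitter before retrying
            time.sleep(2 ** attempt + random.random())
    raise RuntimeError("embeddings request kept hitting the rate limit")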
