LLM-as-judge and Embedding Metrics

A common strategy for evaluating unstructured text applications is to use other LLMs and text embedding models to drive metrics of interest.

Supported LLM and model services

The LLM-as-judge text metrics in dbnl.eval support OpenAI, Azure OpenAI, and any other third-party LLM / embedding model provider that is compatible with the OpenAI Python client. Specifically, third-party endpoints should (mostly) adhere to the OpenAI API schema: chat completions served at POST /v1/chat/completions and embeddings at POST /v1/embeddings.

The following examples show how to initialize an llm_eval_client and an eval_embedding_client under different providers.

OpenAI

import os

from openai import OpenAI
from dbnl.eval.llm import OpenAILLMClient
from dbnl.eval.embedding_clients import OpenAIEmbeddingClient

# create client for LLM-as-judge metrics
base_oai_client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
eval_llm_client = OpenAILLMClient.from_existing_client(
    base_oai_client, llm_model="gpt-3.5-turbo-0125"
)

# create client for embedding-based metrics
eval_embedding_client = OpenAIEmbeddingClient.from_existing_client(
    base_oai_client, embedding_model="text-embedding-ada-002"
)

Azure OpenAI
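
The following is a minimal sketch for Azure OpenAI. It assumes that from_existing_client also accepts the API-compatible AzureOpenAI client from the openai package; the endpoint, API version, and deployment names are placeholders to replace with your own.

import os

from openai import AzureOpenAI
from dbnl.eval.llm import OpenAILLMClient
from dbnl.eval.embedding_clients import OpenAIEmbeddingClient

# Azure OpenAI exposes the same API surface as OpenAI, so the base
# client can be swapped in; endpoint and API version are placeholders
base_azure_client = AzureOpenAI(
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-02-01",
    azure_endpoint="https://<your-resource>.openai.azure.com",
)

# llm_model / embedding_model refer to your Azure deployment names
eval_llm_client = OpenAILLMClient.from_existing_client(
    base_azure_client, llm_model="<your-gpt-deployment>"
)
eval_embedding_client = OpenAIEmbeddingClient.from_existing_client(
    base_azure_client, embedding_model="<your-embedding-deployment>"
)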

TogetherAI (or other OpenAI compatible service / endpoints)
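
The following is a minimal sketch for TogetherAI, assuming an OpenAI-compatible endpoint at https://api.together.xyz/v1. The model names are illustrative; substitute models available from your provider. The same pattern applies to any other OpenAI-compatible service by changing base_url.

import os

from openai import OpenAI
from dbnl.eval.llm import OpenAILLMClient
from dbnl.eval.embedding_clients import OpenAIEmbeddingClient

# any OpenAI-compatible endpoint can be targeted via base_url
base_together_client = OpenAI(
    api_key=os.environ["TOGETHER_API_KEY"],
    base_url="https://api.together.xyz/v1",
)

eval_llm_client = OpenAILLMClient.from_existing_client(
    base_together_client, llm_model="mistralai/Mixtral-8x7B-Instruct-v0.1"
)
eval_embedding_client = OpenAIEmbeddingClient.from_existing_client(
    base_together_client,
    embedding_model="togethercomputer/m2-bert-80M-8k-retrieval",
)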

Missing Metric Values

Some of the LLM-as-judge metrics may occasionally return responses that cannot be parsed. These metric values will surface as None.

Distributional is able to accept dataframes that include None values; the platform will filter them out when applicable.
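
For example, a results dataframe might look like the sketch below; the column name faithfulness_llm_judge is hypothetical, and the row where the judge output could not be parsed carries None.

import pandas as pd

# hypothetical judge scores; the second response could not be parsed
results_df = pd.DataFrame({
    "response": ["answer A", "answer B", "answer C"],
    "faithfulness_llm_judge": [4, None, 5],
})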

Throughput and Rate Limits

LLM service providers often impose request rate limits and token throughput caps. When these limits are exceeded, the OpenAI Python client raises errors such as openai.RateLimitError (HTTP 429).
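
As a sketch of one common mitigation (not a dbnl.eval API), rate-limited requests can be retried with exponential backoff; the helper below is illustrative.

import time

import openai

def complete_with_backoff(client, max_retries=5, **request_kwargs):
    # retry a chat completion, sleeping 1s, 2s, 4s, ... between attempts
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(**request_kwargs)
        except openai.RateLimitError:
            time.sleep(2 ** attempt)
    raise RuntimeError("rate limit retries exhausted")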

If you experience these errors, please work with your LLM service provider to adjust your limits. You can also reach out to Distributional support with the issue you are seeing.
