LLM-as-judge and Embedding Metrics

A common strategy for evaluating unstructured text applications is to use other LLMs and text embedding models to drive metrics of interest.

Supported LLM and embedding model services

The LLM-as-judge text metrics in dbnl.eval support OpenAI, Azure OpenAI, and any other third-party LLM / embedding model provider that is compatible with the OpenAI python client. Specifically, third-party endpoints should (mostly) adhere to the OpenAI API request and response schema.

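For reference, an OpenAI-compatible provider generally exposes chat completions and embeddings routes that accept payloads shaped roughly like the following. The values below are illustrative only, not dbnl-specific requirements:

# POST {base_url}/chat/completions  -- the route used for LLM-as-judge prompts
chat_request = {
    "model": "meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo",
    "messages": [{"role": "user", "content": "Rate the answer on a 1-5 scale ..."}],
}

# POST {base_url}/embeddings  -- the route used for embedding metrics
embedding_request = {
    "model": "text-embedding-ada-002",
    "input": ["text to embed"],
}
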
The following examples show how to initialize an llm_eval_client and an eval_embedding_client under different providers.

OpenAI

import os

from openai import OpenAI
from dbnl.eval.llm import OpenAILLMClient
from dbnl.eval.embedding_clients import OpenAIEmbeddingClient

# create client for LLM-as-judge metrics
base_oai_client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
eval_llm_client = OpenAILLMClient.from_existing_client(
    base_oai_client, llm_model="gpt-3.5-turbo-0125"
)

# create client for embedding metrics
embd_client = OpenAIEmbeddingClient.from_existing_client(
    base_oai_client, embedding_model="text-embedding-ada-002"
)

Azure OpenAI

import os

from openai import AzureOpenAI
from dbnl.eval.llm import AzureOpenAILLMClient
from dbnl.eval.embedding_clients import AzureOpenAIEmbeddingClient

base_azure_oai_client = AzureOpenAI(
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version=os.environ["OPENAI_API_VERSION"], # eg 2023-12-01-preview
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"] # eg https://resource-name.openai.azure.com
)
# create client for LLM-as-judge metrics
eval_llm_client = AzureOpenAILLMClient.from_existing_client(
    base_azure_oai_client, llm_model="gpt-35-turbo-16k"
)
# create client for embedding metrics
embd_client = AzureOpenAIEmbeddingClient.from_existing_client(
    base_azure_oai_client, embedding_model="text-embedding-ada-002"
)

TogetherAI (or other OpenAI-compatible services / endpoints)

import os

from openai import OpenAI
from dbnl.eval.llm import OpenAILLMClient

base_oai_client = OpenAI(
    api_key=os.environ["TOGETHERAI_API_KEY"],
    base_url="https://api.together.xyz/v1",
)

# create client for LLM-as-judge metrics
eval_llm_client = OpenAILLMClient.from_existing_client(
    base_oai_client, llm_model="meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo"
)

Missing Metric Values

Some LLM-as-judge metrics may occasionally return responses that cannot be parsed. These metric values will surface as None.

Distributional accepts dataframes that include None values. The platform will filter them when applicable.
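
As a minimal illustration (the column names below are hypothetical, not produced by dbnl.eval), a results dataframe with one unparseable judge response can be passed along as-is:

import pandas as pd

# one judge response could not be parsed, so the metric value is None
results_df = pd.DataFrame(
    {
        "prediction": ["Paris is the capital of France.", "The moon is made of cheese."],
        "llm_answer_quality": [4, None],  # hypothetical metric column
    }
)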

Throughput and Rate Limits

LLM service providers often impose request rate limits and token throughput caps. Some example errors that one might encounter are shown below:

{'code': '429', 'message': 'Requests to the Embeddings_Create Operation under
  Azure OpenAI API version XXXX have exceeded call rate limit of your current
  OpenAI pricing tier. Please retry after 86400 seconds.
  Please go here: https://aka.ms/oai/quotaincrease if you would
  like to further increase the default rate limit.'}

{'message': 'You have been rate limited. Your rate limit is YYY queries per
minute. Please navigate to https://www.together.ai/forms/rate-limit-increase
to request a rate limit increase.', 'type': 'credit_limit',
'param': None, 'code': None}

{'message': 'Rate limit reached for gpt-4 in organization XXXX on
tokens per min (TPM): Limit WWWWW, Used YYYY, Requested ZZZZ.
Please try again in 1.866s. Visit https://platform.openai.com/account/rate-limits
to learn more.', 'type': 'tokens', 'param': None, 'code': 'rate_limit_exceeded'}

If you encounter these errors, please work with your LLM service provider to adjust your limits. You can also reach out to Distributional support with the issue you are seeing.
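
On the client side, one mitigation to consider is raising the OpenAI python client's built-in retry budget, since the client automatically retries rate-limited requests with exponential backoff. A minimal sketch:

import os

from openai import OpenAI

# max_retries raises the client's default retry budget for 429 / transient errors
base_oai_client = OpenAI(
    api_key=os.environ["OPENAI_API_KEY"],
    max_retries=5,
)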
