LLM-as-Judge Metric Templates

Pre-built templates for customizing LLM-as-Judge Metrics

Custom Metric Templates

Templates for creating entirely new LLM-as-Judge Metrics:

Custom Classifier Metric
  • Evaluation Prompt:

You are a classifier that classifies the given input according to predefined labels. Carefully read the reasoning for each label, then assign exactly one. Do not include any explanation or extra text.

## Input to be classified:
{your_column_name_here}

## Possible Labels:
<your_label_here>: <your reasoning here>
<your_label_here>: <your reasoning here>
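Concretely, a template like this can be rendered and the judge's reply validated in a few lines. The sketch below is an assumption about how a caller might use the template (the `{text}` and `{labels}` placeholder names and the validation logic are illustrative, not part of the template itself), with no actual LLM call:

```python
# Sketch of using the custom classifier template. The placeholder names
# ({text}, {labels}) stand in for your own column name and label set.
CLASSIFIER_TEMPLATE = """You are a classifier that classifies the given input according to predefined labels. Carefully read the reasoning for each label, then assign exactly one. Do not include any explanation or extra text.

## Input to be classified:
{text}

## Possible Labels:
{labels}"""

def build_classifier_prompt(text: str, labels: dict) -> str:
    """Render the template; `labels` maps each label name to its reasoning."""
    label_lines = "\n".join(f"{name}: {reason}" for name, reason in labels.items())
    return CLASSIFIER_TEMPLATE.format(text=text, labels=label_lines)

def parse_label(raw: str, labels: dict) -> str:
    """The judge is told to return exactly one label; strip and verify it."""
    label = raw.strip()
    if label not in labels:
        raise ValueError(f"unexpected label: {label!r}")
    return label
```

Because the prompt forbids extra text, a strict membership check like `parse_label` catches any reply that drifts from the allowed label set.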
Custom Scorer Metric
  • Evaluation Prompt:

You are an evaluator that assigns a score to the given input, based on the reasoning defined below.

## Input to be scored:
{your_column_name_here}

## How to score:
<your reasoning here, make sure it only returns a score from [1, 2, 3, 4, 5]>
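Since the scorer template asks the judge to return only a score from [1, 2, 3, 4, 5], the caller should still validate the reply before recording it. A minimal sketch of such a guard (an assumption on the caller's side, not part of the template):

```python
# Sketch of validating a scorer reply: parse the bare score and reject
# anything outside the allowed range [1, 5].
def parse_score(raw: str, valid=range(1, 6)) -> int:
    score = int(raw.strip())
    if score not in valid:
        raise ValueError(f"score out of range: {score}")
    return score
```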

Default Metric Templates

Built-in LLM-as-Judge Metrics that can be customized by the user:

topic
  • Description: Classifies the conversation into a topic based on the input and output. This Metric is created after topics are automatically generated from the first 7 days of ingested data.

  • Type: Classifier

  • Evaluation Prompt:

The following is a conversation between an AI assistant and a user:

<messages>
<message>user: {input}</message>
<message>assistant: {output}</message>
</messages>

# Task

Your job is to classify the conversation into one of the following topics.
Use both user and assistant messages in your decision.
Carefully consider each topic and choose the most appropriate one.
If you do not think the conversation is about any of the named topics, classify it as "other".

# List of topics

- plan romantic and active day outings in various cities
- recommend nearby locations for a streetcar, bus, or bike trip
- plan an itinerary for a bar foodie crawl
- plan a customized day trip for leisure activities
- plan bike routes with stops and transportation
- plan a museum-hopping itinerary in multiple cities
- plan a family outing to visit various locations using public transportation and walking
- assist in planning a day trip or romantic evening out in a city, providing recommendations for various attractions, transportation options, and reservation links
- provide nearby dining options
- provide dining or nightlife recommendations based on location and user preferences
- other
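At evaluation time, the `{input}` and `{output}` placeholders in this prompt are substituted with the conversation's user and assistant messages. A sketch of that substitution (the function name is illustrative, not a product API):

```python
# Sketch of rendering the conversation block of the topic prompt from a
# single user/assistant exchange.
CONVERSATION_BLOCK = """The following is a conversation between an AI assistant and a user:

<messages>
<message>user: {input}</message>
<message>assistant: {output}</message>
</messages>"""

def render_conversation(user_input: str, assistant_output: str) -> str:
    return CONVERSATION_BLOCK.format(input=user_input, output=assistant_output)
```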
llm_answer_groundedness
  • Description: Given a list of Contexts and an Answer, groundedness refers to the Answer being consistent with the Contexts.

  • Type: Classifier

    • Classes: grounded, ungrounded

  • Evaluation Prompt:

You are an expert evaluator of text properties and characteristics.
Your task is to grade or label the input text or texts based on the provided definition, a detailed set of steps, and a grading rubric.
You must use the grading rubric to assign a score or label.

# Definition

Given a list of Contexts and an Answer, groundedness refers to the Answer being consistent with the Contexts.
The Answer either contains information that is supported by the Contexts or assumes information that is available in the Contexts.

Use a step-by-step thinking process to ensure high-quality consideration of the grading criteria before reaching the conclusion.

# Steps

1. Analyze the content of the Answer and the Contexts.
2. Determine if the Answer contains false information or makes assumptions not supported by the Contexts.
3. Categorize the alignment of the Answer with the Contexts as one of the following grades: grounded if the Answer is consistent with the Contexts, ungrounded otherwise.


# Grading Criteria

- grounded: The Answer is grounded in the given contexts.
- ungrounded: The Answer is not grounded in the given contexts.


# Output Format

Only output the final evaluation score or label. Do not reveal the reasoning steps or any intermediate thoughts.
The response should be a valid JSON object with at least the following fields: "output".
The output format for the value should be a string that is one of the following classes: grounded, ungrounded.


# Examples

**Input**
Context: Paris is the capital and the largest city in France.
Answer: The capital of France is Paris.

**Internal Reasoning**
The Answer is consistent with the Context. Paris is the capital of France.

**Output**
{
  "output": "grounded"
}

**Input**
Context: The Denver Nuggets defeated the Miami Heat in five games, winning the NBA championship in 2023.
Answer: Joel Embiid was voted MVP of the NBA in 2023.

**Internal Reasoning**
The Answer is not consistent with the Context. The Context does not state any information about Joel Embiid being MVP of the NBA in 2023.

**Output**
{
  "output": "ungrounded"
}


# Notes

- Always aim to provide a fair and balanced assessment.
- Consider both explicit statements and implicit tone.
- Consistency in labeling similar messages is crucial.
- Ensure the reasoning clearly justifies the assigned label based on the steps taken.


Context: {context}
Answer: {output}
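Because this prompt requires a JSON object with an "output" field, a caller might parse and validate the judge's reply as sketched below (an assumption about the caller's side, not an official client):

```python
import json

# Sketch of parsing the groundedness judge's reply: the prompt asks for a
# JSON object whose "output" field is "grounded" or "ungrounded".
def parse_groundedness(raw: str) -> str:
    label = json.loads(raw)["output"]
    if label not in ("grounded", "ungrounded"):
        raise ValueError(f"unexpected label: {label!r}")
    return label
```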
llm_answer_refusal
  • Description: Classify whether the response from a QA system refused to answer the question.

  • Type: Classifier

    • Classes: refused, not_refused

  • Evaluation Prompt:

llm_answer_relevancy
  • Description: Given a Question and an Answer, determine if the Answer is relevant to the Question.

  • Type: Classifier

    • Classes: relevant, irrelevant

  • Evaluation Prompt:

llm_context_relevancy
  • Description: Context relevancy is evaluated based on the relevance of the provided list of Contexts to the user's Query.

  • Type: Classifier

    • Classes: relevant, irrelevant

  • Evaluation Prompt:

llm_question_clarity
  • Description: Assess the clarity of the user's question on a scale of 1 to 5.

  • Type: Scorer

    • Range: [1, 2, 3, 4, 5]

  • Evaluation Prompt:

llm_text_frustration
  • Description: Assess the level of frustration in the input on a scale of 1 to 5.

  • Type: Scorer

    • Range: [1, 2, 3, 4, 5]

  • Evaluation Prompt:

llm_text_sentiment
  • Description: Determine whether the tone of the message is negative, neutral, or positive based on the content and context of the message provided.

  • Type: Classifier

    • Classes: negative, neutral, positive

  • Evaluation Prompt:

llm_text_similarity
  • Description: Text similarity is evaluated on the degree of syntactic and semantic similarity of the provided Output to the provided Target.

  • Type: Scorer

    • Range: [1, 2, 3, 4, 5]

  • Evaluation Prompt:

llm_text_toxicity
  • Description: Text toxicity evaluates how concerning or potentially harmful the text is from a safety perspective.

  • Type: Scorer

    • Range: [1, 2, 3, 4, 5]

  • Evaluation Prompt:
