LLM-as-Judge Metric Templates

Pre-built templates to customize LLM-as-judge Metrics

Custom Metric Templates

Templates for creating entirely new LLM-as-Judge Metrics:

Custom Classifier Metric
  • Evaluation Prompt:

You are a classifier that classifies the given input according to predefined labels. Carefully read the reasoning for each label, then assign exactly one. Do not include any explanation or extra text.

## Input to be classified:
{your_column_name_here}

## Possible Labels:
<your_label_here>: <your reasoning here>
<your_label_here>: <your reasoning here>
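As a minimal sketch, the template above can be rendered into a judge prompt before the LLM call. The helper name and the label set below are hypothetical examples for illustration, not part of any product API:

```python
# Minimal sketch: render the Custom Classifier template for a judge call.
# The template text comes from the docs above; render_classifier_prompt
# and the example labels are hypothetical, not a documented API.

CLASSIFIER_TEMPLATE = """\
You are a classifier that classifies the given input according to \
predefined labels. Carefully read the reasoning for each label, then \
assign exactly one. Do not include any explanation or extra text.

## Input to be classified:
{input_text}

## Possible Labels:
{labels}"""


def render_classifier_prompt(input_text: str, labels: dict) -> str:
    """Fill the template with the row value and the label definitions."""
    label_lines = "\n".join(
        f"{label}: {reasoning}" for label, reasoning in labels.items()
    )
    return CLASSIFIER_TEMPLATE.format(input_text=input_text, labels=label_lines)


prompt = render_classifier_prompt(
    "Please reset my password.",
    {
        "account_support": "The user asks for help with account access.",
        "other": "Anything not covered by the labels above.",
    },
)
print(prompt)
```

Each label line pairs the class name with the reasoning the judge should apply, exactly as in the template's `## Possible Labels` section.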
Custom Scorer Metric
  • Evaluation Prompt:

You are an evaluator that assigns a score to the given input, based on the reasoning defined below.

## Input to be scored:
{your_column_name_here}

## How to score:
<your reasoning here, make sure it only returns a score from [1, 2, 3, 4, 5]>
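Since the template constrains the judge to the score set [1, 2, 3, 4, 5], the reply should be validated before use. A minimal sketch, assuming the judge returns the bare number as instructed (the `parse_score` helper is a hypothetical example; real replies may need more defensive handling):

```python
# Minimal sketch: validate a scorer judge's reply against the allowed
# score set [1, 2, 3, 4, 5]. parse_score is a hypothetical helper,
# not part of any documented API.

VALID_SCORES = {1, 2, 3, 4, 5}


def parse_score(raw_reply: str) -> int:
    """Extract the score and verify it is one of the allowed values."""
    score = int(raw_reply.strip())
    if score not in VALID_SCORES:
        raise ValueError(f"judge returned out-of-range score: {score}")
    return score


result = parse_score(" 4 ")  # a well-formed reply with stray whitespace
```

Rejecting out-of-range values early keeps downstream aggregations (averages, distributions) honest when the judge drifts from the instructed format.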

Default Metric Templates

Built-in LLM-as-Judge Metrics that can be customized by the user:

topic
  • Description: Classifies the conversation into a topic based on the input and output. This Metric is created after topics are automatically generated from the first 7 days of ingested data.

  • Type: classify

  • Classes: Topics are automatically generated based on your data

When to Use:

  • You need to categorize conversations by subject matter for reporting or routing

  • You want to understand the distribution of topics users are asking about

  • You need to track trends in specific subject areas over time

  • You want to segment analysis by conversation topic

Required Columns: input, output

  • Evaluation Prompt:

The following is a conversation between an AI assistant and a user:

<messages>
<message>user: {input}</message>
<message>assistant: {output}</message>
</messages>

# Task

Your job is to classify the conversation into one of the following topics.
Use both user and assistant messages in your decision.
Carefully consider each topic and choose the most appropriate one.
If you do not think the conversation is about any of the named topics, classify it as "other".

# List of topics

- topic1
- topic2
- topic3
llm_answer_groundedness
  • Description: Judge if the answer is adhering to the context

  • Type: classify

  • Inputs:

    • answer

    • context

  • Classes: grounded, ungrounded

  • Prompt:

You are an expert evaluator of text properties and characteristics.
Your task is to grade or label the input text or texts based on the provided definition, a detailed set of steps, and a grading rubric.
You must use the grading rubric to assign a score or label.

# Definition

Given a list of Contexts and an Answer, groundedness refers to the Answer being consistent with the Contexts.
The Answer either contains information that is supported by the Contexts or assumes information that is available in the Contexts.

Use a step-by-step thinking process to ensure high-quality consideration of the grading criteria before reaching the conclusion.

# Steps

1. Analyze the content of the Answer and the Contexts.
2. Determine if the Answer contains false information or makes assumptions not supported by the Contexts.
3. Categorize the alignment of the Answer with the Contexts as one of the following grades: grounded if the Answer is consistent with the Contexts, ungrounded otherwise.


# Grading Criteria

- grounded: The Answer is grounded in the given contexts.
- ungrounded: The Answer is not grounded in the given contexts.


# Output Format

Only output the final evaluation score or label. Do not reveal the reasoning steps or any intermediate thoughts.
The response should be a valid JSON object with at least the following fields: "output".
The output format for the value should be a string that is one of the following classes: grounded, ungrounded.


# Examples

**Input**
Context: Paris is the capital and the largest city in France.
Answer: The capital of France is Paris.

**Internal Reasoning**
The Answer is consistent with the Context. Paris is the capital of France.

**Output**
{
  "output": "grounded"
}

**Input**
Context: The Denver Nuggets defeated the Miami Heat in five games, winning the NBA championship in 2023.
Answer: Joel Embiid was voted MVP of the NBA in 2023.

**Internal Reasoning**
The Answer is not consistent with the Context. The Context does not state any information about Joel Embiid being MVP of the NBA in 2023.

**Output**
{
  "output": "ungrounded"
}


# Notes

- Always aim to provide a fair and balanced assessment.
- Consider both explicit statements and implicit tone.
- Consistency in labeling similar messages is crucial.
- Ensure the reasoning clearly justifies the assigned label based on the steps taken.


Context: {context}
Answer: {output}
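Per the prompt's Output Format section, the groundedness judge replies with a JSON object whose "output" field holds one of the two classes. A minimal parsing sketch (the `parse_groundedness` helper is a hypothetical example, not part of any product API):

```python
import json

# Minimal sketch: parse the groundedness judge's JSON reply as described
# in the prompt's Output Format section. parse_groundedness is a
# hypothetical helper, not a documented API.

ALLOWED_CLASSES = {"grounded", "ungrounded"}


def parse_groundedness(raw_reply: str) -> str:
    """Read the required "output" field and check it is a known class."""
    payload = json.loads(raw_reply)
    label = payload["output"]
    if label not in ALLOWED_CLASSES:
        raise ValueError(f"unexpected class: {label}")
    return label


label = parse_groundedness('{"output": "grounded"}')
```

The same pattern applies to the other classify-type metrics below; only the allowed class set changes.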
llm_answer_refusal
  • Description: Judge if the answer is a refusal to answer the question

  • Type: classify

  • Inputs:

    • answer

  • Classes: refused, not_refused

  • Prompt:

llm_answer_relevancy
  • Description: Judge if the answer is relevant to the question

  • Type: classify

  • Inputs:

    • question

    • answer

  • Classes: relevant, irrelevant

  • Prompt:

llm_context_relevancy
  • Description: Judge if the contexts are relevant to the question

  • Type: classify

  • Inputs:

    • question

    • context

  • Classes: relevant, irrelevant

  • Prompt:

llm_question_clarity
  • Description: Judge if the question is clear

  • Type: score

  • Inputs:

    • question

  • Prompt:

llm_summarization
  • Description: Summarize the input and output of a conversational system.

  • Type: text

  • Inputs:

    • input

    • output

  • Prompt:

llm_text_frustration
  • Description: Judge the frustration of text (default to input) on a scale of 1 to 5.

  • Type: score

  • Inputs:

    • text

  • Prompt:

llm_text_sentiment
  • Description: Judge the sentiment of a text as positive, negative, or neutral.

  • Type: classify

  • Inputs:

    • text

  • Classes: negative, neutral, positive

  • Prompt:

llm_text_similarity
  • Description: Judge the similarity of an output on a scale of 1 to 5, as compared to a reference.

  • Type: score

  • Inputs:

    • output

    • reference

  • Prompt:

llm_text_toxicity
  • Description: Judge the toxicity of a text on a scale of 1 to 5.

  • Type: score

  • Inputs:

    • text

  • Prompt:
