Metrics

Codify signals to track behavior that matters

A Metric is a mapping from Columns to meaningful numeric or categorical values representing cost, quality, performance, or other behavioral characteristics. Metrics are computed for every ingested log or trace as part of the DBNL Data Pipeline and appear in the Logs view, Explorer pages, and Metrics Dashboard.

DBNL comes with many built-in metrics and templates that can be customized. Fundamentally, Metrics are one of two types:

  • LLM-as-Judge Metrics: Evals and judges that require an LLM to compute a score or classification based on a prompt.

  • Standard Metrics: Functions that can be computed using non-LLM methods like traditional Natural Language Processing (NLP) metrics, statistical operations, and other common mapping functions.

Default Metrics

Every Project contains the following metrics by default, computed using the required input and output fields of the DBNL Semantic Convention and the default Model Connection for the Project:

  • answer_relevancy: Determines whether the output is relevant to the input. See template.

  • user_frustration: Assesses the level of user frustration expressed in the input based on tone, word choice, and other properties. See template.

  • topic: Classifies the conversation into a topic based on the input and output. This Metric is created after topics are automatically generated from the first 7 days of ingested data. Topics can be manually adjusted by editing the template.

  • conversation_summary (immutable): A summary of the input and output, used as part of topic generation.

  • summary_embedding (immutable): An embedding of the conversation_summary, used as part of topic generation.

Creating a Metric

Metrics can be created by clicking on the "+ Create New Metric" button on the Metrics page.

When to Create a Metric

Create custom metrics when you need to:

  • Track specific business KPIs: Cost per conversation, resolution rate, escalation frequency

  • Monitor quality signals: Response accuracy, hallucination detection, safety violations

  • Measure performance: Response time, token efficiency, context utilization

  • Validate against requirements: Brand tone compliance, length constraints, format adherence

  • Debug recurring issues: Track patterns identified in Insights or Logs exploration

Good metrics are:

  • Actionable: The metric should inform decisions or trigger alerts

  • Measurable: Clear numeric or categorical output for every log

  • Relevant: Tied to product quality, user experience, or business outcomes

  • Consistent: Produces reliable results across similar inputs

Start with DBNL's default metrics and templates. Only create custom metrics after you've identified specific signals through the Explorer or Insights that aren't covered by existing metrics.

When to Use Standard vs LLM-as-Judge Metrics

  • Use Standard Metrics when: You need fast, deterministic calculations (word counts, text length, keyword matching, readability scores)

  • Use LLM-as-Judge Metrics when: You need semantic understanding (relevance, tone, quality, groundedness)

Standard Metrics are faster and cheaper to compute, so prefer them when possible.

LLM-as-Judge Metrics

LLM-as-Judge Metrics can be customized from the built-in LLM-as-Judge Metric Templates. Each of these Metrics is one of two types:

  • Classifier Metric: Outputs a categorical value from a predefined set of classes. Example: llm_answer_groundedness.

  • Scorer Metric: Outputs an integer score from 1 to 5. Example: llm_text_frustration.
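
As a rough illustration of how a Scorer Metric works, the sketch below fills a judge prompt with the {input} and {output} columns and parses an integer score from the model's reply. This is conceptual only: DBNL runs judge Metrics through the Project's Model Connection, and call_llm here is a hypothetical placeholder rather than a real DBNL API.

import re

SCORER_PROMPT = """Rate the level of frustration expressed in the user message
on a scale of 1 (calm) to 5 (very frustrated). Reply with a single integer.

User message: {input}
Assistant reply: {output}"""

def call_llm(prompt: str) -> str:
    # Hypothetical placeholder for whatever model your Model Connection points at.
    raise NotImplementedError

def score_frustration(input_text: str, output_text: str) -> int:
    # Fill the column placeholders, ask the judge model, and parse a 1-5 score.
    prompt = SCORER_PROMPT.format(input=input_text, output=output_text)
    reply = call_llm(prompt)
    match = re.search(r"[1-5]", reply)
    if match is None:
        raise ValueError(f"Judge did not return a score: {reply!r}")
    return int(match.group())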

Standard Metrics

Standard Metrics are functions that can be computed using non-LLM methods. They can be built using the Functions available in the DBNL Query Language.

Creating Standard Metrics

Standard Metrics use query language expressions to compute values from your log columns. Here are common examples:

Example 1: Calculate Response Length

Track the word count of AI responses:

  • Metric Name: response_word_count

  • Type: Standard Metric

  • Formula: word_count(output)
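
If it helps to see what the formula computes, a rough Python equivalent is shown below; DBNL's actual word_count function may tokenize text differently.

def response_word_count(output: str) -> int:
    # Roughly what word_count(output) does: split on whitespace and count tokens.
    return len(output.split())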

Example 2: Detect Refusal Keywords

Identify when the AI refuses to answer:

  • Metric Name: contains_refusal

  • Type: Standard Metric

  • Formula: contains(lower(output), 'sorry') or contains(lower(output), 'cannot') or contains(lower(output), 'unable')

Example 3: Calculate Input Complexity

Measure how complex user prompts are:

  • Metric Name: input_reading_level

  • Type: Standard Metric

  • Formula: flesch_kincaid_grade(input)
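
The Flesch-Kincaid grade level is a standard readability score, roughly 0.39 * (words per sentence) + 11.8 * (syllables per word) - 15.59. The sketch below is a naive Python approximation for intuition only; DBNL's flesch_kincaid_grade function may count sentences and syllables differently.

import re

def rough_flesch_kincaid_grade(text: str) -> float:
    # Naive approximation of the Flesch-Kincaid grade-level formula.
    # Syllables are estimated by counting vowel groups, which is only a heuristic.
    sentences = max(len(re.findall(r"[.!?]+", text)), 1)
    words = text.split() or [""]
    syllables = sum(max(len(re.findall(r"[aeiouy]+", w.lower())), 1) for w in words)
    return 0.39 * (len(words) / sentences) + 11.8 * (syllables / len(words)) - 15.59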

Example 4: Detect Question Marks

Check if input is a question:

  • Metric Name: is_question

  • Type: Standard Metric

  • Formula: contains(input, '?')

Example 5: Compare String Similarity

Measure how similar input and output are (useful for detecting parroting):

  • Metric Name: input_output_similarity

  • Type: Standard Metric

  • Formula: 1.0 - (levenshtein(input, output) / max(len(input), len(output)))
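
For intuition, here is a Python sketch of the same normalized edit-distance calculation, assuming levenshtein and len in the Query Language behave like their usual string counterparts:

def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance between two strings.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def input_output_similarity(input_text: str, output_text: str) -> float:
    # 1.0 means the strings are identical; values near 0.0 mean very different.
    longest = max(len(input_text), len(output_text))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(input_text, output_text) / longest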

Troubleshooting Metrics

Metric Not Appearing in Logs or Dashboard

Possible causes:

  • The metric was created after logs were ingested; metrics are only computed for data ingested after the metric is created

  • The pipeline run failed during the Enrich step - check the Status page

  • The metric references a column that doesn't exist in your data

Solution: Check Status page for errors, verify column names, and wait for the next pipeline run.

LLM-as-Judge Metric Returns Unexpected Values

Possible causes:

  • The Model Connection is using a different model than expected

  • The evaluation prompt is ambiguous or unclear

  • The column placeholders (e.g., {input}, {output}) are incorrect

Solution: Test your Model Connection using the "Validate" button, review example logs to check if columns have expected values, and refine the evaluation prompt for clarity.

Standard Metric Formula Errors

Common errors:

# Error: Column doesn't exist
word_count(ouput)  # Typo - should be 'output'

# Error: Wrong function name
wordcount(output)  # Should be 'word_count'

# Error: Type mismatch
word_count(total_token_count)  # Can't count words in a number

# Error: Division by zero
divide(output_tokens, input_tokens)  # Fails if input_tokens is 0

Solution: Use the Query Language Functions reference to verify syntax, check column names match your data exactly, and add null/zero checks with conditionals.
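
The division-by-zero case above is the most common in practice. The guard itself must be written with the conditionals available in the DBNL Query Language (see the Functions reference); as a language-agnostic illustration of the null/zero-check pattern, here is the same idea in Python:

def safe_token_ratio(output_tokens, input_tokens):
    # Return None instead of dividing when the denominator is missing or zero.
    if input_tokens is None or output_tokens is None or input_tokens == 0:
        return None
    return output_tokens / input_tokens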

Metric Computation is Slow

Possible causes:

  • LLM-as-Judge metrics are inherently slower (require Model Connection calls for each log)

  • Your Model Connection has high latency or rate limits

  • Large log volume

Solution: Use Standard Metrics where possible, consider a faster Model Connection (like local NVIDIA NIM), or increase pipeline timeout settings.

Metric Values Are All Null

Possible causes:

  • Required columns are missing from your logs

  • Formula syntax error causing computation to fail silently

  • Model Connection is unreachable or returning errors

Solution: Check logs to verify required columns exist, test formula on a small subset, validate Model Connection, and check Status page for pipeline errors.

Need more help? Contact [email protected] or visit distributional.com/contact. Include your metric definition and any error messages from the Status page.
