# Metrics

A Metric is a mapping from [Columns](https://docs.dbnl.com/configuration/data-pipeline#columns) to meaningful numeric or categorical values representing cost, quality, performance, or other behavioral characteristics. Metrics are computed for every ingested log or trace as part of the [DBNL Data Pipeline](https://docs.dbnl.com/configuration/data-pipeline) and appear in the [Logs](https://docs.dbnl.com/workflow/logs) view, [Explorer](https://docs.dbnl.com/workflow/explorer) pages, and [Metrics Dashboard](https://docs.dbnl.com/dashboards#metrics-dashboard).

DBNL comes with many built-in Metrics and customizable templates. Fundamentally, every Metric is one of two types:

* [**LLM-as-judge Metrics**](#llm-as-judge-metrics): Evals and judges that require an LLM to compute a score or classification based on a prompt.
* [**Standard Metrics**](#standard-metrics): Functions that can be computed using non-LLM methods like traditional Natural Language Processing (NLP) metrics, statistical operations, and other common mapping [functions](https://docs.dbnl.com/reference/query-language/functions).

### Default Metrics

Every product contains the following metrics by default, computed using the required `input` and `output` fields of the [DBNL Semantic Convention](https://docs.dbnl.com/configuration/dbnl-semantic-convention) and the default [Model Connection](https://docs.dbnl.com/configuration/model-connections) for the [Project](https://docs.dbnl.com/workflow/projects):

* `answer_relevancy`: Determines whether the `output` is relevant to the `input`. See [template](https://docs.dbnl.com/workflow/llm-as-judge-metric-templates#llm_answer_relevancy).
* `user_frustration`: Assesses the level of frustration expressed in the `input` based on tone, word choice, and other properties. See [template](https://docs.dbnl.com/workflow/llm-as-judge-metric-templates#llm_text_frustration).
* `topic`: Classifies the conversation into a topic based on the `input` and `output`. This Metric is created after topics are automatically generated from the first 7 days of ingested data. Topics can be manually adjusted by editing the [template](https://docs.dbnl.com/workflow/llm-as-judge-metric-templates#topic).
* `conversation_summary` (immutable): A summary of the `input` and `output`, used as part of `topic` generation.
* `summary_embedding` (immutable): An embedding of the `conversation_summary`, used as part of `topic` generation.

### Creating a Metric

Metrics can be created by clicking on the "+ Create New Metric" button on the Metrics page.

<figure><img src="https://content.gitbook.com/content/yx9NXaWRjaOtW8ILLJQO/blobs/vo75JVsQ3Sbut3cTmMWX/image.png" alt=""><figcaption></figcaption></figure>

### When to Create a Metric

Create custom metrics when you need to:

* **Track specific business KPIs**: Cost per conversation, resolution rate, escalation frequency
* **Monitor quality signals**: Response accuracy, hallucination detection, safety violations
* **Measure performance**: Response time, token efficiency, context utilization
* **Validate against requirements**: Brand tone compliance, length constraints, format adherence
* **Debug recurring issues**: Track patterns identified in Insights or Logs exploration

**Good metrics are:**

* **Actionable**: The metric should inform decisions or trigger alerts
* **Measurable**: Clear numeric or categorical output for every log
* **Relevant**: Tied to product quality, user experience, or business outcomes
* **Consistent**: Produces reliable results across similar inputs

{% hint style="info" %}
Start with DBNL's default metrics and templates. Only create custom metrics after you've identified specific signals through the [Explorer](https://docs.dbnl.com/workflow/explorer) or [Insights](https://docs.dbnl.com/workflow/insights) that aren't covered by existing metrics.
{% endhint %}

### When to Use Standard vs LLM-as-Judge Metrics

* **Use Standard Metrics when:** You need fast, deterministic calculations (word counts, text length, keyword matching, readability scores)
* **Use LLM-as-Judge Metrics when:** You need semantic understanding (relevance, tone, quality, groundedness)

Standard Metrics are faster and cheaper to compute, so prefer them when possible.

### LLM-as-Judge Metrics

LLM-as-Judge Metrics can be customized from the built-in [LLM-as-Judge Metric Templates](https://docs.dbnl.com/workflow/metrics/llm-as-judge-metric-templates). Each of these Metrics is one of two types:

* **Classifier Metric**: Outputs a categorical value equal to one of a predefined set of classes. Example: [`llm_answer_groundedness`](https://docs.dbnl.com/workflow/llm-as-judge-metric-templates#llm_answer_groundedness).
* **Scorer Metric**: Outputs an integer score from 1 to 5. Example: [`llm_text_frustration`](https://docs.dbnl.com/workflow/llm-as-judge-metric-templates#llm_text_frustration).

### Standard Metrics

Standard Metrics are functions that can be computed using non-LLM methods. They can be built using the [Functions](https://docs.dbnl.com/reference/query-language/functions) available in the [DBNL Query Language](https://docs.dbnl.com/reference/query-language).

#### Creating Standard Metrics

Standard Metrics use query language expressions to compute values from your log columns. Here are common examples:

<details>

<summary>Example 1: Calculate Response Length</summary>

Track the word count of AI responses:

* **Metric Name:** `response_word_count`
* **Type:** Standard Metric
* **Formula:** `word_count({RUN}.output)`

</details>

<details>

<summary>Example 2: Detect Refusal Keywords</summary>

Identify when the AI refuses to answer:

* **Metric Name:** `contains_refusal`
* **Type:** Standard Metric
* **Formula:** `or(or(contains(lower({RUN}.output), "sorry"), contains(lower({RUN}.output), "cannot")), contains(lower({RUN}.output), "unable"))`

</details>
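The nested `or`/`contains` calls in Example 2 amount to checking whether any keyword from a list appears in the lowercased output. As a rough Python sketch of the same logic (the `output` strings below are hypothetical examples, not DBNL syntax):

```python
# Rough Python equivalent of the nested or/contains formula in Example 2.
REFUSAL_KEYWORDS = ("sorry", "cannot", "unable")

def contains_refusal(output: str) -> bool:
    # Lowercase once, then check each keyword, mirroring
    # contains(lower({RUN}.output), "...") for each term.
    text = output.lower()
    return any(keyword in text for keyword in REFUSAL_KEYWORDS)
```

Keeping the keyword list in one place makes it easy to extend the check as you discover new refusal phrasings in your logs.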

<details>

<summary>Example 3: Calculate Input Complexity</summary>

Measure how complex user prompts are:

* **Metric Name:** `input_reading_level`
* **Type:** Standard Metric
* **Formula:** `flesch_kincaid_grade({RUN}.input)`

</details>
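`flesch_kincaid_grade` in Example 3 is based on the standard Flesch-Kincaid grade-level formula: `0.39 * (words/sentences) + 11.8 * (syllables/words) - 15.59`. The sketch below illustrates what such a score computes; the syllable counter is a crude vowel-group heuristic, and DBNL's implementation may differ:

```python
import re

def count_syllables(word: str) -> int:
    # Crude heuristic: count runs of consecutive vowels, minimum 1.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_kincaid_grade(text: str) -> float:
    # Standard Flesch-Kincaid grade-level formula:
    # 0.39 * (words/sentences) + 11.8 * (syllables/words) - 15.59
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return 0.0
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * (len(words) / len(sentences))
            + 11.8 * (syllables / len(words))
            - 15.59)
```

Short, common words yield a low (even negative) grade, while long multi-syllable words drive it up.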

<details>

<summary>Example 4: Detect Question Marks</summary>

Check if input is a question:

* **Metric Name:** `is_question`
* **Type:** Standard Metric
* **Formula:** `contains({RUN}.input, "?")`

</details>

<details>

<summary>Example 5: Compare String Similarity</summary>

Measure how similar input and output are (useful for detecting parroting):

* **Metric Name:** `input_output_similarity`
* **Type:** Standard Metric
* **Formula:** `subtract(1.0, divide(levenshtein({RUN}.input, {RUN}.output), max(len({RUN}.input), len({RUN}.output))))`

</details>
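The Example 5 formula maps edit distance to a similarity in `[0, 1]`: identical strings score 1.0 and completely different strings score near 0. As a Python sketch of the same computation (using a textbook dynamic-programming Levenshtein distance, not a DBNL API):

```python
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance, row by row.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution
            ))
        prev = curr
    return prev[-1]

def input_output_similarity(inp: str, out: str) -> float:
    # Mirrors: subtract(1.0, divide(levenshtein(...), max(len(...), len(...))))
    longest = max(len(inp), len(out))
    if longest == 0:
        return 1.0  # both strings empty: treat as identical
    return 1.0 - levenshtein(inp, out) / longest
```

Dividing by the longer string's length normalizes the distance, so the score is comparable across logs with very different text lengths.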

### Troubleshooting Metrics

<details>

<summary>Metric Not Appearing in Logs or Dashboard</summary>

**Possible causes:**

* The metric was created after logs were ingested; metrics are only computed for data ingested after creation
* The pipeline run failed during the Enrich step - check the [Status page](https://docs.dbnl.com/workflow/status)
* The metric references a column that doesn't exist in your data

**Solution**: Check Status page for errors, verify column names, and wait for the next pipeline run.

</details>

<details>

<summary>LLM-as-Judge Metric Returns Unexpected Values</summary>

**Possible causes:**

* The Model Connection is using a different model than expected
* The evaluation prompt is ambiguous or unclear
* The column placeholders (e.g., `{input}`, `{output}`) are incorrect

**Solution**: Test your Model Connection using the "Validate" button, review example logs to check if columns have expected values, and refine the evaluation prompt for clarity.

</details>

<details>

<summary>Standard Metric Formula Errors</summary>

**Common errors:**

```
# Error: Column doesn't exist
word_count({RUN}.ouput)  # Typo: should be 'output'

# Error: Wrong function name
wordcount({RUN}.output)  # Should be 'word_count'

# Error: Type mismatch
word_count({RUN}.total_token_count)  # Can't count words in a number

# Error: Division by zero
divide({RUN}.output_tokens, {RUN}.input_tokens)  # Fails if input_tokens is 0
```

**Solution**: Use the [Query Language Functions](https://docs.dbnl.com/reference/query-language/functions) reference to verify syntax, check column names match your data exactly, and add null/zero checks with conditionals.

</details>

<details>

<summary>Metric Computation is Slow</summary>

**Possible causes:**

* LLM-as-Judge metrics are inherently slower (require Model Connection calls for each log)
* Your Model Connection has high latency or rate limits
* Large log volume

**Solution**: Use Standard Metrics where possible, consider a faster Model Connection (like local NVIDIA NIM), or increase pipeline timeout settings.

</details>

<details>

<summary>Metric Values Are All Null</summary>

**Possible causes:**

* Required columns are missing from your logs
* Formula syntax error causing computation to fail silently
* Model Connection is unreachable or returning errors

**Solution**: Check logs to verify required columns exist, test formula on a small subset, validate Model Connection, and check Status page for pipeline errors.

</details>

{% hint style="info" %}
**Need more help?** Contact <support@distributional.com> or visit [distributional.com/contact](https://distributional.com/contact). Include your metric definition and any error messages from the Status page.
{% endhint %}
