LogoLogo
AboutBlogLaunch app ↗
v0.20.x
v0.20.x
  • Introduction to AI Testing
  • Welcome to Distributional
  • Motivation
  • What is AI Testing?
  • Stages in the AI Software Development Lifecycle
    • Components of AI Testing
  • Distributional Testing
  • Getting Access to Distributional
  • Learning about Distributional
    • The Distributional Framework
    • Defining Tests in Distributional
      • Automated Production test creation & execution
      • Knowledge-based test creation
      • Comprehensive testing with Distributional
    • Reviewing Test Sessions and Runs in Distributional
      • Reviewing and recalibrating automated Production tests
      • Insights surfaced elsewhere on Distributional
      • Notifications
    • Data in Distributional
      • The flow of data
      • Components and the DAG for root cause analysis
      • Uploading data to Distributional
      • Living in your VPC
  • Using Distributional
    • Getting Started
    • Access
      • Organization and Namespaces
      • Users and Permissions
      • Tokens
    • Data
      • Data Objects
      • Run-Level Data
      • Data Storage Integrations
      • Data Access Controls
    • Testing
      • Creating Tests
        • Test Page
        • Test Drawer Through Shortcuts
        • Test Templates
        • SDK
      • Defining Assertions
      • Production Testing
        • Auto-Test Generation
        • Recalibration
        • Notable Results
        • Dynamic Baseline
      • Testing Strategies
        • Test That a Given Distribution Has Certain Properties
        • Test That Distributions Have the Same Statistics
        • Test That Columns Are Similarly Distributed
        • Test That Specific Results Have Matching Behavior
        • Test That Distributions Are Not the Same
      • Executing Tests
        • Manually Running Tests Via UI
        • Executing Tests Via SDK
      • Reviewing Tests
      • Using Filters
        • Filters in the Compare Page
        • Filters in Tests
    • Python SDK
      • Quick Start
      • Functions
        • login
        • Project
          • create_project
          • copy_project
          • export_project_as_json
          • get_project
          • get_or_create_project
          • import_project_from_json
        • Run Config
          • create_run_config
          • get_latest_run_config
          • get_run_config
          • get_run_config_from_latest_run
        • Run Results
          • get_column_results
          • get_scalar_results
          • get_results
          • report_column_results
          • report_scalar_results
          • report_results
        • Run
          • close_run
          • create_run
          • get_run
          • report_run_with_results
        • Baseline
          • create_run_query
          • get_run_query
          • set_run_as_baseline
          • set_run_query_as_baseline
        • Test Session
          • create_test_session
      • Objects
        • Project
        • RunConfig
        • Run
        • RunQuery
        • TestSession
        • TestRecalibrationSession
        • TestGenerationSession
        • ResultData
      • Experimental Functions
        • create_test
        • get_tests
        • get_test_sessions
        • wait_for_test_session
        • get_or_create_tag
        • prepare_incomplete_test_spec_payload
        • create_test_recalibration_session
        • wait_for_test_recalibration_session
        • create_test_generation_session
        • wait_for_test_generation_session
      • Eval Module
        • Quick Start
        • Application Metric Sets
        • How-To / FAQ
        • LLM-as-judge and Embedding Metrics
        • RAG / Question Answer Example
        • Eval Module Functions
          • Index of functions
          • eval
          • eval.metrics
    • Notifications
    • Release Notes
  • Tutorials
    • Instructions
    • Hello World (Sentiment Classifier)
    • Trading Strategy
    • LLM Text Summarization
      • Setting the Scene
      • Prompt Engineering
      • Integration testing for text summarization
      • Practical considerations
Powered by GitBook

© 2025 Distributional, Inc. All Rights Reserved.

On this page

Was this helpful?

Export as PDF
  1. Stages in the AI Software Development Lifecycle

Components of AI Testing

Essential Components of AI Testing

An AI Testing strategy can and often includes popular reliability modalities such as model evaluation and traditional monitoring of summary statistics.

  • Evaluation frameworks are an important tool in the development phase of AI application and can provide insights such as “based on X metric, model A is better than model B on this test set.” In development, the primary role of evaluations is to ensure that the application is performing at an acceptable level. After the app is deployed to production, online evaluations can be performed on production data to check how key performance metrics vary over time. Evaluation frameworks provide a set of off-the shelf metrics that you can use to perform evaluations, however many people choose to write custom evaluations related to business metrics or other objectives specific to their use case.

  • Observability/monitoring tools log and can respond to events in production. By nature, observability tools catch issues after the fact, treating users as tests, and often miss subtle degradations. Most observability tools look at summary statistics (e.g. P90, P50, accuracy, F1-score, BLEU, ROUGE) over time, but do not provide insights on the underlying distributions of those metrics, and across time.

In an AI Testing framework, you can use eval metrics as the basis for tests by setting thresholds and an alert when that test, or some set of tests fails. With Distributional in particular, you can bring your own custom evals, eval metrics from an open source library, or use the off-the-shelf eval metrics provided by the platform.

Tests can also evaluate more abstract concepts such as whether or not your app looks similar to the day (or week) before.

PreviousStages in the AI Software Development LifecycleNextDistributional Testing

Was this helpful?