# Predicting Credit Worthiness Using Tabular Data

The data files required for this tutorial are available in the following file.

{% file src="/files/CnojRjnyXhRaZTPLvjKg" %}
Credit Worthiness Tutorial files
{% endfile %}

### Credit Worthiness Introduction

A bank customer applies for a line of credit; to assess the creditworthiness of the customer, the bank retrieves data from its own warehouse and several third-party API endpoints. This data is then utilized to predict the customer's creditworthiness.

### Motivation

Predicting creditworthiness accurately is a cornerstone of banking operations, enabling financial institutions to manage risk effectively and ensure customers are not overburdened with unmanageable credit. With the advent of data-driven decision making, banks can leverage vast amounts of data to make these predictions more effectively.

This tutorial demonstrates the use of dbnl on tabular data to ensure consistent predictions of creditworthiness through continuous testing of third-party endpoints and testing of a live production system. These steps are crucial because predicting creditworthiness involves integrating data from various sources, including a bank's own data warehouse and several third-party API endpoints. Furthermore, in a live environment, the system interacts with real-time data, introducing additional complexities.

## Defining the Credit Worthiness System

The creditworthiness prediction system incorporates several third-party endpoints. To prevent potential harm to consumers, it's crucial to regularly test these endpoints for expected behavior. In this tutorial, we conduct monthly tests on these third-party endpoints against a pre-established baseline.

To complete this tutorial, the run config, test payloads, and data (as stored in runs) are needed. All can be found using the following [link](broken://pages/MuOH2XN2DzsYD7hqZu0m). The Credit Worthiness system consists of 11 different components.

<figure><img src="https://lh7-us.googleusercontent.com/docsz/AD_4nXc2opPtVhja9H4kTi-a4RNyNvXkb62U1SmwxUAHg1PdyGRsnsAUUS7xbrHVYVlWE7VL9hzeI44Z9diRYRhiAMehYpyhx13N7djucxGRvq_c4tRxthmkjMMbHBdxMzTNWE-w0_K3jH0Etbe6g1q4qwVaWsbs?key=O9L8ng0orkd7yT50-H8c7w" alt=""><figcaption><p>The system used throughout the Credit Worthiness demo.</p></figcaption></figure>

<details>

<summary><strong>Component Definitions</strong></summary>

For each component, we are specifying the various columns and identifying the respective owners of the different components.

**Component: Credit\_request**

A bank customer submits a credit request that includes details about the intended purchase, its price, the repayment plan, and whether there's a co-signer for the line of credit.

**Columns:**

* **Purpose** - Purpose for credit
* **Guarantors** - Guarantors
* **Instalment\_per\_cent** - Installment %
* **Duration\_of\_Credit\_\_month\_** - Duration of credit (months)
* **Credit\_Amount** - Amount of Credit

***

**Component: Unique\_ID**

An identifier, used by the bank, facilitates the association of customer information with their internal records and enables data retrieval from third-party applications.

**Columns:**

* **SSN** - Unique ID

***

**Component: Data Warehouse**

The bank maintains a repository of customer-related information, which, for simplicity, is categorized into account information and personal information.

In the context of continuous testing, the actual creditworthiness classification for each customer, based on their request, is known.

**Columns:**

* **Target\_Worthiness** - True classes

***

**Component: Account Information**

Subset of the Data Warehouse containing information about the value of the assets in a customer account.

**Columns:**

* **Account\_Balance** - Account balance
* **Value\_Savings\_Stocks** - Savings/stock value

***

**Component: Personal Information**

Subset of the Data Warehouse containing personal information about a customer of the bank.

**Columns:**

* **Sex\_\_\_Marital\_Status** - Sex/Marital status
* **Most\_valuable\_available\_asset** - Most valuable available asset
* **Age\_\_years\_** - Age (years)
* **No\_of\_dependents** - Number of dependents

***

**Component: API:Credit\_History**

Third party API providing the information about credit history for a given bank customer.

**Columns:**

* **Payment\_Status\_of\_Previous\_Credit** - Payment Status
* **No\_of\_Credits\_at\_this\_Bank** - Number of credits at this Bank

***

**Component: API:Credit\_Report**

Third party API providing a FICO (credit) score for a given bank customer.

**Columns:**

* **FICO\_score** - FICO score

***

**Component: API:Employment\_Veri. (verification)**

Third party API providing employment verification for a given bank customer.

**Columns:**

* **Length\_of\_current\_employment** - Length of current employment
* **Foreign\_Worker** - Foreign worker
* **Occupation** - Occupation

***

**Component: API:Rental\_History**

Third party API providing information about the rental history of a given customer.

**Columns:**

* **Duration\_in\_Current\_address** - Duration in current address
* **Type\_of\_apartment** - Type of apartment

***

**Component: XGB:Classifier**

Output of the XGBoost classifier used to predict whether a line of credit should get approved for a given customer.

**Columns:**

* **Predicted\_Worthiness** - Predicted classes
* **Probability\_Bad** - Probability for class bad
* **Probability\_Good** - Probability for class good
* **Latency\_\_ms** - Latency for the model

***

**Component: Evaluation**

Run-level metrics coming from comparing the predicted credit worthiness class to the true credit worthiness class.

**Scalars:**

* **Model\_Accuracy** - Accuracy for the model
* **Model\_F1** - F1-score for the model
* **Model\_Precision** - Precision for the model
* **Model\_Recall** - Recall for the model

</details>

<details>

<summary>Example Run Configuration - find complete run configuration in the tutorial zip file</summary>

**Example Column Configuration**

```yaml
{
    "component": "Account_information",
    "description": "Account balance",
    "name": "Account_Balance",
    "type": "category"
},
{
    "component": "Account_information",
    "description": "Savings/stock value",
    "name": "Value_Savings_Stocks",
    "type": "category"
},
{
    "component": "Personal_information",
    "description": "Sex/Marital status",
    "name": "Sex___Marital_Status",
    "type": "category"
},
{
    "component": "Personal_information",
    "description": "Most valuable available asset",
    "name": "Most_valuable_available_asset",
    "type": "category"
}
```

**Example Scalar Configuration**

```json
{
    "component": "Evaluation",
    "description": "Accuracy for the model",
    "name": "Model_Accuracy",
    "type": "float"
},
{
    "component": "Evaluation",
    "description": "F1-score for the model",
    "name": "Model_F1",
    "type": "float"
},
{
    "component": "Evaluation",
    "description": "Precision for the model",
    "name": "Model_Precision",
    "type": "float"
},
{
    "component": "Evaluation",
    "description": "Recall for the model",
    "name": "Model_Recall",
    "type": "float"
}
```

**Example Component DAG**

```yaml
"API:Credit_History": [
    "XGB:Classifier"
],
"API:Credit_Report": [
    "XGB:Classifier"
],
"API:Employment_Veri.": [
    "XGB:Classifier"
],
"API:Rental_History": [
    "XGB:Classifier"
],
"Account_information": [
    "XGB:Classifier"
],
```

</details>
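As a rough illustration of how the component DAG excerpt can be read, the sketch below assumes that each key lists the components it feeds into. That reading is inferred from the excerpt and is not a confirmed dbnl convention. It uses Python's standard `graphlib` module to derive an order in which inputs come before the components that consume them. The `dag` dictionary contains only the entries shown in the excerpt, and `upstream_first` is a hypothetical helper name.

```python
from graphlib import TopologicalSorter

# Excerpt of the component DAG shown above. Assumption: each key feeds
# into the components listed as its value.
dag = {
    "API:Credit_History": ["XGB:Classifier"],
    "API:Credit_Report": ["XGB:Classifier"],
    "API:Employment_Veri.": ["XGB:Classifier"],
    "API:Rental_History": ["XGB:Classifier"],
    "Account_information": ["XGB:Classifier"],
}


def upstream_first(dag):
    """Return components ordered so that inputs precede their consumers."""
    sorter = TopologicalSorter()
    for parent, children in dag.items():
        sorter.add(parent)
        for child in children:
            # graphlib expects node -> predecessors, so invert the edge.
            sorter.add(child, parent)
    return list(sorter.static_order())


order = upstream_first(dag)
print(order)
```

Under this reading, the classifier is always placed after every component that supplies its inputs.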

## Creating dbnl Tests

To test the credit worthiness system, 7 different groupings of tests are used. These are denoted using the dbnl [test tagging](https://docs.dbnl.com/creating-tests-in-distributional/dbnl-testing-objects) strategy.

### Change in probability

This set of tests pertains to the distribution shape of the probability scores. Utilizing the scaled [Kolmogorov-Smirnov statistic](https://docs.dbnl.com/creating-tests-in-distributional/suggested-testing-strategies/tests-that-columns-are-similarly-distributed), the test is deemed unsuccessful if the difference between the distribution shapes of the baseline and experimental probabilities exceeds 0.5.

<figure><img src="https://lh7-us.googleusercontent.com/docsz/AD_4nXdbHjvWoMQl7a-W2auUouIIClXwI_j8GaHDJRnMNB5F1zKRKJfNDRb8J14YjV7lVTm-gpz4_FAsQsKIY-RInI--SX0-Dsho7idbx07ow1YueWhN27P4Uiuv3Z_E5fCuY3wFrQsksLTV9Tmr3Txh_SR2oLM?key=O9L8ng0orkd7yT50-H8c7w" alt=""><figcaption><p>Test configuration</p></figcaption></figure>

<details>

<summary>Example Test Payload - find full list of test payloads in the tutorial zip file</summary>

```yaml
{
    "name": "Gr.0: Non Parametric Difference: Probability_Bad",
    "description": "Testing the difference in shape for the probability run over run",
    "tag_names": [
        "ProbabilityChange"
    ],
    "assertion": {
        "name": "less_than",
        "params": {
            "other": 0.5
        }
    },
    "statistic_name": "scaled_ks_stat",
    "statistic_params": {},
    "statistic_inputs": [
        {
            "select_query_template": {
                "select": "{EXPERIMENT}.Probability_Bad"
            }
        },
        {
            "select_query_template": {
                "select": "{BASELINE}.Probability_Bad"
            }
        }
    ]
}
```

</details>
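To build intuition for what this test measures, here is a minimal sketch of the plain two-sample Kolmogorov-Smirnov statistic in pure Python. It is not dbnl's implementation: `scaled_ks_stat` applies a scaling that is not reproduced here, so the 0.5 threshold from the payload does not transfer directly to this unscaled value. The function name and the sample scores are hypothetical.

```python
from bisect import bisect_right


def ks_statistic(experiment, baseline):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between
    the two empirical cumulative distribution functions."""
    exp_sorted = sorted(experiment)
    base_sorted = sorted(baseline)
    largest_gap = 0.0
    for value in exp_sorted + base_sorted:
        # Fraction of each sample that is <= value.
        exp_cdf = bisect_right(exp_sorted, value) / len(exp_sorted)
        base_cdf = bisect_right(base_sorted, value) / len(base_sorted)
        largest_gap = max(largest_gap, abs(exp_cdf - base_cdf))
    return largest_gap


# Hypothetical probability scores, for illustration only.
baseline_scores = [0.10, 0.20, 0.30, 0.40, 0.50]
shifted_scores = [0.60, 0.70, 0.80, 0.90, 0.95]

identical = ks_statistic(baseline_scores, baseline_scores)
disjoint = ks_statistic(shifted_scores, baseline_scores)
print(identical, disjoint)
```

The unscaled statistic ranges from 0 for identical samples to 1 for samples that do not overlap at all, which is why a threshold between those extremes is a natural way to flag a change in shape.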

### Shift in Probability

This set of tests examines the [shift in probability](https://docs.dbnl.com/creating-tests-in-distributional/suggested-testing-strategies/tests-that-specific-results-have-matching-behavior-matched-runs) between the baseline and the experiment. The test is considered unsuccessful if over 10% of the results exhibit a shift greater than 0.1 (10%) when comparing the baseline probabilities with the experimental probabilities.

<figure><img src="/files/ULigwrlMnIVz0VRlWmcj" alt=""><figcaption></figcaption></figure>

<details>

<summary>Example Test Payload - find full list of test payloads in the tutorial zip file</summary>

<pre class="language-yaml"><code class="lang-yaml">{
    "name": "Gr.0: Parametric Difference: Probability_Bad",
    "description": "Testing the shift in probability run over run",
    "tag_names": [
        "ProbabilityChangeDiff"
    ],
    "assertion": {
        "name": "less_than",
        "params": {
            "other": 0.1
        }
    },
    "statistic_name": "percentile",
    "statistic_params": {
        "percentage": 0.9
    },
    "statistic_inputs": [
        {
            "select_query_template": {
                "select": "abs({EXPERIMENT}.Probability_Bad - {BASELINE}.Probability_Bad)"
            }
        }
    ]
}
</code></pre>

</details>
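The payload above can be read as "the 90th percentile of the per-result absolute shift must stay below 0.1". The sketch below mirrors that logic in pure Python, assuming the two runs are matched so that row *i* in the experiment corresponds to row *i* in the baseline. The interpolation rule in `percentile` is an assumption, since dbnl's exact percentile method is not described here, and all names and values are hypothetical.

```python
def percentile(values, fraction):
    """Percentile with linear interpolation between closest ranks
    (an assumed convention; dbnl's exact method may differ)."""
    ordered = sorted(values)
    position = fraction * (len(ordered) - 1)
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    weight = position - lower
    return ordered[lower] * (1 - weight) + ordered[upper] * weight


def shift_test_passes(experiment, baseline, fraction=0.9, limit=0.1):
    """Pass when the chosen percentile of per-result shifts is below limit."""
    shifts = [abs(e - b) for e, b in zip(experiment, baseline)]
    return percentile(shifts, fraction) < limit


# Hypothetical matched results: the same 10 customers scored in both runs.
baseline_probs = [0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80, 0.90, 0.15]
stable_probs = [p + 0.01 for p in baseline_probs]
drifted_probs = [p + 0.01 for p in baseline_probs[:7]] + [0.30, 0.40, 0.65]

stable_ok = shift_test_passes(stable_probs, baseline_probs)
drifted_ok = shift_test_passes(drifted_probs, baseline_probs)
print(stable_ok, drifted_ok)
```

In the drifted example, three of the ten results move by 0.5, so the 90th percentile of the shifts lands well above the 0.1 limit and the check fails.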

### Minimum Performance

This set of tests pertains to the [minimum performance threshold](https://docs.dbnl.com/creating-tests-in-distributional/suggested-testing-strategies/tests-that-a-given-distribution-has-certain-properties) of the experimental model. The test is deemed unsuccessful if the performance of any run-level metrics falls below 0.8 (80%).

<figure><img src="/files/MRvvqBALQDJ3daALzx5t" alt=""><figcaption><p>Test configuration</p></figcaption></figure>

{% hint style="info" %}
Since run-level data like `Model_Accuracy` are scalar values, we use the `scalar` statistic name to indicate that we are comparing the value itself to the assertion.
{% endhint %}

<details>

<summary>Example Test Payload - find full list of test payloads in the tutorial zip file</summary>

```json
{
    "name": "Gr.1: Minimum Performance: Model_Accuracy",
    "description": "Testing that we are reaching our minimum requirements for performance",
    "tag_names": [
        "MinAcceptablePerf"
    ],
    "assertion": {
        "name": "greater_than",
        "params": {
            "other": 0.8
        }
    },
    "statistic_name": "scalar",
    "statistic_params": {},
    "statistic_inputs": [
        {
            "select_query_template": {
                "select": "{EXPERIMENT}.Model_Accuracy"
            }
        }
    ]
}
```

</details>
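For context, the run-level scalars checked by this group could be produced along the following lines before being reported with a run. This is a minimal sketch in pure Python, not the tutorial's actual evaluation code: the label values `"good"` and `"bad"`, the choice of `"good"` as the positive class, and the sample data are all assumptions.

```python
def classification_metrics(y_true, y_pred, positive="good"):
    """Accuracy, precision, recall, and F1 for a binary classifier."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "Model_Accuracy": correct / len(pairs),
        "Model_Precision": precision,
        "Model_Recall": recall,
        "Model_F1": f1,
    }


# Hypothetical labels, for illustration only.
target = ["good"] * 4 + ["bad"] * 4 + ["good", "bad"]
predicted = ["good"] * 4 + ["bad"] * 4 + ["bad", "bad"]

metrics = classification_metrics(target, predicted)
# Mirror the payload's strict `greater_than` 0.8 assertion.
failing = [name for name, value in metrics.items() if not value > 0.8]
print(metrics, failing)
```

Because the assertion is a strict `greater_than`, a metric that lands exactly on 0.8 (recall in this example) is reported as failing.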

### Relative performance

This set of tests is concerned with the relative performance of the run-level metrics. A test is considered unsuccessful if there's a performance shift greater than 0.1 (10%) for any of the run-level metrics when comparing a baseline run to an experimental run.

<figure><img src="/files/FfevroaEqj7fBgJThq0f" alt=""><figcaption><p>Test configuration</p></figcaption></figure>

<details>

<summary>Example Test Payload - find full list of test payloads in the tutorial zip file</summary>

```yaml
{
    "name": "Gr.2: Relative Performance: Model_Accuracy",
    "description": "Testing that we are reaching our relative performance requirements",
    "tag_names": [
        "RelativePerf"
    ],
    "assertion": {
        "name": "close_to",
        "params": {
            "other": 0.0,
            "tolerance": 0.1
        }
    },
    "statistic_name": "scalar",
    "statistic_inputs": [
        {
            "select_query_template": {
                "select": "abs({EXPERIMENT}.Model_Accuracy - {BASELINE}.Model_Accuracy)"
            }
        }
    ]
}
```

</details>
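The three assertion names used so far (`less_than`, `greater_than`, `close_to`) can be thought of as simple predicates on the computed statistic. The sketch below is a hypothetical evaluator, not dbnl's code. In particular, whether `close_to` treats the tolerance boundary as inclusive is an assumption.

```python
def evaluate_assertion(statistic, assertion):
    """Apply a payload-style assertion to a computed statistic."""
    name = assertion["name"]
    params = assertion["params"]
    if name == "less_than":
        return statistic < params["other"]
    if name == "greater_than":
        return statistic > params["other"]
    if name == "close_to":
        # Assumed semantics: within `tolerance` of `other`, inclusive.
        return abs(statistic - params["other"]) <= params["tolerance"]
    raise ValueError(f"unsupported assertion: {name}")


close_to = {"name": "close_to", "params": {"other": 0.0, "tolerance": 0.1}}

# Hypothetical run-level accuracies for an experiment and a baseline.
small_change = evaluate_assertion(abs(0.91 - 0.88), close_to)
large_change = evaluate_assertion(abs(0.70 - 0.88), close_to)
print(small_change, large_change)
```

With `other` set to 0.0, the `close_to` assertion on an absolute difference behaves like an upper bound on how far the metric may move between runs.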

### Consistency of account/personal information

This set of tests pertains to the consistency of outcomes associated with either account or personal information. A test is deemed unsuccessful if the difference, calculated using either scaled [Kolmogorov-Smirnov or scaled chi-squared statistics](https://docs.dbnl.com/creating-tests-in-distributional/suggested-testing-strategies/tests-that-columns-are-similarly-distributed), between baseline and experimental outcomes exceeds 0.55.

<figure><img src="https://lh7-us.googleusercontent.com/docsz/AD_4nXfI3gGo-ZkQDO11VDuMOIW3g9M65OjSrMaBDPc-tpEWhYBAZWgnlXWWuF6LbEyT85WOcgyNNxT1LO-uGOLR8jQJxTq9dVmAZYn9bU3uXy73zbbBvjkjtwmVDMTQs0BiasWjmbrOvbXUWnH6vwOSKKhywNxK?key=O9L8ng0orkd7yT50-H8c7w" alt=""><figcaption><p>Test configuration</p></figcaption></figure>

<details>

<summary>Example Test Payload - find full list of test payloads in the tutorial zip file</summary>

```yaml
{
    "name": "Gr.3: Account_information: Account_Balance",
    "description": "Testing difference in feature distributions between training and testing data",
    "tag_names": [
        "DataConsistency"
    ],
    "assertion": {
        "name": "close_to",
        "params": {
            "other": 0.0,
            "tolerance": 0.55
        }
    },
    "statistic_name": "scaled_chi2_stat",
    "statistic_params": {},
    "statistic_inputs": [
        {
            "select_query_template": {
                "select": "{EXPERIMENT}.Account_Balance"
            }
        },
        {
            "select_query_template": {
                "select": "{BASELINE}.Account_Balance"
            }
        }
    ]
}
```

</details>
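For categorical columns such as `Account_Balance`, the comparison is based on a chi-squared statistic. The sketch below computes the standard, unscaled Pearson chi-squared statistic for a 2 x K table of category counts in pure Python. dbnl's `scaled_chi2_stat` applies a scaling that is not reproduced here, so the 0.55 tolerance does not apply to this raw value. The category codes and sample data are hypothetical.

```python
from collections import Counter


def chi2_statistic(experiment, baseline):
    """Pearson chi-squared statistic for a 2 x K table of category counts."""
    exp_counts = Counter(experiment)
    base_counts = Counter(baseline)
    total = len(experiment) + len(baseline)
    statistic = 0.0
    for category in set(exp_counts) | set(base_counts):
        column_total = exp_counts[category] + base_counts[category]
        for counts, row_total in ((exp_counts, len(experiment)),
                                  (base_counts, len(baseline))):
            # Expected count if both runs shared one category distribution.
            expected = row_total * column_total / total
            statistic += (counts[category] - expected) ** 2 / expected
    return statistic


# Hypothetical category codes for Account_Balance.
baseline_balance = ["1", "2", "3", "4"] * 25
same_balance = list(baseline_balance)
skewed_balance = ["1"] * 70 + ["2"] * 10 + ["3"] * 10 + ["4"] * 10

unchanged = chi2_statistic(same_balance, baseline_balance)
skewed = chi2_statistic(skewed_balance, baseline_balance)
print(unchanged, skewed)
```

The raw statistic grows with sample size, which is one reason a scaled variant is more convenient when a single threshold must work across runs of different lengths.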

### Consistency of API endpoints

This set of tests focuses on the consistency of outcomes associated with different third-party API endpoints. A test is considered unsuccessful if the difference, calculated using either scaled Kolmogorov-Smirnov or scaled [chi-squared statistics](https://docs.dbnl.com/creating-tests-in-distributional/suggested-testing-strategies/tests-that-columns-are-similarly-distributed), between baseline and experimental outcomes exceeds 0.55.

<figure><img src="https://lh7-us.googleusercontent.com/docsz/AD_4nXe2fsMGK9sI45C2Or0btLezsmyHv6GVmYwB4-dKmm83Z6I6A-dXcF1urJ_RfDY4NV8sHXFvOudOMqWbOcsgBAXO1-RlmQfh5f293WthDavlvHzZ-QlXnoxLZKfXWnhVLuj-umtwSnfVdYlQaGgpx1oatKQ?key=O9L8ng0orkd7yT50-H8c7w" alt=""><figcaption><p>Test configuration</p></figcaption></figure>

<details>

<summary>Example Test Payload - find full list of test payloads in the tutorial zip file</summary>

```yaml
{
    "name": "Gr.4: API:Employment_Veri.: Length_of_current_employment",
    "description": "Testing difference in feature distributions between training and testing data",
    "tag_names": [
        "ApplicationConsistency"
    ],
    "assertion": {
        "name": "close_to",
        "params": {
            "other": 0.0,
            "tolerance": 0.55
        }
    },
    "statistic_name": "scaled_chi2_stat",
    "statistic_params": {},
    "statistic_inputs": [
        {
            "select_query_template": {
                "select": "{EXPERIMENT}.Length_of_current_employment"
            }
        },
        {
            "select_query_template": {
                "select": "{BASELINE}.Length_of_current_employment"
            }
        }
    ]
}
```

</details>

### Consistency of credit request

This set of tests pertains to the consistency of outcomes associated with credit requests. A test is deemed unsuccessful if the difference, calculated using either scaled Kolmogorov-Smirnov or [scaled chi-squared statistics](https://docs.dbnl.com/creating-tests-in-distributional/suggested-testing-strategies/tests-that-columns-are-similarly-distributed), between baseline and experimental outcomes exceeds 0.55.

<figure><img src="https://lh7-us.googleusercontent.com/docsz/AD_4nXfjYJF042U9DM2dpxqEsvzQIugpewPRJ4jk234_9ZWOSqtEP8P2iYVrhJ0hylBwn3vhDwzXVOgRL0eiEfPBoOQmm0xY8aCZhrQ3lDrt5uUYG-oDgAX6otk_zOX14tt6NOvJUnCFA3pj80Tf6016zFPoCZRR?key=O9L8ng0orkd7yT50-H8c7w" alt=""><figcaption><p>Test configuration</p></figcaption></figure>

<details>

<summary>Example Test Payload - find full list of test payloads in the tutorial zip file</summary>

```yaml
{
    "name": "Gr.5: Credit_Request: Credit_Amount",
    "description": "Testing difference in feature distributions between training and testing data",
    "tag_names": [
        "PurposeConsistency"
    ],
    "assertion": {
        "name": "close_to",
        "params": {
            "other": 0.0,
            "tolerance": 0.55
        }
    },
    "statistic_name": "scaled_ks_stat",
    "statistic_params": {},
    "statistic_inputs": [
        {
            "select_query_template": {
                "select": "{EXPERIMENT}.Credit_Amount"
            }
        },
        {
            "select_query_template": {
                "select": "{BASELINE}.Credit_Amount"
            }
        }
    ]
}
```

</details>

### Performance of the system

This set of tests focuses on the consistency of outcomes related to the system's performance. A test is considered unsuccessful if the difference, calculated using either scaled Kolmogorov-Smirnov or scaled [chi-squared statistics](https://docs.dbnl.com/creating-tests-in-distributional/suggested-testing-strategies/tests-that-columns-are-similarly-distributed), between baseline and experimental outcomes exceeds 0.55.

<figure><img src="https://lh7-us.googleusercontent.com/docsz/AD_4nXcJ2PQz0Lp-SdzC4OgD2JUMQByok-nRcQUBiLHS-zJxs7ZpWW1kyhCmqTbQMZjbtTykQP62MuouP7wBoGNtIryxonUkfD_A_9b4M38ZvsLg0dypFLXfJiuzQwviBs7CeA7ui8JmROgzitMsL2ZSnyyQkTFk?key=O9L8ng0orkd7yT50-H8c7w" alt=""><figcaption><p>Test configuration</p></figcaption></figure>

<details>

<summary>Example Test Payload - find full list of test payloads in the tutorial zip file</summary>

```yaml
{
    "name": "Gr.6: System Performance: latency",
    "description": "Testing difference in feature distributions between training and testing data",
    "tag_names": [
        "SystemPerformance"
    ],
    "assertion": {
        "name": "close_to",
        "params": {
            "other": 0.0,
            "tolerance": 0.55
        }
    },
    "statistic_name": "scaled_ks_stat",
    "statistic_params": {},
    "statistic_inputs": [
        {
            "select_query_template": {
                "select": "{EXPERIMENT}.latency__ms"
            }
        },
        {
            "select_query_template": {
                "select": "{BASELINE}.latency__ms"
            }
        }
    ]
}
```

</details>

## Integration Testing

The objective of the integration test is to detect and alert when the third-party API endpoints begin to exhibit different behavior. This involves conducting system tests against a known baseline every month, or at whatever cadence is appropriate.

Given the aim of ensuring consistency across different third-party API endpoints, the same dataset is used to establish the baseline and conduct the monthly experiment.

After nine months of running the integration test, test failures have been observed in two of the monthly test sessions.

<figure><img src="https://lh7-us.googleusercontent.com/docsz/AD_4nXejS25Cuo6mQF8xORqu1sX2ZvaZcsAmVvij2g4fRKKReovwLzS0pjnqFHMAlyJfxBf3CM6JMPV1Ikyo8n2HkiZuMzD0OLQ-MxcscKwmqOToKy5TGZ7w53WLUMdXv_4xRPEOEX9-kHmtI97fhjJLc63aa_Q?key=O9L8ng0orkd7yT50-H8c7w" alt=""><figcaption><p>Completed Test Sessions for Integration Test</p></figcaption></figure>

### **February\_API\_Testing versus Baseline**

In the February integration test session, one test assertion fails: the consistency of the length of current employment. This information, linked to employment verification outcomes, is obtained through a third-party API. Upon investigation, it's evident that for February, the third-party API indicates all bank customers as unemployed. This highly improbable outcome is surely worthy of investigation; it does not, however, trigger a failure of any of the tests on the probability scores.

<figure><img src="https://lh7-us.googleusercontent.com/docsz/AD_4nXcQ21HLB0XHicHVNREJdiCNZTbQo5Qg64WEkAftsPQ-qAmH_uGH90SfHYeGKvV_VR43DlzA_vbYHNZpFaaLA8M_Q4W7Is9-Lp2aJnJIdo7fjuNgT_bHI2WDJ-T1ZDlIWIBUd1yPNzHhcOCGoZospgSTp1Jj?key=O9L8ng0orkd7yT50-H8c7w" alt=""><figcaption><p>Failed Test - GR. 4: API:Employment_Veri.: Length_of_current_employment</p></figcaption></figure>

### **June\_API\_Testing versus Baseline**

During the June integration test, two tests fail. The first failed test indicates that the third-party API, which provides credit history, asserts that all bank customers have lines of credit with other banks. Considering the baseline, this appears highly unlikely.

<figure><img src="https://lh7-us.googleusercontent.com/docsz/AD_4nXcDaMmW2wineTJwyqvCrFhJCJACRQWqgQRQAnWfx4xUf_Pco2qBBuk6JQft7ML_fX10PouQGd5LtKvnHOjxk77yuy9VzKSg0RvmFCRc5Nj7SAPH3e6Op0n3mLuikdjH4yAiQ12VbfgJao9CRURK7fYl7BMM?key=O9L8ng0orkd7yT50-H8c7w" alt=""><figcaption><p>Failed Test - Gr. 4: API:Credit_History: Current_Credits</p></figcaption></figure>

The second failed test pertains to the shape of the probability distribution. It becomes evident that the error associated with the third-party API is causing the model to generate inconsistent predictions, which could potentially harm consumers.

<figure><img src="https://lh7-us.googleusercontent.com/docsz/AD_4nXeOlqGy_GPu5IxfdQkoznj2P9mqBFmJyxaOLDw-DmzcteZF3v-3Ljn3KmRKMnx8ywfJux_JsybD8up1MKnrojKHivYAorDqftRw337GJhNvmqusWOEJxL5yk68Hai7s6G03wafGUDHDrIt2qz9nM2c6c6KQ?key=O9L8ng0orkd7yT50-H8c7w" alt=""><figcaption><p>Failed Test - Gr. 4: Non Parametric Difference: Probability_Bad</p></figcaption></figure>

### **Conclusion**

In conclusion, integration testing makes it possible to detect if and when third-party APIs begin to behave differently than expected. Furthermore, with the extensive set of tests, it is possible to start determining whether these behavioral changes are likely to cause consumer harm.

## Regression Testing

The second scenario examines a live production system, with each run representing one month of production data. Due to the use of live data, there are some differences between the strategy used for integration testing and the one used for regression testing.

First, note that the Evaluation component is no longer part of the system. With live data, results no longer have a ground truth, so the component is excluded. Consequently, the Target\_Worthiness outcome for each result merely serves as a placeholder (so as to reuse the run\_config from the integration test).

<figure><img src="https://lh7-us.googleusercontent.com/docsz/AD_4nXfbrLfV8K0y0eUlW6tKAnOLzwseI1UDMKol9jwH-N26p09Kdkji3c7kQbrNS3T1waeKVH9DK8QfjQv0o2kPTX4X3PxRn7ZQoTFp5_yOJ5pKEIZzE6cBf5gBo4pB4opwdBPVE6IU16RJHqwBvTPaNtxNXWQ?key=O9L8ng0orkd7yT50-H8c7w" alt=""><figcaption><p>Non-existing target worthiness</p></figcaption></figure>

Second, as the system operates on live data, each of the runs contains a varying number of results, reflective of the circumstances observed during that month.

<figure><img src="https://lh7-us.googleusercontent.com/docsz/AD_4nXfNnfHqq1AtfkWOj3ds06nHpTZqN0qK0oRuqT325Y_xQXUI-dDA_TrX9tQH8V4KvPdnv8FAaO8SVqBWzbnplCRgiPqyK8nD5Oh3CAe8XZJ4F30f-7ld4HW_Bri3rhXmpaZZXhDH9hKurR6WidTK71bRB9NV?key=O9L8ng0orkd7yT50-H8c7w" alt=""><figcaption><p>List of DBNL Runs showing varying number of results</p></figcaption></figure>

Lastly, the set of DBNL tests has been trimmed to exclude those requiring a ground truth. Therefore, instead of the original 31 DBNL tests, the regression testing scenario employs only 21. The groups of tests included in the regression scenario are:

* **Consistency of account/personal information**
* **Consistency of API endpoints**
* **Consistency of credit request**
* **Performance of the system**
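Since each group of tests carries its own tag, one way to derive the trimmed regression suite is to filter the full list of payloads by tag. The sketch below assumes the tag names from the example payloads earlier in this tutorial. The helper name and the abbreviated payloads are hypothetical, and real payloads contain more fields.

```python
# Tags used by the four groups kept in the regression scenario,
# taken from the example payloads earlier in this tutorial.
REGRESSION_TAGS = {
    "DataConsistency",
    "ApplicationConsistency",
    "PurposeConsistency",
    "SystemPerformance",
}


def select_regression_tests(test_payloads):
    """Keep payloads carrying at least one regression tag."""
    return [
        payload for payload in test_payloads
        if REGRESSION_TAGS.intersection(payload.get("tag_names", []))
    ]


# Abbreviated payloads: only the fields needed for filtering are shown.
all_tests = [
    {"name": "Gr.0: Non Parametric Difference: Probability_Bad",
     "tag_names": ["ProbabilityChange"]},
    {"name": "Gr.1: Minimum Performance: Model_Accuracy",
     "tag_names": ["MinAcceptablePerf"]},
    {"name": "Gr.2: Relative Performance: Model_Accuracy",
     "tag_names": ["RelativePerf"]},
    {"name": "Gr.3: Account_information: Account_Balance",
     "tag_names": ["DataConsistency"]},
]

kept = select_regression_tests(all_tests)
print([payload["name"] for payload in kept])
```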

Unlike the integration testing scenario that relies on a single baseline, this scenario uses multiple baselines to account for various seasonal trends. Each experimental run is thus tested against the same month from the previous year, the preceding month, and a common baseline.
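The baseline selection described above can be sketched as a small helper that, for a given experimental month, returns the three comparison targets. This is an illustrative assumption about how one might organize the runs. The function name, the dictionary keys, and the `"Baseline"` label are hypothetical and do not correspond to dbnl API calls.

```python
from datetime import date


def baseline_months(experiment_month):
    """Return the comparison periods for one experimental run."""
    year, month = experiment_month.year, experiment_month.month
    if month == 1:
        # January is compared against December of the previous year.
        previous_month = date(year - 1, 12, 1)
    else:
        previous_month = date(year, month - 1, 1)
    return {
        "same_month_previous_year": date(year - 1, month, 1),
        "preceding_month": previous_month,
        # The common baseline is a fixed run, so it has no date here.
        "common_baseline": "Baseline",
    }


baselines = baseline_months(date(2024, 3, 1))
print(baselines)
```

For the March 2024 run, this yields March 2023, February 2024, and the common baseline, which matches the three test sessions reviewed in this scenario.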

The objective of the regression test is to verify the consistency of different outcomes introduced into the system through user inputs or third-party APIs.

Upon reviewing the test sessions for the regression tests, it is observed that several tests fail in two of the test sessions. The test sessions with failing tests are those comparing March 2024 to February 2024, and March 2024 to the known baseline. It is, however, noteworthy that all tests pass when comparing March 2024 to March 2023.

<figure><img src="https://lh7-us.googleusercontent.com/docsz/AD_4nXc765_ZnEeLHsiznoJ56a9l59kHlrmykiR_JdVTNutnxvF4nDFZFCTy9Rt-sf33xPpVDLy_jzRszIhfLGagcCRw5KkFcKpFZe2eR4r6kMAt6IeJzhd3lBkhmKylechtmTwBDEAdJQGW8t35Ec3GQjwFj7s?key=O9L8ng0orkd7yT50-H8c7w" alt=""><figcaption><p>Completed Test Sessions for Regression Test</p></figcaption></figure>

### **March - 2024 versus February - 2024**

When comparing March 2024 to February 2024, four different tests fail.

The first test to fail compares the age distribution of the bank customers who submitted a credit request in March against the age distribution of the baseline population. The consistency test indicates a change in the age distribution. However, upon visual inspection, this change in distributions could be considered acceptable. In that case, the threshold could be adjusted accordingly.

<figure><img src="https://lh7-us.googleusercontent.com/docsz/AD_4nXe_YlQ-2YrVM3fBdwwxJB3xk8i2cIGMcaSrBGy1Ci9QYwZBJVCUqfbz7zwEB3u97FcZVZ1M0fzyPaV1Bv6gw7XmBItyKd-R9QnjGkkAl2VEM0P1unov94XC1Q_uM-LOB2wXMvvoM3HGjGj5IiFCDtmW6dY?key=O9L8ng0orkd7yT50-H8c7w" alt=""><figcaption><p>Failed Test - Gr. 3: Personal_information: Age___years_</p></figcaption></figure>

The second failed test pertains to the gender and marital status of the bank customers. It becomes evident that a significant number of customers who submitted requests in March were single males, compared to the baseline population.

<figure><img src="https://lh7-us.googleusercontent.com/docsz/AD_4nXfXEbf-bxFArx7znDfciNfGtaDWqrGHc4zFTW4S_sHV8mRjLE-wtdOQIbOxm_udW81xWD3AROhLx_3mEqSADcd4umYHZDVoiaKfLqKt4Bxs4m6hzp85hAAMboEJiK2GecyNxFT6NmjpwTFqJCGWaZStE41y?key=O9L8ng0orkd7yT50-H8c7w" alt=""><figcaption><p>Failed Test - Gr. 3: Personal_information: Sex___Marital_Status</p></figcaption></figure>

The third failed test pertains to the credit amount of customers submitting credit requests in March. It reveals a difference between the March population and the baseline population. However, upon visual inspection, this test failure should not be a cause for concern.

<figure><img src="https://lh7-us.googleusercontent.com/docsz/AD_4nXcTkdpKVS5JmtvCTArtqU4ADIkvjaf8rLmgR8HqXS9aPgiR0CRTsS3LK9tjrCQctPkXtXL8nfM6xH0m2AmC7E5A29HsuYSQejm9dIQDAvDMGpSJZWFPya123kTPMqkNIiUmLA3zD8avV3qsV5dXpuKymd7e?key=O9L8ng0orkd7yT50-H8c7w" alt=""><figcaption><p>Failed Test: Gr. 5: Credit_request: Credit_amount</p></figcaption></figure>

The fourth failed test is associated with the purpose of the credit requests. It's evident that there's a significant difference in the credit purpose between the March 2024 customers and the baseline population.

<figure><img src="https://lh7-us.googleusercontent.com/docsz/AD_4nXeYuXB0WEafK-ftHjyH4DumfRkHMLpAaadQ09DhXS3gQSIQF8gznwhMnqqFWmUNboUF5J5hnvIv0FYpIVAI-BBZqXyLCyT9KdyBOoAYgJ-HIIWaLVRkVhoYL8-pVDOSvXpS8EV4WiS8tuEaKJxt1-yO1Uqn?key=O9L8ng0orkd7yT50-H8c7w" alt=""><figcaption><p>Failed Test - Gr. 5: Credit_Request: Purpose</p></figcaption></figure>

### **March - 2024 versus Baseline**

When comparing March 2024 to the Baseline, we observe the same four tests failing as mentioned above. These tests pertain to the age distribution, gender and marital status, credit amount, and purpose of the credit requests of the bank customers.

Therefore, the same conclusions from the test sessions can be drawn.

### **Conclusion**

In conclusion, testing against different baselines at different time intervals makes it easier to determine whether behavioral changes are expected.


