Predicting Credit Worthiness Using Tabular Data

This tutorial provides a comprehensive understanding of dbnl within the scope of tabular data. It guides you through the process of continuous testing of third-party endpoints.

The data files required for this tutorial are available in the following file.

Credit Worthiness Tutorial files

Credit Worthiness Introduction

A bank customer applies for a line of credit; to assess the creditworthiness of the customer, the bank retrieves data from its own warehouse and several third-party API endpoints. This data is then utilized to predict the customer's creditworthiness.

Motivation

Predicting creditworthiness accurately is a cornerstone of banking operations, enabling financial institutions to manage risk effectively and ensure customers are not overburdened with unmanageable credit. With the advent of data-driven decision making, banks can leverage vast amounts of data to make these predictions more effectively.

This tutorial demonstrates the use of dbnl on tabular data to ensure consistent predictions of creditworthiness, both through continuous testing of third-party endpoints and through testing of a live production system. These steps are crucial because predicting creditworthiness involves integrating data from various sources, including a bank's own data warehouse and several third-party API endpoints. Furthermore, in a live environment, the system interacts with real-time data, introducing additional complexities.

Defining the Credit Worthiness System

The creditworthiness prediction system incorporates several third-party endpoints. To prevent potential harm to consumers, it's crucial to regularly test these endpoints for expected behavior. In this tutorial, we conduct monthly tests on these third-party endpoints against a pre-established baseline.

To complete this tutorial, the run config, test payload and data (as stored in runs) are needed. All can be found using the following link. The Credit Worthiness system consists of 11 different components.

The system used throughout the Credit Worthiness demo.
Component Definitions

For each component, we specify its columns and identify its owner.

Component: Credit_request

A bank customer submits a credit request that includes details about the intended purchase, its price, the repayment plan, and whether there's a co-signer for the line of credit.

Columns:

  • Purpose - Purpose for credit

  • Guarantors - Guarantors

  • Instalment_per_cent - Installment %

  • Duration_of_Credit__month_ - Duration of credit (months)

  • Credit_Amount - Amount of Credit


Component: Unique_ID

An identifier, used by the bank, facilitates the association of customer information with their internal records and enables data retrieval from third-party applications.

Columns:

  • SSN - Unique ID


Component: Data Warehouse

The bank maintains a repository of customer-related information, which, for simplicity, is categorized into account information and personal information.

In the context of continuous testing, the actual creditworthiness classification for each customer, based on their request, is known.

Columns:

  • Target_Worthiness - True classes


Component: Account Information

Subset of the Data Warehouse containing information about the value of the assets in a customer account.

Columns:

  • Account_Balance - Account balance

  • Value_Savings_Stocks - Savings/stock value


Component: Personal Information

Subset of the Data Warehouse containing personal information about a customer of the bank.

Columns:

  • Sex___Marital_Status - Sex/Marital status

  • Most_valuable_available_asset - Most valuable available asset

  • Age__years_ - Age (years)

  • No_of_dependents - Number of dependents


Component: API:Credit_History

Third party API providing the information about credit history for a given bank customer.

Columns:

  • Payment_Status_of_Previous_Credit - Payment Status

  • No_of_Credits_at_this_Bank - Number of credits at this Bank


Component: API:Credit_Report

Third party API providing a FICO (credit) score for a given bank customer.

Columns:

  • FICO_score - FICO score


Component: API:Employment_Veri. (verification)

Third party API providing employment verification for a given bank customer.

Columns:

  • Length_of_current_employment - Length of current employment

  • Foreign_Worker - Foreign worker

  • Occupation - Occupation


Component: API:Rental_History

Third party API providing information about the rental history of a given customer.

Columns:

  • Duration_in_Current_address - Duration in current address

  • Type_of_apartment - Type of apartment


Component: XGB:Classifier

Output of the XGBoost classifier used to predict whether a line of credit should get approved for a given customer.

Columns:

  • Predicted_Worthiness - Predicted classes

  • Probability_Bad - Probability for class bad

  • Probability_Good - Probability for class good

  • Latency__ms - Latency for the model


Component: Evaluation

Run-level metrics coming from comparing the predicted credit worthiness class to the true credit worthiness class.

Scalars:

  • Model_Accuracy - Accuracy for the model

  • Model_F1 - F1-score for the model

  • Model_Precision - Precision for the model

  • Model_Recall - Recall for the model
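
The component and column definitions above can be collected into a single mapping, which makes it easy to see which columns depend on third-party endpoints. This is only a sketch: the dictionary layout and the helper `third_party_columns` are assumptions made for illustration and are not the dbnl run configuration schema (the complete run configuration is in the tutorial zip file).

```python
# Illustrative summary of the components described above.
# The layout is an assumption for readability, NOT the dbnl run config schema.
COMPONENT_COLUMNS = {
    "Credit_request": [
        "Purpose", "Guarantors", "Instalment_per_cent",
        "Duration_of_Credit__month_", "Credit_Amount",
    ],
    "Unique_ID": ["SSN"],
    "Data Warehouse": ["Target_Worthiness"],
    "Account Information": ["Account_Balance", "Value_Savings_Stocks"],
    "Personal Information": [
        "Sex___Marital_Status", "Most_valuable_available_asset",
        "Age__years_", "No_of_dependents",
    ],
    "API:Credit_History": [
        "Payment_Status_of_Previous_Credit", "No_of_Credits_at_this_Bank",
    ],
    "API:Credit_Report": ["FICO_score"],
    "API:Employment_Veri.": [
        "Length_of_current_employment", "Foreign_Worker", "Occupation",
    ],
    "API:Rental_History": ["Duration_in_Current_address", "Type_of_apartment"],
    "XGB:Classifier": [
        "Predicted_Worthiness", "Probability_Bad",
        "Probability_Good", "Latency__ms",
    ],
    "Evaluation": [],  # this component only carries run-level scalars
}

# Run-level scalars reported by the Evaluation component.
EVALUATION_SCALARS = ["Model_Accuracy", "Model_F1", "Model_Precision", "Model_Recall"]


def third_party_columns(components):
    """Return columns whose component name starts with 'API:' (hypothetical helper)."""
    return [
        column
        for name, columns in components.items()
        if name.startswith("API:")
        for column in columns
    ]
```

Calling `third_party_columns(COMPONENT_COLUMNS)` lists the columns that the integration tests later in this tutorial are meant to guard.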

Example Run Configuration - find complete run configuration in the tutorial zip file

Example Column Configuration

Example Scalar Configuration

Example Component DAG

Creating dbnl Tests

To test the credit worthiness system, 7 different groupings of tests are used. These are denoted using the dbnl test tagging strategy.

Change in probability

This set of tests pertains to the distribution shape of the probability scores. Utilizing the scaled Kolmogorov-Smirnov statistic, the test is deemed unsuccessful if the difference between the distribution shapes of the baseline and experimental probabilities exceeds 0.5.
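
To make the idea concrete, the following sketch computes the plain (unscaled) two-sample Kolmogorov-Smirnov statistic, which is the largest gap between the empirical cumulative distributions of two samples. It is an assumption-laden illustration: dbnl uses a scaled variant whose scaling is not reproduced here, and the function names are hypothetical, so the 0.5 default only mirrors the threshold quoted above.

```python
from bisect import bisect_right


def ks_statistic(baseline, experiment):
    """Plain two-sample KS statistic: max gap between the two empirical CDFs.

    Note: this is the unscaled statistic, not dbnl's scaled variant.
    """
    a, b = sorted(baseline), sorted(experiment)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    max_gap = 0.0
    # The largest gap between two step functions occurs at an observed value.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        max_gap = max(max_gap, abs(cdf_a - cdf_b))
    return max_gap


def change_in_probability_passes(baseline, experiment, threshold=0.5):
    """Hypothetical check: pass when the statistic does not exceed the threshold."""
    return ks_statistic(baseline, experiment) <= threshold
```

For example, `ks_statistic(baseline_probability_bad, experiment_probability_bad)` would be applied to the `Probability_Bad` values of the two runs, where both argument names are placeholders for lists of floats.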

Test configuration
Example Test Payload - find full list of test payloads in the tutorial zip file

Shift in Probability

This set of tests examines the shift in probability between the baseline and the experiment. The test is considered unsuccessful if over 10% of the results exhibit a shift greater than 0.1 (10%) when comparing the baseline probabilities with the experimental probabilities.
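
The rule above can be sketched in a few lines of Python. The sketch assumes the baseline and experiment results are row-aligned (the same customers in the same order), and the function names are hypothetical rather than part of dbnl.

```python
def fraction_shifted(baseline, experiment, shift=0.1):
    """Fraction of paired results whose probability moved by more than `shift`."""
    if len(baseline) != len(experiment):
        raise ValueError("baseline and experiment must be row-aligned")
    if not baseline:
        raise ValueError("at least one result is required")
    shifted = sum(1 for b, e in zip(baseline, experiment) if abs(b - e) > shift)
    return shifted / len(baseline)


def shift_test_passes(baseline, experiment, shift=0.1, max_fraction=0.1):
    """Fail only when over `max_fraction` of results shift by more than `shift`."""
    return fraction_shifted(baseline, experiment, shift) <= max_fraction
```

Row alignment is a natural assumption for the integration test later in this tutorial, where the same dataset is used for the baseline and for each monthly experiment.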

Example Test Payload - find full list of test payloads in the tutorial zip file

Minimum Performance

This set of tests pertains to the minimum performance threshold of the experimental model. The test is deemed unsuccessful if any run-level metric falls below 0.8 (80%).

Test configuration

Since run-level data like Model_Accuracy are scalar values, we use the scalar statistic name to indicate that we are comparing the value itself to the assertion.
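
As a rough illustration of how such run-level scalars could be produced and checked, the sketch below derives the four metrics from true and predicted classes and lists any that fall below the minimum. It assumes a binary problem with `"good"` treated as the positive class; the tutorial does not state how its metrics are computed, and both function names are hypothetical.

```python
def classification_metrics(y_true, y_pred, positive="good"):
    """Compute run-level scalars, treating `positive` as the positive class (assumption)."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "Model_Accuracy": correct / len(pairs),
        "Model_F1": f1,
        "Model_Precision": precision,
        "Model_Recall": recall,
    }


def below_minimum(metrics, minimum=0.8):
    """Names of the scalars that fall below the minimum performance threshold."""
    return [name for name, value in metrics.items() if value < minimum]
```

An empty list from `below_minimum` corresponds to every minimum-performance test passing.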

Example Test Payload - find full list of test payloads in the tutorial zip file

Relative performance

This set of tests is concerned with the relative performance of the run-level metrics. A test is considered unsuccessful if there's a performance shift greater than 0.1 (10%) for any of the run-level metrics when comparing a baseline run to an experimental run.
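
A minimal sketch of this comparison follows. It assumes the shift is measured as the absolute difference between the baseline and experimental value of each scalar; whether the actual tests use an absolute or one-sided difference is not stated here, and the function name is hypothetical.

```python
def relative_performance_failures(baseline, experiment, max_shift=0.1):
    """Names of run-level metrics that moved by more than `max_shift` (absolute difference)."""
    return sorted(
        name
        for name in baseline
        if abs(experiment[name] - baseline[name]) > max_shift
    )
```

Both arguments are dictionaries keyed by scalar name, for example `{"Model_Accuracy": 0.9, "Model_F1": 0.9}`.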

Test configuration
Example Test Payload - find full list of test payloads in the tutorial zip file

Consistency of account/personal information

This set of tests pertains to the consistency of outcomes associated with either account or personal information. A test is deemed unsuccessful if the difference, calculated using either scaled Kolmogorov-Smirnov or scaled chi-squared statistics, between baseline and experimental outcomes exceeds 0.55.
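
For categorical columns such as Sex___Marital_Status, a chi-squared comparison is the natural counterpart to Kolmogorov-Smirnov. The sketch below computes the plain Pearson chi-squared statistic for two samples of category labels. It is an illustration only: dbnl's scaled variant is not reproduced, so the raw value returned here is not comparable to the 0.55 threshold, and the function name is hypothetical.

```python
from collections import Counter


def chi_squared_statistic(baseline, experiment):
    """Pearson chi-squared statistic for a 2 x k table of category counts (unscaled)."""
    counts_b, counts_e = Counter(baseline), Counter(experiment)
    n_b, n_e = len(baseline), len(experiment)
    if n_b == 0 or n_e == 0:
        raise ValueError("both samples must be non-empty")
    total = n_b + n_e
    statistic = 0.0
    for category in set(counts_b) | set(counts_e):
        column_total = counts_b[category] + counts_e[category]
        for observed, row_total in ((counts_b[category], n_b), (counts_e[category], n_e)):
            # Expected count if both samples shared one category distribution.
            expected = row_total * column_total / total
            statistic += (observed - expected) ** 2 / expected
    return statistic
```

A value of zero means the two samples have identical category proportions; larger values indicate a larger difference.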

Test configuration
Example Test Payload - find full list of test payloads in the tutorial zip file

Consistency of API endpoints

This set of tests focuses on the consistency of outcomes associated with different third-party API endpoints. A test is considered unsuccessful if the difference, calculated using either scaled Kolmogorov-Smirnov or scaled chi-squared statistics, between baseline and experimental outcomes exceeds 0.55.

Test configuration
Example Test Payload - find full list of test payloads in the tutorial zip file

Consistency of credit request

This set of tests pertains to the consistency of outcomes associated with credit requests. A test is deemed unsuccessful if the difference, calculated using either scaled Kolmogorov-Smirnov or scaled chi-squared statistics, between baseline and experimental outcomes exceeds 0.55.

Test configuration
Example Test Payload - find full list of test payloads in the tutorial zip file

Performance of the system

This set of tests focuses on the consistency of outcomes related to the system's performance. A test is considered unsuccessful if the difference, calculated using either scaled Kolmogorov-Smirnov or scaled chi-squared statistics, between baseline and experimental outcomes exceeds 0.55.

Test configuration
Example Test Payload - find full list of test payloads in the tutorial zip file

Integration Testing

The objective of the integration test is to detect and alert when the third-party API endpoints begin to exhibit different behavior. This involves conducting system tests against a known baseline every month, or at whatever cadence is appropriate.

Given the aim of ensuring consistency across different third-party API endpoints, the same dataset is used to establish the baseline and conduct the monthly experiment.

After nine months of running the integration test, tests have failed in two of the monthly test sessions: February and June.

Completed Test Sessions for Integration Test

February_API_Testing versus Baseline

In the February integration test session, one test assertion fails: the consistency of the length of current employment. This information, linked to employment verification outcomes, is obtained through a third-party API. Upon investigation, it's evident that for February, the third-party API indicates all bank customers as unemployed. This highly improbable outcome is surely worthy of investigation; it does not, however, trigger a failure of any of the tests on the probability scores.

Failed Test - Gr. 4: API:Employment_Veri.: Length_of_current_employment

June_API_Testing versus Baseline

During the June integration test, two tests fail. The first failed test indicates that the third-party API, which provides credit history, asserts that all bank customers have lines of credit with other banks. Considering the baseline, this appears highly unlikely.

Failed Test - Gr. 4: API:Credit_History: Current_Credits

The second failed test pertains to the shape of the probability distribution. It becomes evident that the error associated with the third-party API is causing the model to generate inconsistent predictions, which could potentially harm consumers.

Failed Test - Gr. 4: Non Parametric Difference: Probability_Bad

Conclusion

In conclusion, the integration test makes it possible to detect if and when different APIs start to behave differently from what is expected. Furthermore, the extensive set of tests makes it possible to start determining whether these behavioral changes are likely to cause consumer harm.

Regression Testing

The second scenario examines a live production system, with each run representing one month of production data. Due to the use of live data, there are some differences between the strategy used for integration testing and the one used for regression testing.

First, note that the Evaluation component is no longer part of the system. With live data, results no longer have a ground truth, so the component is excluded. Consequently, the Target_Worthiness outcome for each result merely serves as a placeholder, so that the run_config from the integration test can be reused.

Non-existing target worthiness

Second, as the system operates on live data, each of the runs contains a varying number of results, reflective of the circumstances observed during that month.

List of DBNL Runs showing varying number of results

Lastly, the set of DBNL tests has been trimmed to exclude those requiring a ground truth. Therefore, instead of the original 31 DBNL tests, the regression testing scenario employs only 21. The groups of tests included in the regression scenario are:

  • Consistency of account/personal information

  • Consistency of API endpoints

  • Consistency of credit request

  • Performance of the system

Unlike the integration testing scenario that relies on a single baseline, this scenario uses multiple baselines to account for various seasonal trends. Each experimental run is thus tested against the same month from the previous year, the preceding month, and a common baseline.
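
The baseline selection described above can be sketched as follows. The function name and the returned keys are hypothetical, and the common baseline is represented only by the label `"Baseline"`; how runs are actually looked up in dbnl is outside the scope of this sketch.

```python
from datetime import date


def baselines_for(experiment_month):
    """Pick the three comparison points for a monthly run (hypothetical helper).

    `experiment_month` is a date whose day is ignored, e.g. date(2024, 3, 1).
    """
    year, month = experiment_month.year, experiment_month.month
    if month == 1:
        preceding = date(year - 1, 12, 1)  # January wraps to December of the prior year
    else:
        preceding = date(year, month - 1, 1)
    return {
        "same_month_previous_year": date(year - 1, month, 1),
        "preceding_month": preceding,
        "common_baseline": "Baseline",
    }
```

For March 2024, `baselines_for(date(2024, 3, 1))` selects March 2023, February 2024, and the common baseline, matching the test sessions discussed in this section.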

The objective of the regression test is to verify the consistency of different outcomes introduced into the system through user inputs or third-party APIs.

Upon reviewing the test sessions for the regression tests, it is observed that several tests fail in two of the test sessions. The test sessions with failing tests are those comparing March 2024 to February 2024, and March 2024 to the known baseline. It is, however, noteworthy that all tests pass when comparing March 2024 to March 2023.

Completed Test Sessions for Regression Test

March - 2024 versus February - 2024

When comparing March 2024 to February 2024, four different tests fail.

The first test to fail examines the age distribution of the group of bank customers who submitted a credit request in March, against the age distribution of the baseline population. The consistency test indicates a change in the age distribution. However, upon visual inspection, this change in distributions could be considered acceptable. If this were the case, the threshold could be adjusted accordingly.

Failed Test - Gr. 3: Personal_information: Age___years_

The second failed test pertains to the gender and marital status of the bank customers. Compared to the baseline population, a noticeably larger share of the customers who submitted requests in March were single males.

Failed Test - Gr. 3: Personal_information: Sex___Marital_Status

The third failed test pertains to the credit amount of customers submitting credit requests in March. It reveals a difference between the March population and the baseline population. However, upon visual inspection, this test failure should not be a cause for concern.

Failed Test - Gr. 5: Credit_request: Credit_amount

The fourth failed test is associated with the purpose of the credit requests. It's evident that there's a significant difference in the credit purpose between the March 2024 customers and the baseline population.

Failed Test - Gr. 5: Credit_Request: Purpose

March - 2024 versus Baseline

When comparing March 2024 to the Baseline, we observe the same four tests failing as mentioned above. These tests pertain to the age distribution, gender and marital status, credit amount, and purpose of the credit requests of the bank customers.

Therefore, the same conclusions can be drawn as for the March 2024 versus February 2024 test session.

Conclusion

In conclusion, testing against different baselines at different time intervals makes it easy to determine whether or not behavioral changes are expected.
