Predicting Credit Worthiness Using Tabular Data
This tutorial shows how to use dbnl with tabular data. It walks through continuous testing of third-party endpoints and regression testing of a live production system.
The data files required for this tutorial are available in the following file.
Credit Worthiness Introduction
A bank customer applies for a line of credit; to assess the creditworthiness of the customer, the bank retrieves data from its own warehouse and several third-party API endpoints. This data is then utilized to predict the customer's creditworthiness.
Motivation
Predicting creditworthiness accurately is a cornerstone of banking operations, enabling financial institutions to manage risk effectively and ensure customers are not overburdened with unmanageable credit. With the advent of data-driven decision making, banks can leverage vast amounts of data to make these predictions more effectively.
This tutorial demonstrates the use of dbnl on tabular data to ensure consistent predictions of creditworthiness, both through continuous testing of third-party endpoints and through testing of a live production system. These steps are crucial because predicting creditworthiness involves integrating data from various sources, including a bank's own data warehouse and several third-party API endpoints. Furthermore, in a live environment, the system interacts with real-time data, which introduces additional complexity.
Defining the Credit Worthiness System
The creditworthiness prediction system incorporates several third-party endpoints. To prevent potential harm to consumers, it's crucial to regularly test these endpoints for expected behavior. In this tutorial, we conduct monthly tests on these third-party endpoints against a pre-established baseline.
To complete this tutorial, the run config, test payload and data (as stored in runs) are needed. All can be found using the following link. The Credit Worthiness system consists of 11 different components.
Component Definitions
For each component, we specify its columns and identify its owner.
Component: Credit_request
A bank customer submits a credit request that includes details about the intended purchase, its price, the repayment plan, and whether there's a co-signer for the line of credit.
Columns:
Purpose - Purpose for credit
Guarantors - Guarantors
Instalment_per_cent - Installment %
Duration_of_Credit__month_ - Duration of credit (months)
Credit_Amount - Amount of Credit
Component: Unique_ID
An identifier, used by the bank, facilitates the association of customer information with their internal records and enables data retrieval from third-party applications.
Columns:
SSN - Unique ID
Component: Data Warehouse
The bank maintains a repository of customer-related information, which, for simplicity, is categorized into account information and personal information.
In the context of continuous testing, the actual creditworthiness classification for each customer, based on their request, is known.
Columns:
Target_Worthiness - True classes
Component: Account Information
Subset of the Data Warehouse containing information about the value of the assets in a customer account.
Columns:
Account_Balance - Account balance
Value_Savings_Stocks - Savings/stock value
Component: Personal Information
Subset of the Data Warehouse containing personal information about a customer of the bank.
Columns:
Sex___Marital_Status - Sex/Marital status
Most_valuable_available_asset - Most valuable available asset
Age__years_ - Age (years)
No_of_dependents - Number of dependents
Component: API:Credit_History
Third-party API providing the credit history of a given bank customer.
Columns:
Payment_Status_of_Previous_Credit - Payment Status
No_of_Credits_at_this_Bank - Number of credits at this Bank
Component: API:Credit_Report
Third-party API providing a FICO (credit) score for a given bank customer.
Columns:
FICO_score - FICO score
Component: API:Employment_Veri. (verification)
Third-party API providing employment verification for a given bank customer.
Columns:
Length_of_current_employment - Length of current employment
Foreign_Worker - Foreign worker
Occupation - Occupation
Component: API:Rental_History
Third-party API providing information about the rental history of a given customer.
Columns:
Duration_in_Current_address - Duration in current address
Type_of_apartment - Type of apartment
Component: XGB:Classifier
Output of the XGBoost classifier used to predict whether a line of credit should get approved for a given customer.
Columns:
Predicted_Worthiness - Predicted classes
Probability_Bad - Probability for class bad
Probability_Good - Probability for class good
Latency__ms - Latency for the model
Component: Evaluation
Run-level metrics coming from comparing the predicted credit worthiness class to the true credit worthiness class.
Scalars:
Model_Accuracy - Accuracy for the model
Model_F1 - F1-score for the model
Model_Precision - Precision for the model
Model_Recall - Recall for the model
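As a rough illustration of how the four scalars above relate to the Target_Worthiness and Predicted_Worthiness columns, the following minimal sketch computes them in plain Python. It assumes the classes are encoded as the strings "good" and "bad" and that "good" is treated as the positive class; both are assumptions made for this example, and the function name is hypothetical rather than part of dbnl.

```python
def classification_metrics(y_true, y_pred, positive="good"):
    """Compute accuracy, precision, recall and F1 for a binary classifier."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / len(pairs)
    # Guard against empty denominators when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {
        "Model_Accuracy": accuracy,
        "Model_F1": f1,
        "Model_Precision": precision,
        "Model_Recall": recall,
    }


# Hypothetical values standing in for Target_Worthiness and Predicted_Worthiness.
target = ["good", "good", "bad", "good", "bad"]
predicted = ["good", "bad", "bad", "good", "good"]
metrics = classification_metrics(target, predicted)
```

Each value in the returned dictionary corresponds to one run-level scalar, which is why these metrics are reported once per run rather than once per row.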
Example Run Configuration - find complete run configuration in the tutorial zip file
Example Column Configuration
Example Scalar Configuration
Example Component DAG
Creating dbnl Tests
To test the credit worthiness system, 7 different groupings of tests are used. These are denoted using the dbnl test tagging strategy.
Change in probability
This set of tests pertains to the distribution shape of the probability scores. Utilizing the scaled Kolmogorov-Smirnov statistic, the test is deemed unsuccessful if the difference between the distribution shapes of the baseline and experimental probabilities exceeds 0.5.
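To make the idea of comparing distribution shapes concrete, here is a minimal sketch of the plain two-sample Kolmogorov-Smirnov statistic, which is the largest gap between the two empirical cumulative distribution functions. It does not reproduce the scaling that dbnl applies, so its values are not directly comparable to the 0.5 threshold; the function name and the sample probabilities are assumptions made for illustration.

```python
from bisect import bisect_right


def ks_statistic(baseline, experiment):
    """Two-sample KS statistic: the maximum gap between the empirical CDFs."""
    a = sorted(baseline)
    b = sorted(experiment)
    # The largest gap is always attained at one of the observed values.
    points = sorted(set(a) | set(b))
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in points
    )


# Hypothetical Probability_Good values for a baseline run and an experiment run.
baseline_prob_good = [0.10, 0.35, 0.60, 0.80, 0.90]
experiment_prob_good = [0.05, 0.10, 0.15, 0.20, 0.95]
statistic = ks_statistic(baseline_prob_good, experiment_prob_good)
```

A statistic of 0 means the two empirical distributions coincide, and a value of 1 means the samples do not overlap at all.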
Shift in Probability
This set of tests examines the shift in probability between the baseline and the experiment. The test is considered unsuccessful if over 10% of the results exhibit a shift greater than 0.1 (10%) when comparing the baseline probabilities with the experimental probabilities.
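The rule described here can be sketched in a few lines of plain Python. The sketch assumes that baseline and experiment rows are aligned one-to-one, which holds when both runs score the same dataset; the function name and the sample values are hypothetical.

```python
def fraction_shifted(baseline, experiment, shift=0.1):
    """Fraction of paired results whose probability moved by more than `shift`."""
    if len(baseline) != len(experiment):
        raise ValueError("baseline and experiment must contain the same results")
    shifted = sum(1 for b, e in zip(baseline, experiment) if abs(e - b) > shift)
    return shifted / len(baseline)


# Hypothetical Probability_Good values for the same ten customers in two runs.
baseline = [0.20, 0.55, 0.70, 0.90, 0.40, 0.65, 0.10, 0.80, 0.30, 0.95]
experiment = [0.22, 0.50, 0.95, 0.88, 0.41, 0.30, 0.12, 0.79, 0.33, 0.94]

fraction = fraction_shifted(baseline, experiment)
# The test is unsuccessful when more than 10% of the results shifted.
test_failed = fraction > 0.10
```

In this example two of the ten results move by more than 0.1, so the fraction is 0.2 and the test would be reported as failed.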

Minimum Performance
This set of tests pertains to the minimum performance threshold of the experimental model. The test is deemed unsuccessful if the performance of any run-level metrics falls below 0.8 (80%).

Because run-level data such as Model_Accuracy are scalar values, we use the scalar statistic name to indicate that the value itself is compared against the assertion.
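The minimum performance check amounts to comparing each scalar directly against the 0.8 threshold. The following sketch shows that logic in plain Python; it is not the dbnl test definition itself, and the function name and the scalar values are hypothetical.

```python
MIN_PERFORMANCE = 0.8


def below_minimum(scalars, minimum=MIN_PERFORMANCE):
    """Return the names of run-level scalars that fall below the minimum."""
    return sorted(name for name, value in scalars.items() if value < minimum)


# Hypothetical run-level scalars for one experimental run.
run_scalars = {
    "Model_Accuracy": 0.86,
    "Model_F1": 0.83,
    "Model_Precision": 0.91,
    "Model_Recall": 0.77,
}
failing = below_minimum(run_scalars)
```

Here only Model_Recall falls below 0.8, so only the recall test would be reported as unsuccessful.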
Relative performance
This set of tests is concerned with the relative performance of the run-level metrics. A test is considered unsuccessful if there's a performance shift greater than 0.1 (10%) for any of the run-level metrics when comparing a baseline run to an experimental run.
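A minimal sketch of this comparison follows. It assumes the shift is measured as the absolute difference between the baseline and experimental scalar, which is one reading of the 0.1 threshold; the function name and the scalar values are hypothetical.

```python
def shifted_metrics(baseline, experiment, max_shift=0.1):
    """Return the change for every metric that moved by more than `max_shift`."""
    changes = {}
    for name, base_value in baseline.items():
        change = experiment[name] - base_value
        if abs(change) > max_shift:
            changes[name] = change
    return changes


# Hypothetical run-level scalars for a baseline run and an experimental run.
baseline_scalars = {"Model_Accuracy": 0.90, "Model_F1": 0.88}
experiment_scalars = {"Model_Accuracy": 0.75, "Model_F1": 0.85}
failed = shifted_metrics(baseline_scalars, experiment_scalars)
```

Unlike the minimum performance check, this comparison can fail even when every metric stays above 0.8, because it looks at the change between runs rather than the absolute level.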

Consistency of account/personal information
This set of tests pertains to the consistency of outcomes associated with either account or personal information. A test is deemed unsuccessful if the difference, calculated using either scaled Kolmogorov-Smirnov or scaled chi-squared statistics, between baseline and experimental outcomes exceeds 0.55.
Consistency of API endpoints
This set of tests focuses on the consistency of outcomes associated with different third-party API endpoints. A test is considered unsuccessful if the difference, calculated using either scaled Kolmogorov-Smirnov or scaled chi-squared statistics, between baseline and experimental outcomes exceeds 0.55.
Consistency of credit request
This set of tests pertains to the consistency of outcomes associated with credit requests. A test is deemed unsuccessful if the difference, calculated using either scaled Kolmogorov-Smirnov or scaled chi-squared statistics, between baseline and experimental outcomes exceeds 0.55.
Performance of the system
This set of tests focuses on the consistency of outcomes related to the system's performance. A test is considered unsuccessful if the difference, calculated using either scaled Kolmogorov-Smirnov or scaled chi-squared statistics, between baseline and experimental outcomes exceeds 0.55.
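For categorical columns such as Purpose, the consistency tests rely on a chi-squared statistic instead of Kolmogorov-Smirnov. The sketch below computes the plain Pearson chi-squared statistic on the category counts of two runs. It does not reproduce the scaling that dbnl applies, so the result is not comparable to the 0.55 threshold; the function name and category values are assumptions made for illustration.

```python
from collections import Counter


def chi_squared_statistic(baseline, experiment):
    """Pearson chi-squared statistic for a 2 x k table of category counts."""
    base_counts = Counter(baseline)
    exp_counts = Counter(experiment)
    n_base, n_exp = len(baseline), len(experiment)
    total = n_base + n_exp

    statistic = 0.0
    for category in set(base_counts) | set(exp_counts):
        pooled = base_counts[category] + exp_counts[category]
        for observed, n in ((base_counts[category], n_base), (exp_counts[category], n_exp)):
            # Expected count if both runs shared the pooled category proportions.
            expected = pooled * n / total
            statistic += (observed - expected) ** 2 / expected
    return statistic


# Hypothetical Purpose values for a baseline run and an experimental run.
baseline_purpose = ["car"] * 5 + ["furniture"] * 5
experiment_purpose = ["car"] * 9 + ["furniture"] * 1
statistic = chi_squared_statistic(baseline_purpose, experiment_purpose)
```

The statistic is 0 when both runs have identical category proportions and grows as the proportions diverge.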
Integration Testing
The objective of the integration test is to detect and alert when the third-party API endpoints begin to exhibit different behavior. This involves conducting system tests against a known baseline every month, or at whatever cadence is appropriate.
Given the aim of ensuring consistency across different third-party API endpoints, the same dataset is used to establish the baseline and conduct the monthly experiment.
After nine months of running the integration test, tests have failed in two of the monthly test sessions: February and June.
February_API_Testing versus Baseline
In the February integration test session, one test assertion fails: the consistency of the length of current employment. This information, linked to employment verification outcomes, is obtained through a third-party API. Upon investigation, it's evident that for February, the third-party API indicates all bank customers as unemployed. This highly improbable outcome is surely worthy of investigation; it does not, however, trigger a failure of any of the tests on the probability scores.
June_API_Testing versus Baseline
During the June integration test, two tests fail. The first failed test indicates that the third-party API, which provides credit history, asserts that all bank customers have lines of credit with other banks. Considering the baseline, this appears highly unlikely.
The second failed test pertains to the shape of the probability distribution. It becomes evident that the error associated with the third-party API is causing the model to generate inconsistent predictions, which could potentially harm consumers.
Conclusion
In conclusion, integration testing makes it possible to detect if and when different APIs start to behave differently from what is expected. Furthermore, the extensive set of tests makes it possible to start determining whether these behavioral changes are likely to cause consumer harm.
Regression Testing
The second scenario examines a live production system, with each run representing one month of production data. Due to the use of live data, there are some differences between the strategy used for integration testing and the one used for regression testing.
First, note that the Evaluation component is no longer part of the system. With live data, the results no longer have a ground truth, so the component is excluded. Consequently, the Target_Worthiness outcome for each result merely serves as a placeholder, so that the run_config from the integration test can be reused.
Second, as the system operates on live data, each of the runs contains a varying number of results, reflective of the circumstances observed during that month.
Lastly, the set of dbnl tests has been trimmed to exclude those requiring a ground truth. Therefore, instead of the original 31 dbnl tests, the regression testing scenario employs only 21. The groups of tests included in the regression scenario are:
Consistency of account/personal information
Consistency of API endpoints
Consistency of credit request
Performance of the system
Unlike the integration testing scenario that relies on a single baseline, this scenario uses multiple baselines to account for various seasonal trends. Each experimental run is thus tested against the same month from the previous year, the preceding month, and a common baseline.
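The choice of baselines for a given month can be sketched with the standard library. The function below returns the two calendar-based baselines for an experimental run; the common baseline is a fixed run and is therefore not computed here. The function name and dictionary keys are hypothetical and only serve this example.

```python
from datetime import date


def baseline_months(experiment_month):
    """Return the calendar-based baseline months for a run, given the first day of its month."""
    year, month = experiment_month.year, experiment_month.month
    same_month_previous_year = date(year - 1, month, 1)
    # January wraps around to December of the previous year.
    if month == 1:
        preceding_month = date(year - 1, 12, 1)
    else:
        preceding_month = date(year, month - 1, 1)
    return {
        "same_month_previous_year": same_month_previous_year,
        "preceding_month": preceding_month,
    }


comparisons = baseline_months(date(2024, 3, 1))
```

For March 2024 this yields March 2023 and February 2024, which, together with the common baseline, gives the three test sessions discussed in this section.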
The objective of the regression test is to verify the consistency of different outcomes introduced into the system through user inputs or third-party APIs.
Upon reviewing the test sessions for the regression tests, it is observed that several tests fail in two of the test sessions. The test sessions with failing tests are those comparing March 2024 to February 2024, and March 2024 to the known baseline. It is, however, noteworthy that all tests pass when comparing March 2024 to March 2023.
March - 2024 versus February - 2024
When comparing March 2024 to February 2024, four different tests fail.
The first test to fail examines the age distribution of the group of bank customers who submitted a credit request in March, against the age distribution of the baseline population. The consistency test indicates a change in the age distribution. However, upon visual inspection, this change in distributions could be considered acceptable. If this were the case, the threshold could be adjusted accordingly.
The second failed test pertains to the gender and marital status of the bank customers. A noticeably larger share of the customers who submitted requests in March were single males, compared to the baseline population.
The third failed test pertains to the credit amount of customers submitting credit requests in March. It reveals a difference between the March population and the baseline population. However, upon visual inspection, this test failure should not be a cause for concern.
The fourth failed test is associated with the purpose of the credit requests. It's evident that there's a significant difference in the credit purpose between the March 2024 customers and the baseline population.
March - 2024 versus Baseline
When comparing March 2024 to the Baseline, we observe the same four tests failing as mentioned above. These tests pertain to the age distribution, gender and marital status, credit amount, and purpose of the credit requests of the bank customers.
Therefore, the same conclusions from the test sessions can be drawn.
Conclusion
In conclusion, testing against different baselines at different time intervals makes it easy to determine whether or not a behavioral change is expected.

