Can machine learning algorithms help estimate the risk of financial decisions?

Daegi Kang
15 min read · Oct 29, 2021
Loan Prediction Dataset ML Project

Business Understanding

The primary source of a bank’s operating income is the interest generated by its loans. Banks always deploy risk management to control certain types of risk, such as credit risk, interest rate risk, and liquidity risk. Since interest rate risk is largely outside a bank’s direct control, banks need to keep credit risk and liquidity risk under careful supervision. Credit risk refers to a counterparty failing to repay loan interest on time or defaulting on repayment altogether, while liquidity risk may arise from uncollectable debt. Lending to ‘risky’ applicants is also the largest source of financial loss, called credit loss (Gupta, 2020). If a bank cannot correctly decide whether to approve a loan, it harms the bank’s income and compounds the loss. Nowadays, top banks in the US like JPMorgan and Bank of America already use machine learning to detect and prevent risk, and various banks in India are interested in applying machine learning in different fields (Moccia, 2020). Our business goal is to decrease the default rate among the bank’s customers. Building a risk-detection model for loan applicants is likely to decrease the associated credit risk. To build the most applicable model for the bank, our team takes the following steps:

First, we explore the possible causes of customer default, including a lack of willingness to pay, fluctuating interest rates, competing banks offering lower interest rates, inadequate monitoring of customer information, adverse selection, and the death of customers. We suggest the bank target younger customers to reduce the mortality problem. For inadequate monitoring and adverse selection, we encourage the bank to develop procedures to verify the information customers provide. For the problems of competitors, interest rates, and lack of willingness to pay, our suggestion is to offer discounts on interest rates to encourage customers to repay on time.

Second, the bank needs to acquire the right customers. If a bank makes a wrong prediction about whether a customer will default, it incurs two types of losses. The first is lost profit: if the bank predicts that a customer will default but the customer actually does not (a false positive), the bank loses business. If the bank predicts that a customer will not default but the customer does (a false negative), the bank loses principal and the interest it would have earned. We encourage the bank to segment past applicants and identify common characteristics shared by applicants in each segment.

Finally, based on the applicant profiles the Indian bank collected, including demographics and whether customers defaulted in the past, we can apply classification models to predict customers’ default behavior. If data analysis can separate these risky applicants from the others, risky loans can be reduced, thereby cutting credit losses and increasing operating profit by acquiring customers who do not default.

Data Understanding

We obtained the dataset from Kaggle; it contains historical customer behavior, with all values recorded at the time of the loan application (Surana, 2021). The training set has 252,000 observations and 13 features in total: 12 independent variables and 1 dependent variable. The independent variables include Id, Profession, Experience, CurrentJobYears, Age, Married, City, State, Income, House_ownership, Car_ownership, and CurrentHouseYears. City, State, Married, House_ownership, Car_ownership, and Profession are categorical variables. The target variable is Risk_Flag, a categorical variable indicating whether the applicant defaulted in the past. Scouring through the data, we found no missing values. Moving on to exploration, the categorical features cover the city and state the applicant lives in, marital status, house and car ownership status, and profession. The applicants come from 317 cities and 29 states and hold 51 types of professions. Here are several findings we discovered:

Correlation matrix

Firstly, we visualized a correlation matrix in R. As we can see from the matrix, the variables do not show strong relationships with each other, except for Experience and Current_Job_Yrs.

Pie chart for Risk Flag

Secondly, we looked at the pie chart of Risk_Flag for the training data. As we can see from the pie chart, about 12% of customers defaulted on a loan. However, there is a gap between this and data from the Board of Governors of the Federal Reserve System (US), where the average default rate fluctuates between roughly 1.45% and 7.5%. In our case, 88% of customers did not default, which could lead to misleading analysis results: we could end up with models that make uninformative predictions.

Distribution of variables

Thirdly, we noticed that marital status, house ownership, and car ownership are imbalanced as well. Of the more than two hundred thousand customers, only 11% are married. We also found that 92% of customers rent a house, while only 5% own one. Additionally, 70% of customers do not own a car, while 30% do.

Plotting the frequency distribution of applicants’ professions, physicians, statisticians, and web designers cover a slightly larger share of the sample, but the percentage of applicants in each profession is nearly the same, at about 2% of the whole population per profession.
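To make these exploration checks easy to reproduce, here is a minimal sketch in Python with pandas. The file name (“Training Data.csv”) and column names are assumptions based on the Kaggle dataset described above and may need adjusting; the original analysis was done in R.

```python
import pandas as pd

# Assumed file and column names for the Kaggle loan dataset; adjust as needed.
df = pd.read_csv("Training Data.csv")

# Missing values per column (the article reports none).
print(df.isna().sum())

# Cardinality of the categorical features (317 cities, 29 states, 51 professions).
for col in ["CITY", "STATE", "Profession", "Married/Single",
            "House_Ownership", "Car_Ownership"]:
    print(col, df[col].nunique())

# Class balance of the target (about 12% defaults).
print(df["Risk_Flag"].value_counts(normalize=True))

# Correlation matrix of the continuous features.
numeric_cols = ["Income", "Age", "Experience",
                "CURRENT_JOB_YRS", "CURRENT_HOUSE_YRS"]
print(df[numeric_cols].corr().round(2))
```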

Clustering (Principal Component Analysis)

PCA using R programming

Our dataset has 5 continuous variables that can be used as features, so we perform principal component analysis on Income, Age, Experience, Current job years, and Current house years.

By using PCA, we hope to remove correlated features in our dataset. After running the model, we evaluate these principal components by examining the variance that can be explained by them.

From the summary, we can see that PC1 explains 35% of the total variance, PC2 explains 20%, PC3 explains 20%, and PC4 explains 19%. Therefore, by knowing a sample’s position relative to PC1, PC2, PC3, and PC4, we can get a generally accurate view of it, since these 4 principal components together explain 94% of the total variance (see Exhibit-2A & Exhibit-2B).
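The PCA above was run in R; the sketch below reproduces the same idea in Python with scikit-learn on the five continuous variables. The file name, column names, and the standardization step are assumptions.

```python
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("Training Data.csv")  # assumed file name

# The five continuous variables used for PCA.
features = ["Income", "Age", "Experience", "CURRENT_JOB_YRS", "CURRENT_HOUSE_YRS"]

# Standardize so each variable contributes on the same scale (assumed;
# the article does not state whether the R PCA used scaled data).
X = StandardScaler().fit_transform(df[features])

pca = PCA(n_components=5).fit(X)

# Proportion of total variance explained by each principal component.
print(pca.explained_variance_ratio_.round(2))

# Loadings: one row per component, one column per original feature.
loadings = pd.DataFrame(pca.components_, columns=features,
                        index=[f"PC{i + 1}" for i in range(5)])
print(loadings.round(2))
```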

PCA using R programming

Now, we look at the loadings PCA computes for each feature to further investigate the interpretation behind each principal component (see Exhibit-3). For each component, we display the top features responsible for 3/5 of the squared norm of the loadings. The first component has a large positive association with a customer’s Experience and Current job years, so we can interpret this component as a primary measurement of a customer’s professional experience.

The second component has a large positive association with Income and Current house years, so it primarily measures a customer with a higher income level and household stability. The third component has a large negative association with Income and a large positive association with Current house years, so it is primarily a measurement of a customer with a lower income level but household stability. The fourth component has a large negative association with Age, so we can interpret it as a primary measurement of a customer’s age, especially younger customers. The features in our dataset are not highly correlated in the first place, so no feature can be excluded by PCA in this case, and we retain all of them for further modeling.

Data Preparation (Cleaning)

First, we check whether any column has missing values; the result shows our dataset contains no null values. Second, we make the following three changes to the string variables: (a) change Married/Single to a dummy variable where 1 represents married and 0 represents single; (b) create two dummy variables for House_Ownership, one for owned and one for rented; (c) change Car_Ownership from “owned/not owned” to a dummy variable (1 = owned, 0 = not owned).
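A minimal sketch of these three recodings in Python with pandas; the exact category labels in the raw file (for example “married”, “rented”, “yes”) are assumptions and may differ from the actual CSV.

```python
import pandas as pd

df = pd.read_csv("Training Data.csv")  # assumed file name

# (a) Married/Single -> 1 for married, 0 for single (labels assumed).
df["Married"] = (df["Married/Single"] == "married").astype(int)

# (b) House_Ownership -> two dummies, one for owned and one for rented.
df["House_Owned"] = (df["House_Ownership"] == "owned").astype(int)
df["House_Rented"] = (df["House_Ownership"] == "rented").astype(int)

# (c) Car_Ownership -> 1 for owned, 0 for not owned (label may be "yes"/"no").
df["Car_Owned"] = (df["Car_Ownership"] == "yes").astype(int)

df = df.drop(columns=["Married/Single", "House_Ownership", "Car_Ownership"])
```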

Next, we checked whether there were any mismatched data types; all attributes have the correct data type, so no further action is needed for data types in the cleaning process.

Additionally, by digging into the dataset we found some unreasonable records, such as people starting to work at the age of 7, while the legal working age in India is 14. We therefore assume people start working at 14 at the earliest and drop records that violate this. Moreover, we removed the variable “Id” from the dataset since it provides no demographic information about customers. Finally, we removed duplicated rows. Since the default group makes up only a small portion of the dataset, running classification on the imbalanced data would not produce significant results, so we oversample the default group and undersample the non-default group to reach a 1:1 ratio, which makes further analysis more applicable.
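A sketch of the remaining cleaning and rebalancing steps, again with assumed file and column names; the resampling below simply draws both classes to a common size to illustrate reaching a 1:1 ratio.

```python
import pandas as pd

df = pd.read_csv("Training Data.csv")  # assumed file name

# Keep only applicants who started working at age 14 or later
# (Age - Experience >= 14, per the legal-working-age assumption).
df = df[df["Age"] - df["Experience"] >= 14]

# Drop the uninformative Id column and any duplicated rows.
df = df.drop(columns=["Id"]).drop_duplicates()

# Rebalance to 1:1 - oversample the default group, undersample the non-default group.
default = df[df["Risk_Flag"] == 1]
non_default = df[df["Risk_Flag"] == 0]
target_size = (len(default) + len(non_default)) // 2

balanced = pd.concat([
    default.sample(n=target_size, replace=True, random_state=42),
    non_default.sample(n=target_size, random_state=42),
])
print(balanced["Risk_Flag"].value_counts())
```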

Modeling

1. Logistic Regression and Lasso on Logistic Regression:

We have chosen to exclude models with interactions since the correlations among the variables are low, which means the interaction effects are not strong in this case. Since the Y values in our case are 0 and 1, we use the logistic function to predict the binary outcome. Here we select two models: Logistic Regression and Lasso on Logistic Regression. Lasso on Logistic Regression is used to enhance the prediction accuracy and interpretability of the resulting statistical model.

CV Lasso using R programming

By plotting the CV Lasso graph, we add a penalty term with the minimum lambda to the Logistic Regression. The pros of Logistic Regression: it is easier to interpret and implement than other models, and the estimated coefficients indicate the importance of each variable. Since there are 317 categories for City, it is a demanding task to decide which variables should remain in the logistic regression; Lasso helps eliminate insignificant variables by shrinking their coefficients toward zero.

The cons of Logistic Regression: it relies on a linearity assumption between the independent variables and the log-odds of the outcome, and the weak correlations in our data suggest this assumption may not hold well.
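A rough Python counterpart to the logistic regression and CV Lasso described above (the article’s version was fit in R, as shown in the graph). The one-hot encoding, file name, column names, and solver settings are assumptions.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("Training Data.csv")  # assumed file name

# One-hot encode the categorical columns (column names are assumptions).
X = pd.get_dummies(
    df.drop(columns=["Id", "Risk_Flag"]),
    columns=["Profession", "CITY", "STATE", "Married/Single",
             "House_Ownership", "Car_Ownership"])
y = df["Risk_Flag"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Plain logistic regression.
logit = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
logit.fit(X_train, y_train)

# L1-penalized ("lasso") logistic regression with a cross-validated penalty;
# most of the 317 City dummies should get coefficients shrunk toward zero.
lasso = make_pipeline(
    StandardScaler(),
    LogisticRegressionCV(penalty="l1", solver="saga", Cs=10, cv=5, max_iter=5000))
lasso.fit(X_train, y_train)
print("Lasso OOS accuracy:", lasso.score(X_test, y_test))
```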

2. Decision tree Classification

We tried using RStudio to model the classification. Since only 12.3% of customers’ records are associated with past default behavior, this dataset is imbalanced, and the Decision Tree model in RStudio produces only a root node, which is not useful for prediction. Also, logistic regression models cannot capture interactions between features well, so classification trees are a better choice. For classification, we build three supervised machine learning models. First, because there are three categorical columns in our dataset (Profession, City, and State), we encode them to numerical values from 0 to k-1. Because a classification tree does not treat these codes as magnitudes the way a regression does, this method is applicable: the tree applies cut-off values to determine the average outcome of each node.

Decision Tree using python

The decision tree takes three steps to generate the outcome: select the best attribute to split the records, take that attribute as a node and break the dataset into smaller subsets, and repeat until no more data is left (Navlani, 2018). We use cross-validation to find the best parameters for the decision tree; the optimal depth from root to leaf is 13 with the Gini criterion. Analyzing the resulting tree, the most important feature is City. This implies that if applicants come from certain cities, the bank should pay attention: they may have a higher probability of default than customers from other areas.

The next most important features are Profession and current house years. However, the size of the default group at these nodes is relatively small compared to the non-default group, so these features may introduce some error into the model. The bank manager could use the decision tree’s nodes as references when evaluating decisions. Since we can see explicitly how the tree arrives at a result, it is easy to interpret, and the chart mimics human-level thinking; even people who do not understand the algorithm behind the classifier can get insights from it.

The cons of the decision tree are that it is sensitive to noisy data, and since we have a deep tree, it may overfit the training set and take a long time to run. Its biggest shortcoming is that it can be heavily biased by imbalanced datasets; even though we oversampled the data, it still does not produce satisfactory performance. Alternative tree models include XGBoost and CatBoost.
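A sketch of the decision-tree setup described above: label-encode the categorical columns to 0..k-1, then cross-validate the tree depth with the Gini criterion. The depth range, file name, and column names are assumptions.

```python
import pandas as pd
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv("Training Data.csv")  # assumed file name

# Encode the categorical columns to integers 0..k-1.
for col in ["Profession", "CITY", "STATE", "Married/Single",
            "House_Ownership", "Car_Ownership"]:
    df[col] = LabelEncoder().fit_transform(df[col])

X = df.drop(columns=["Id", "Risk_Flag"])
y = df["Risk_Flag"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Cross-validate the maximum depth with the Gini criterion
# (the article reports an optimal depth of 13).
grid = GridSearchCV(
    DecisionTreeClassifier(criterion="gini", random_state=42),
    param_grid={"max_depth": list(range(3, 21))},
    cv=5, scoring="accuracy")
grid.fit(X_train, y_train)

best_tree = grid.best_estimator_
print("Best depth:", grid.best_params_,
      "OOS accuracy:", best_tree.score(X_test, y_test))

# Feature importances hint at which attributes drive the splits
# (the article highlights City, Profession, and current house years).
print(pd.Series(best_tree.feature_importances_, index=X.columns)
      .sort_values(ascending=False).head())
```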

3. K-Nearest Neighbor

In our case, KNN looks for similar patterns in the training data across all features, and the outcome is decided by the majority vote of the 3 nearest neighbors of each point. KNN is faster to train than other classification methods because the rule is simple: compute the Euclidean distance between points. It can also handle nonlinear patterns because it is instance-based (Navlani, 2018). But selecting the optimal K for the dataset is difficult, even though we run cross-validation over a range of K values and choose the one with the highest accuracy; the smaller K is, the more the model overfits. Also, KNN is slow at prediction time because it must compute distances to all the stored training points. If the test data is not similar to the training data, prediction accuracy will certainly decrease. Alternative choices would be Random Forest or a Support Vector Machine.
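A sketch of the KNN setup with cross-validation over a range of K; feature scaling is added here because Euclidean distance is scale-sensitive (an assumption, not stated in the article), and the candidate K values are illustrative.

```python
import pandas as pd
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import LabelEncoder, StandardScaler

df = pd.read_csv("Training Data.csv")  # assumed file name
for col in ["Profession", "CITY", "STATE", "Married/Single",
            "House_Ownership", "Car_Ownership"]:
    df[col] = LabelEncoder().fit_transform(df[col])

X = df.drop(columns=["Id", "Risk_Flag"])
y = df["Risk_Flag"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Cross-validate K; smaller K tends to overfit, larger K to underfit.
knn = make_pipeline(StandardScaler(), KNeighborsClassifier())
grid = GridSearchCV(
    knn, {"kneighborsclassifier__n_neighbors": [3, 5, 7, 9, 11]},
    cv=5, scoring="accuracy")
grid.fit(X_train, y_train)
print("Best K:", grid.best_params_,
      "OOS accuracy:", grid.score(X_test, y_test))
```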

4. CatBoost

This is a gradient boosting model that gives more weight to wrong predictions. Its process is similar to a decision tree, but it evaluates candidate trees together rather than recursively and stepwise. The model is easy to use, and it does not take long to handle a huge dataset. It is also very useful for datasets with many categorical columns, which fits our data perfectly. Its encoding is not just assigning numbers in alphabetical order: CatBoost takes a random permutation first, then runs regression or classification, and finally converts the categories to numerical values (transforming categorical features into numerical features). CatBoost also provides functions for plotting predictions, trees, and feature importances; with the tree plot, the bank could base its decisions on those references.

Catboost using python

However, the algorithm is a black box: for a categorical column with many levels, it is hard to interpret the numerical values appearing in the tree (the CatBoost decision tree is in Exhibit-4 and feature importances are in Exhibit-5). Alternative methods include LightGBM and XGBoost; XGBoost’s encoding would allow us to invert the transformation and interpret the decision tree.
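A sketch of a CatBoost fit where the categorical columns are passed via cat_features so CatBoost applies its own permutation-based target encoding internally. The file name, column names, and hyperparameter values are assumptions, not the article’s settings.

```python
import pandas as pd
from catboost import CatBoostClassifier
from sklearn.model_selection import train_test_split

df = pd.read_csv("Training Data.csv")  # assumed file name
X = df.drop(columns=["Id", "Risk_Flag"])
y = df["Risk_Flag"]

cat_features = ["Profession", "CITY", "STATE", "Married/Single",
                "House_Ownership", "Car_Ownership"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# CatBoost encodes the categorical columns internally, so no manual
# dummy or label encoding is needed.
model = CatBoostClassifier(iterations=500, depth=6, learning_rate=0.1,
                           eval_metric="Accuracy", random_seed=42, verbose=100)
model.fit(X_train, y_train, cat_features=cat_features,
          eval_set=(X_test, y_test))

# Feature importance, as referenced in the exhibits.
print(pd.Series(model.get_feature_importance(), index=X.columns)
      .sort_values(ascending=False))
```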

Evaluations

Evaluation results

As we can see from the Average OOS R-squared column, the OOS R-squared values for Logistic Regression, Lasso on Logistic Regression, and Lasso (theory) on Logistic Regression are approximately 0, which means the fit of these models is quite poor. Since the variables in our dataset are weakly correlated and many categorical features are present, OOS R-squared is not a good way to evaluate performance. Since our goal is to decrease the default rate, we want the final model to predict default behavior as well as possible, so we use OOS accuracy, a measure of how accurately an algorithm predicts outcomes for previously unseen data. The logistic models all have poor AUC scores of around 0.5, which means they cannot distinguish between the positive and negative classes; an AUC of 0.5 is equivalent to random guessing, so deploying any of the logistic models to predict default behavior would not be optimal. Evaluating the OOS accuracy of the three classification models, CatBoost has the highest OOS accuracy and K-NN the second highest. These two models each have their own advantages and drawbacks; for deployment, the bank should consider which one fits the company’s structure.
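For any of the fitted classifiers above, OOS accuracy, AUC, and the confusion matrix can be computed with a small helper like the one below; the model and held-out split are placeholders for whichever fit is being evaluated.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score

def evaluate(model, X_test, y_test):
    """OOS accuracy, AUC, and confusion matrix for a fitted binary classifier."""
    y_pred = model.predict(X_test)
    y_prob = model.predict_proba(X_test)[:, 1]  # predicted probability of default
    return {
        "oos_accuracy": accuracy_score(y_test, y_pred),
        "auc": roc_auc_score(y_test, y_prob),  # ~0.5 means no better than guessing
        "confusion_matrix": confusion_matrix(y_test, y_pred),
    }
```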

Deployment

Clustering Deployment

Before deploying the classification and predictive models, we planned to cluster the data. The PCA results suggest four main customer segments: the first with a high level of experience and current job years; the second with high income levels and a longer time in their current house; the third clustered around low income levels with a longer time in their current house; and the fourth skewing toward younger customers. Once these clusters are formed, the bank can use them for further predictive modeling and create targeted marketing strategies for each cluster. For instance, text and social media campaigns can be used on cluster four, while premium techniques such as low-rate loans on premium brands can be offered to cluster one.

Prediction Modeling Deployment

Since KNN is the simpler predictive model, the bank can use it to address various causes of default. For example, to offset the effect of death on default risk, after matching prospective customers within the model, the bank should reach younger audiences through the “Text” channel, since texting is the most effective way to communicate with the younger population. Even though we achieved high OOS accuracy with KNN, that was still in the modeling environment. Once the company deploys the model on real-world data, it might degrade, meaning predictions become less accurate due to factors such as data drift and environment changes. Hence, we recommend automating model improvement so the bank can offset the risk of inaccurate predictions and, ultimately, losing money. Also, since we trained and tested on only a small part of the company’s data, the volume and frequency of data intake during deployment can be challenging for data scientists and engineers. And because KNN’s prediction step requires computing distances against all the stored training data, we recommend the bank deploy this model well before it starts marketing. The CatBoost model has the highest OOS accuracy and is extremely fast to run, but the bank should make sure it has technically capable employees to deploy and analyze the results, since it is a fairly complex model. After the bank chooses which model to deploy, since every model comes with a confusion matrix, the bank can score and rank each customer, then multiply its cost-benefit matrix with the confusion matrix to calculate the expected profit for that customer based on its assumptions.
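A sketch of the expected-profit calculation mentioned at the end: convert the confusion matrix into outcome probabilities, multiply element-wise by a cost-benefit matrix, and sum. The cost-benefit numbers below are placeholders the bank would replace with its own assumptions.

```python
import numpy as np

def expected_profit(confusion, cost_benefit):
    """Expected profit per customer, given a confusion matrix and a
    cost-benefit matrix laid out the same way (rows: actual, cols: predicted)."""
    rates = confusion / confusion.sum()  # joint probability of each outcome
    return float((rates * cost_benefit).sum())

# Placeholder numbers for illustration only.
confusion = np.array([[800, 50],   # actual non-default: TN, FP
                      [ 60, 90]])  # actual default:     FN, TP
cost_benefit = np.array([[ 100, -20],   # profit from a good loan, lost business
                         [-500,   0]])  # loss of principal and interest
print(expected_profit(confusion, cost_benefit))
```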

Bibliography

Code

https://github.com/richardglankang/Default-Prediction

Authors
