Credit Risk Use Case
The credit risk use case enables users to learn about and follow end-to-end credit risk model development involving machine learning systems. It also helps users gain insight into how machine learning models applied to loan performance data operate and arrive at a given output.
Data Processing & Data Quality
Step 1: Initial Data Processing
In supervised predictive modelling, algorithms learn by mapping input factors to a target variable. We cannot fit and evaluate machine learning models on raw data; rather, datasets have to be processed and transformed to fit the requirements of individual machine learning models. Even more importantly, we need to identify a representation of the data that best exposes the underlying patterns between the input and output features, given a specific machine learning model. As a result, one of the main and most challenging tasks of any machine learning project is the data pre-processing step. Specifically, there is an interplay between the data and the choice of algorithms:
-
Some algorithms cannot deal with missing observations
-
Some algorithms assume that each variable (in some cases including the target feature) has a specific probability distribution
-
Some algorithms are negatively impacted if two or more input variables are highly correlated.
-
Some algorithms are known to perform worse if there are input variables that are irrelevant or redundant to the target variable
-
Some algorithms have very few requirements concerning the input data, but in turn, may require many examples in order to learn how to make good predictions
In this project, we developed an automated data pre-processing step. Figure 1 summarizes the actions performed within this step.
Figure 1. Automated data pre-processing steps
Step 2: Data Quality
Following the initial data pre-processing step, we continue with the feature selection process.
-
Dealing with missing features:
-
In the initial data pre-processing step, we added columns to the data frame for all numerical features that have at least one missing value. The objective of this step is to investigate whether the existence of missing values is associated with our target feature (the status of the loan, i.e. whether the loan defaulted or not). For this purpose, we run a Chi-Square test. The null hypothesis of the Chi-Square test is that there is no relationship between the considered variables, whereas the alternative hypothesis is that there is an association between the two variables.
-
Results: For all 112 variables the p-value was > 0.05, so we fail to reject the null hypothesis and find no evidence that the existence of missing observations is associated with our target.
-
Once we have reasonable evidence that the missing values are not associated with our target, we carry out a 2-step process to deal with columns that are missing information for certain loan contracts
-
Step 1: we drop columns with over 50% missing observations
-
Step 2: we apply the row-wise complete cases function (i.e. we drop all loan contracts for which at least one feature contains a missing value)
-
Dealing with highly correlated features: we calculate a correlation matrix and remove features that are highly correlated (formally: correlation coefficient > 0.8). To choose which variable to keep, we apply the root-stem approach: for each pair of highly correlated features, we subjectively decide which variable is the "root" and should be kept.
-
Factor screening: we carry out an in-depth analysis of the factor variables included in the dataset. All features for which we observe no variability are removed from further analysis (e.g. hardship_type has only one level: interest only-3 months deferral).
Figure 2. Feature selection (step 1)
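The data quality steps described above can be sketched in a few lines. This is a minimal illustration and not the project's actual pipeline: the function names are hypothetical, the Chi-Square test is written out by hand for a 2x2 table (missing flag vs. default flag) without continuity correction, the correlation filter uses the absolute coefficient (an assumption; the text only states "> 0.8"), and the subjective "root" choice is passed in as a set of preferred names.

```python
import math
import numpy as np

def chi2_missing_vs_target(is_missing, target):
    """Pearson chi-square test on the 2x2 table of a missingness flag vs. a
    binary target. Assumes every row and column of the table is non-empty."""
    is_missing = np.asarray(is_missing, dtype=bool)
    target = np.asarray(target, dtype=bool)
    observed = np.array([
        [np.sum(~is_missing & ~target), np.sum(~is_missing & target)],
        [np.sum(is_missing & ~target), np.sum(is_missing & target)],
    ], dtype=float)
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()
    stat = float(np.sum((observed - expected) ** 2 / expected))
    # With 1 degree of freedom the chi-square tail probability is erfc(sqrt(stat / 2)).
    return stat, math.erfc(math.sqrt(stat / 2.0))

def drop_sparse_columns(X, names, max_missing=0.5):
    """Step 1: drop columns with more than `max_missing` share of NaN values."""
    keep = np.isnan(X).mean(axis=0) <= max_missing
    return X[:, keep], [n for n, k in zip(names, keep) if k]

def complete_cases(X, y):
    """Step 2: keep only rows (loan contracts) without any missing value."""
    rows = ~np.isnan(X).any(axis=1)
    return X[rows], y[rows]

def drop_correlated(X, names, roots, threshold=0.8):
    """For each pair with |corr| > threshold keep the column named in `roots`;
    if neither (or both) is a root, keep the earlier column (an assumption)."""
    corr = np.corrcoef(X, rowvar=False)
    dropped = set()
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if i in dropped or j in dropped:
                continue
            if abs(corr[i, j]) > threshold:
                prefer_j = names[j] in roots and names[i] not in roots
                dropped.add(i if prefer_j else j)
    keep = [k for k in range(len(names)) if k not in dropped]
    return X[:, keep], [names[k] for k in keep]
```

In practice a statistics library would normally supply the test; the hand-written version is used here only to keep the sketch self-contained.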
Step 3: Feature Selection
For all data-driven models, feature selection can significantly affect model performance. Well-designed features increase models' flexibility and robustness. The literature distinguishes between several different techniques for feature selection:
-
Embedded methods: where feature selection is an integral part of the ML algorithm
-
Filter methods: where each feature is assigned a score based on a specific statistical procedure
-
Wrapper methods: where we compare the predictive utility of ML models that are trained on different coalitions of features
-
Hybrid methods: where we combine at least two of the above-mentioned techniques
In the context of this project, we applied the Boruta algorithm, which builds on the random forest approach and adds further randomness to the system. The main idea behind the Boruta algorithm is quite straightforward: we make a randomised copy of the system, merge the copy with the original input features and build the classifier for this extended system. To assess the importance of a variable in the original system, we compare it with that of the randomised variables. Only variables whose importance is higher than that of the randomised variables are classified as important (Figure 3).
The applied procedure is as follows (Kursa et al. 2010):
-
We build an extended system, with replicated variables which are then randomly permuted. As a result, all correlations between the replicated and original variables are random by design;
-
We perform several random forest runs
-
For each run, we compute the importance of all attributes.
-
The attribute is deemed important for a single run if its importance is higher than the maximal importance of all randomised attributes.
-
We perform a statistical test for all attributes. The null hypothesis is that importance of the variable is equal to the maximal importance of the random attributes (MIRA). The test is a two-sided equality test – the hypothesis may be rejected either when importance of the attribute is significantly higher or significantly lower than MIRA. For each attribute we count how many times the importance of the attribute was higher than MIRA (a hit is recorded for the variable).
-
Variables which are deemed unimportant are removed from the information system, usually with their randomised mirror pair.
-
The procedure is performed for a predefined number of iterations, or until all attributes are either rejected or conclusively deemed important, whichever comes first.
Figure 3. Visual representation of the steps of the Boruta algorithm. Source: Hasan et al.
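The shadow-feature comparison at the core of the procedure above can be illustrated with a simplified sketch. It is not the Boruta implementation: the random forest importance is replaced by a stand-in (absolute correlation with the target) to keep the example dependency-free, the iterative removal of rejected variables is omitted, and the two-sided test is written as an exact binomial test against a hit probability of 0.5, which is our assumption about how the hit counts are evaluated. All function names are hypothetical.

```python
import math
import numpy as np

def importance(X, y):
    """Stand-in importance score: |Pearson correlation| with the target.
    Boruta uses random forest importance here; columns must not be constant."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    return np.abs(Xc.T @ yc) / np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())

def shadow_hits(X, y, n_runs=50, seed=0):
    """Count, per original feature, how often it beats the best shadow feature."""
    rng = np.random.default_rng(seed)
    n_features = X.shape[1]
    hits = np.zeros(n_features, dtype=int)
    for _ in range(n_runs):
        # Shadow copies: each column is permuted, so any link to y is random.
        shadows = np.column_stack([rng.permutation(X[:, j]) for j in range(n_features)])
        scores = importance(np.hstack([X, shadows]), y)
        mira = scores[n_features:].max()  # maximal importance of random attributes
        hits += scores[:n_features] > mira
    return hits

def binom_two_sided_p(hits, n_runs):
    """Exact two-sided p-value under H0: a hit occurs with probability 0.5."""
    smaller = min(hits, n_runs - hits)
    tail = sum(math.comb(n_runs, k) for k in range(smaller + 1)) / 2 ** n_runs
    return min(1.0, 2 * tail)

def decide(hits, n_runs, alpha=0.01):
    """Label a feature from its hit count; `alpha` is an illustrative choice."""
    if binom_two_sided_p(hits, n_runs) >= alpha:
        return "tentative"
    return "confirmed" if hits > n_runs / 2 else "rejected"
```

A feature that beats the best shadow in nearly every run ends up "confirmed", one that almost never does is "rejected", and anything in between stays "tentative" until more runs are available.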
ML & DL Models
In this project, we rely on the AutoML interface from the h2o package which is a comprehensive framework for training and testing ML models in both R and Python. The algorithms we train and test are as follows:
-
DRF (This includes both the Distributed Random Forest (DRF) and Extremely Randomised Trees (XRT) models.)
-
GLM (Generalised Linear Model with regularisation)
-
XGBoost (XGBoost GBM)
-
GBM (H2O GBM)
-
DeepLearning (Fully-connected multi-layer artificial neural network)
-
StackedEnsemble (Stacked Ensembles, includes an ensemble of all the base models and ensembles using subsets of the base models)
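As a rough sketch of how such an AutoML run might be set up with the h2o Python package, consider the function below. The file path, target column name, split ratio and model budget are hypothetical placeholders, not the settings used in this project; the imports are placed inside the function so that the definition itself does not require a running H2O cluster.

```python
# Algorithm families listed above, using the names accepted by H2O AutoML.
INCLUDE_ALGOS = ["DRF", "GLM", "XGBoost", "GBM", "DeepLearning", "StackedEnsemble"]

def train_automl(csv_path, target, max_models=20, seed=1):
    """Train an H2O AutoML run on a CSV file (all arguments are illustrative)."""
    import h2o
    from h2o.automl import H2OAutoML

    h2o.init()
    frame = h2o.import_file(csv_path)
    frame[target] = frame[target].asfactor()  # binary target -> classification
    train, test = frame.split_frame(ratios=[0.8], seed=seed)
    features = [c for c in frame.columns if c != target]

    aml = H2OAutoML(max_models=max_models, seed=seed, include_algos=INCLUDE_ALGOS)
    aml.train(x=features, y=target, training_frame=train, leaderboard_frame=test)
    return aml  # aml.leaderboard ranks the trained models, aml.leader is the best one
```

A call such as `train_automl("loans.csv", "loan_status")` (both names hypothetical) would return the AutoML object whose leaderboard can then be inspected.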
Below we provide a brief overview of the different models employed.
Random Forest & Extremely Randomized Trees
A random forest algorithm is a classification and regression model that builds decision trees on different samples and takes their majority vote for a classification task and the average if the task in question is a regression one. The algorithm was originally proposed by Breiman (2001) as an improvement of the Tree Bagging approach. In order to build a tree, the algorithm uses a bootstrap replica of the learning sample, and the CART algorithm together with the modification used in the Random Subspace method. At each test node the optimal split is derived by searching a random subset of size K of candidate attributes (selected without replacement from the candidate attributes).
The extra-trees algorithm was proposed by Geurts (2005). It builds an ensemble of unpruned classification or regression trees according to the classical top-down procedure. Its two main differences from other tree-based ensemble methods are that it splits nodes by choosing cut-points fully at random and that it uses the whole learning sample (rather than a bootstrap replica) to grow the trees.
Extreme Gradient Boosting
The XGBoost framework was proposed by Chen and Guestrin (2016). As stated by the authors themselves,
« …the most important factor behind the success of XGBoost is its scalability in all scenarios. The system runs more than ten times faster than existing popular solutions on a single machine and scales to billions of examples in distributed or memory-limited settings. The scalability of XGBoost is due to several important systems and algorithmic optimizations. These innovations include: a novel tree learning algorithm is for handling sparse data; a theoretically justified weighted quantile sketch procedure enables handling instance weights in approximate tree learning. Parallel and distributed computing makes learning faster which enables quicker model exploration. More importantly, XGBoost exploits out-of-core computation and enables data scientists to process hundred millions of examples on a desktop. Finally, it is even more exciting to combine these techniques to make an end-to-end system that scales to even larger data with the least amount of cluster resources. »
Generalized Linear Models (GLM)
GLM covers a larger class of models that were initially made popular by McCullagh and Nelder (1982). In this class of approaches, the response variable y_i is assumed to follow an exponential family distribution with mean μ_i, which is assumed to be some (often nonlinear) function of x_i^T β. Some would call these “nonlinear” because μ_i is often a nonlinear function of the covariates.
Within this class of models, the project uses a binary logistic regression which in essence models how the odds of default on the loan contract depend on a set of explanatory features. Mathematically, the model is represented as follows:
$$\operatorname{logit}(\pi_i) = \log\left(\frac{\pi_i}{1-\pi_i}\right) = \beta_0 + \beta x_i$$
In this context, the distribution of the dependent variable is assumed to be binomial with a “default” probability E(Y)=π. X represents the input space of continuous or discrete features, which enter the model linearly in the parameters. Finally, the link function, i.e. the logit link, is:
$$\eta_i = g(\pi_i) = \log\left(\frac{\pi_i}{1-\pi_i}\right)$$
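To make the link function concrete, the short sketch below converts between probabilities and log-odds and evaluates the default probability for a given feature vector. The coefficients in the usage lines are made up purely for illustration and are not estimates from the project.

```python
import math

def logit(p):
    """Log-odds of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))

def inv_logit(eta):
    """Inverse of the logit link: maps a linear predictor back to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))

def default_probability(x, beta0, beta):
    """pi_i = inv_logit(beta_0 + beta * x_i) for one loan's feature vector x."""
    eta = beta0 + sum(b * xi for b, xi in zip(beta, x))
    return inv_logit(eta)

# Hypothetical coefficients: intercept -2.0 and a single feature with beta = 0.7.
p_low = default_probability([1.0], -2.0, [0.7])
p_high = default_probability([2.0], -2.0, [0.7])
# A one-unit increase in the feature multiplies the odds of default by exp(0.7).
odds_ratio = (p_high / (1 - p_high)) / (p_low / (1 - p_low))
```

The last line illustrates the usual reading of logistic regression coefficients: each coefficient is the change in log-odds per unit change in the corresponding feature.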
Gradient Boosting (GBM)
Boosting algorithms were originally introduced by the machine learning community (Schapire 1990; Freund 1995; Freund and Schapire 1996) for classification problems. The gradient boosting machine (GBM) essentially is an ensemble learning method, which constructs a predictive model by additive expansion of sequentially fitted weak learners. Whereas random forests build an ensemble of deep independent trees, GBMs build an ensemble of shallow trees in sequence with each tree learning and improving on the previous one. Although shallow trees by themselves are rather weak predictive models, they can be “boosted” to produce a powerful “committee” that, when appropriately tuned, is often hard to beat with other algorithms.
The GBM method can also be viewed as a numerical optimization algorithm that tries to find an additive model that minimizes the loss function. Hence, at each step, the GBM algorithm iteratively adds a new decision tree (what is called a “weak learner”) that leads to the greatest reduction of the loss function. As further explained by Touzani et al. (2018), "... in regression, the algorithm starts by initializing the model by a first guess, which is usually a decision tree that maximally reduces the loss function (which is for regression the mean squared error), then at each step a new decision tree is fitted to the current residual and added to the previous model to update the residual. The algorithm continues to iterate until a maximum number of iterations, provided by the user, is reached. This process is so-called stage wise, meaning that at each new step the decision trees added to the model at prior steps are not modified. By fitting decision trees to the residuals the model is improved in the regions where it does not perform well."
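The stage-wise residual fitting described above can be sketched for a one-dimensional regression problem. This is a toy version under several assumptions that are ours rather than the source's: the weak learner is a single-split tree (a stump), the first guess is the mean of the target rather than a tree, and a learning rate (shrinkage) is applied to each new stump. It is not how H2O GBM or XGBoost are implemented.

```python
import numpy as np

def fit_stump(x, residual):
    """Best single-split regression tree on one feature (least squares).
    Assumes x contains at least two distinct values."""
    best = None
    for threshold in np.unique(x)[:-1]:
        left = x <= threshold
        left_value, right_value = residual[left].mean(), residual[~left].mean()
        sse = ((residual[left] - left_value) ** 2).sum() \
            + ((residual[~left] - right_value) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, threshold, left_value, right_value)
    return best[1:]

def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return np.where(x <= threshold, left_value, right_value)

def fit_gbm(x, y, n_trees=100, learning_rate=0.1):
    """Stage-wise boosting: each stump is fitted to the current residual and
    earlier stumps are never modified."""
    init = float(y.mean())                      # first guess (assumption: a constant)
    prediction = np.full(len(y), init)
    stumps = []
    for _ in range(n_trees):
        residual = y - prediction               # negative gradient of squared error
        stump = fit_stump(x, residual)
        prediction = prediction + learning_rate * predict_stump(stump, x)
        stumps.append(stump)
    return init, learning_rate, stumps

def predict_gbm(model, x):
    init, learning_rate, stumps = model
    return init + learning_rate * sum(predict_stump(s, x) for s in stumps)
```

Fitting this to a smooth curve, for example `y = np.sin(2 * np.pi * x)` on a grid of `x` values, shows the training error shrinking as more stumps are added.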
Deep Learning
Within the last two decades, deep structured learning, more commonly known as deep learning or hierarchical learning, has emerged as a new area of machine learning research. It is considered part of a broader family of machine learning approaches that are based on artificial neural networks with representation learning. In terms of the development of this technology, Microsoft applied the Gartner hype cycle to the development of artificial neural networks and created Figure 4 to align different generations of the neural network with the various phases designated in the hype cycle.
As represented in Figure 4, the hype around this technology started in the early 1980s, but it was not until 2006 that learning with these networks became highly effective.
In such frameworks, data is passed through many layers of architecture, each able to extract patterns and features and pass them to the next layer. The initial layers in a deep learning architecture typically extract low-level features, and succeeding layers combine features to form a complete representation. Figure 5 provides a visual representation of the various deep learning models that are widely used for cutting-edge computing applications.
Stacked Models
Ensemble machine learning methods use multiple learning algorithms to obtain better predictive performance than could be obtained from any of the constituent learning algorithms. Many of the popular modern machine learning algorithms are actually ensembles. For example, Random Forest and Gradient Boosting Machine (GBM) are both ensemble learners. Both bagging (e.g. Random Forest) and boosting (e.g. GBM) are methods for ensembling that take a collection of weak learners (e.g. decision tree) and form a single, strong learner.
H2O’s Stacked Ensemble method is a supervised ensemble machine learning algorithm that finds the optimal combination of a collection of prediction algorithms using a process called stacking. Like all supervised models in H2O, Stacked Ensemble supports regression, binary classification, and multiclass classification.
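The idea of stacking can be sketched without H2O: base learners produce out-of-fold predictions, and a metalearner is trained on those predictions. The example below is a simplified regression version under our own assumptions (ordinary least squares for both base learners and metalearner, base learners that differ only in the columns they see); it illustrates the mechanism and is not H2O's implementation.

```python
import numpy as np

def fit_ols(X, y):
    """Least-squares fit with an intercept; returns the coefficient vector."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict_ols(coef, X):
    return np.column_stack([np.ones(len(X)), X]) @ coef

def make_subset_learner(cols):
    """Hypothetical base learner: OLS restricted to a subset of columns."""
    return (lambda X, y: fit_ols(X[:, cols], y),
            lambda model, X: predict_ols(model, X[:, cols]))

def fit_stack(X, y, learners, n_folds=5, seed=0):
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    Z = np.zeros((len(y), len(learners)))       # out-of-fold ("level-one") predictions
    for held_out in folds:
        train = np.setdiff1d(np.arange(len(y)), held_out)
        for k, (fit, predict) in enumerate(learners):
            Z[held_out, k] = predict(fit(X[train], y[train]), X[held_out])
    meta = fit_ols(Z, y)                        # metalearner combines the base models
    base = [fit(X, y) for fit, _ in learners]   # base models refitted on all data
    return base, meta

def predict_stack(model, X, learners):
    base, meta = model
    Z = np.column_stack([predict(m, X) for m, (_, predict) in zip(base, learners)])
    return predict_ols(meta, Z)
```

Training the metalearner on out-of-fold predictions rather than in-sample predictions keeps it from simply rewarding the base model that overfits the most.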
Figure 4. Microsoft applying the Gartner hype cycle graph to analyze the history of artificial neural network technology. Source: Deng and Yu (2013)
Figure 5. The structures of different deep learning models. Source: Wang et al. (2016)
Explainability of ML Models
In this project, our objectives concerning the explainability component are two-fold:
-
to explore the utility of classic XAI methods as applied to financial problem sets
-
to propose new methods that provide insights into the inner workings of ML models and are specifically suited to financial data
Concerning the first objective, the VA tool we built allows users to observe various global and local explanations for the different ML and DL models built for the credit risk use case. Global explanations allow users to understand how the model works in a general sense, i.e. which features contribute most to the model's decisions when we consider all predictions made. Global explanations are particularly useful for a non-technical audience, as they give first-hand information about what the model uses, on average and across all the data, to make decisions. Local explanations, on the other hand, answer the question: "for this specific loan contract, why did the model make this decision?" Local explanations are especially relevant for end-users who are mostly interested in their specific case and, in the event of an unfavorable decision, in what they need to change in the future in order to get a different outcome.
The VA tool developed by the team allows users to access both types of explanations. Below we summarize the different classic XAI methods that are deployed in the developed app.
Explanations enabled on the VA tool:
-
(Global) Variable Importance Plot. This visualization presents the relative importance of the different input features included in the model specification. Within our VA tool, this plot is available for all ML models apart from the ensemble models. The relative influence of each feature is determined by checking whether the variable in question was selected to split on during the tree building process and by how much the squared error (over all trees) decreased as a result.
-
(Global) Partial Dependence Plot. The PDPs help visualize the relationship between a subset of the features and the response while accounting for the average effect of the other predictors in the model. For numerical data, the PD-based feature importance is defined as a deviation of each unique feature value from the average curve. A key assumption of PDPs is feature independence, i.e. the approach assumes that the variables for which the partial dependence is computed are not correlated with other features.
-
(Global) SHAP Summary. SHAP, short for SHapley Additive exPlanations, presents a unified framework for interpreting predictions and is based on the game-theoretically optimal Shapley values. According to the paper by [32], for each prediction instance, SHAP assigns an importance score to each feature included in the model’s specification. Its novel components include: (i) the identification of a new class of additive feature importance measures, and (ii) theoretical results showing there is a unique solution in this class with a set of desirable properties. The goal of SHAP is to explain the prediction of an instance x by computing the contribution of each feature to the prediction.
Compared to the variable importance plot, the SHAP values provide the direction and the magnitude of the impact of a certain feature on the model's response. Figure provides an example of a summary SHAP plot that is incorporated in the developed VA tool.
In addition to the visual representation of the different global explanations, the VA tool also allows users to access the interpretation of the different techniques from a financial perspective.
-
(Local) SHAP row-specific explanations: This method shows the contribution of each feature for a given instance. The sum of the feature contributions and the bias term is equal to the raw prediction of the model, i.e., the prediction before applying the inverse link function. H2O implements TreeSHAP which, when the features are correlated, can increase the contribution of a feature that had no influence on the prediction.
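The additive property mentioned above (contributions plus bias term equal the raw prediction) can be checked on a toy model by computing exact Shapley values through enumeration of all feature coalitions. This sketch is not TreeSHAP and not the H2O implementation: "absent" features are simply replaced by a baseline value, which is our simplifying assumption, and the enumeration is only feasible for a handful of features.

```python
import itertools
import math

def shapley_values(predict, x, baseline):
    """Exact Shapley values for one instance x of a model `predict`.
    Features outside a coalition are set to their `baseline` value."""
    n = len(x)

    def value(coalition):
        z = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return predict(z)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Weight of a coalition of this size in the Shapley formula.
            weight = (math.factorial(size) * math.factorial(n - size - 1)
                      / math.factorial(n))
            for subset in itertools.combinations(others, size):
                s = set(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi

# Toy raw score (hypothetical): an additive term plus an interaction term.
raw_score = lambda z: 1.0 + 2.0 * z[0] + 3.0 * z[1] * z[2]
x, baseline = [1.0, 2.0, 3.0], [0.0, 0.0, 0.0]
phi = shapley_values(raw_score, x, baseline)
bias = raw_score(baseline)
# Local accuracy: bias + sum(phi) reproduces the raw prediction for x.
```

In this toy model the interaction term is split equally between the two features involved, which is a direct consequence of the symmetry property of Shapley values.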
Figure 6. An example of a variable importance plot for a trained GBM model
Figure 7. An example of a PDP plot for a trained DL model [left panel: explanation for a factor variable, employment length; right panel: explanation for a number variable, loan amount]
Figure 8. SHAP (SHapley Additive exPlanation) values attribute to each feature the change in the expected model prediction when conditioning on that feature
Figure 9. SHAP contributions for a trained DRF model
Figure 10. SHAP contributions for a specific loan contract
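For the partial dependence plots described earlier, the underlying computation is short enough to sketch directly: the feature of interest is forced to each grid value in turn, and the model's predictions are averaged over all rows. The function name and the `predict` callable are hypothetical; the VA tool itself relies on the plots produced by the modelling framework.

```python
import numpy as np

def partial_dependence(predict, X, feature, grid):
    """Average prediction when column `feature` is set to each value in `grid`.
    `predict` is any function mapping a 2-D array to a vector of predictions."""
    curve = []
    for value in grid:
        X_mod = X.copy()
        X_mod[:, feature] = value        # overwrite the feature for every row
        curve.append(predict(X_mod).mean())
    return np.array(curve)
```

Because every row keeps its original values for the other features, the method implicitly combines grid values with feature values that may never occur together, which is why the independence assumption matters.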
Stability of Predictions
A key objective of our project is to investigate the robustness and stability of predictions obtained by various state-of-the-art ML models. In this context, we run a sensitivity analysis which users can access through the VA tool. The specific goal is to understand how uncertain an output is within a certain mathematical model by considering how changes in the inputs affect the obtained predicted probabilities.
The process of recalculating outputs under alternative assumptions, so as to determine the impact of a variable, can be summarized in the following steps:
-
Step 1. Using the original data, an ML object is fitted
-
Step 2. We select a variable from the original input space and add different levels of noise to it. The noise is determined as follows: let x denote the selected variable and z = max(x) - min(x). The noisy variable is x + runif(n, -a, +a), where n is the number of observations in x and the level of noise is controlled through a, which can be defined as:
-
a = z/50;
-
a = (1.1; 1.5; 2.5; 5) x z/50
-
if d = smallest difference between adjacent unique x values, a = d/5
-
a = (1.1; 1.5; 2.5; 5) x d/5
-
Step 3. Using the data with added noise to one selected variable, we refit the same specification of the ML object and we print:
-
the positive or negative change in the overall predictive utility of the model (AUC and AUCPR)
-
the mean change in the variable
-
the mean change in the predicted probability of default
-
the min change in the predicted probability of default
-
the max change in the predicted probability of default
-
the number of class changes that have happened due to the change in the variable
-
the correlation coefficient between the change in the variable and the change in the predicted probability
-
the regression coefficient from a fitted ordinary least squares (OLS) model (where the dependent variable, y, is the change in the prediction and the independent variable, x, is the change in the variable)
-
the scatter plot with an added smooth (y = change in the prediction ~ x = change in the variable)
-
In addition to the printed table, the tool also enables users to access the final evaluation and interpretation concerning the stability of the model, specifically indicating whether the model is performing in line with financial logic and is robust (small changes in the input specification do not significantly change the predicted probabilities).
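The noise definition in Step 2 and the summary statistics in Step 3 can be sketched as follows. The function names and the 0.5 classification cut-off are our assumptions. The sketch takes the predicted probabilities before and after perturbation as inputs, so the model fitting and refitting described in Steps 1 and 3 are left out, as are the AUC and AUCPR comparison and the scatter plot.

```python
import numpy as np

def noise_amount(x, factor=1.0, use_range=True):
    """Half-width `a` of the uniform noise.
    use_range=True:  a = factor * z / 50, with z = max(x) - min(x)
    use_range=False: a = factor * d / 5, with d the smallest difference
                     between adjacent unique values of x"""
    if use_range:
        return factor * (x.max() - x.min()) / 50.0
    return factor * np.diff(np.unique(x)).min() / 5.0

def add_noise(x, a, rng):
    """x + U(-a, +a), drawn independently for each observation."""
    return x + rng.uniform(-a, a, size=len(x))

def sensitivity_summary(x, x_noisy, p_before, p_after, cutoff=0.5):
    """Summary of how predicted default probabilities react to the noise."""
    dx, dp = x_noisy - x, p_after - p_before
    return {
        "mean_change_variable": float(dx.mean()),
        "mean_change_pd": float(dp.mean()),
        "min_change_pd": float(dp.min()),
        "max_change_pd": float(dp.max()),
        # Number of loans whose predicted class flips at the chosen cut-off.
        "class_changes": int(np.sum((p_before >= cutoff) != (p_after >= cutoff))),
        "correlation": float(np.corrcoef(dx, dp)[0, 1]),
        "ols_slope": float(np.polyfit(dx, dp, 1)[0]),  # OLS of dp on dx
    }
```

Running the summary for each of the factors 1.1, 1.5, 2.5 and 5 gives a simple picture of how quickly the predictions drift as the perturbation grows.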
Further considerations: Stability of explanations
We also set out to investigate the robustness and stability of explanations provided by classical XAI methods as applied to financial problem sets. Recent work has shown that post hoc explanation methods are unstable (i.e., small perturbations to the input can substantially change the constructed explanations), as well as not robust to distribution shifts (i.e., explanations constructed using a given data distribution may not be valid on others) (Ghorbani et al., 2019; Lakkaraju & Bastani, 2020). A key reason why many post hoc explanation methods are not robust is that they construct explanations by optimizing fidelity on a given covariate distribution p(x) (Ribeiro et al., 2018; 2016; Lakkaraju et al., 2019b)—i.e., choose the explanation that makes the same predictions as the black box on p(x).
In this context, our team set out to investigate different aspects of whether explanations provided by state-of-the-art XAI methods are stable and robust. Specifically, we have set out to investigate whether:
-
Similar data points/loan contracts have similar outputs and similar explanations
-
Explanations across different XAI methods are similar for similar data points
Concerning the first objective, we turn to graph theory in order to investigate whether loan contracts that are similar across all the features we observe have the same explanations. Specifically, we estimate similarity networks of the underlying loan contracts, i.e. a metric D that provides the relative distance between loan contracts by applying the standardized Euclidean distance between each pair (x_i, x_j) of loan feature vectors.
A network estimated in this way would be fully connected; hence, in order to have a clear visualization, we then derive the minimum spanning tree (MST) representation of the loans. An MST, or minimum weight spanning tree, is a subset of the edges of a connected, edge-weighted undirected graph that connects all the vertices together, without any cycles and with the minimum possible total edge weight. Put differently, we keep only the shortest links needed to keep all loan contracts connected.
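A compact sketch of the two steps above, the distance matrix and the MST, is given below. It assumes that "standardized Euclidean distance" means the Euclidean distance after dividing each feature by its standard deviation, and it uses Prim's algorithm for the spanning tree; the source does not state which MST algorithm was used, so that choice is ours.

```python
import numpy as np

def standardized_distances(X):
    """Pairwise Euclidean distances after scaling each column to unit variance."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    diff = Z[:, None, :] - Z[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))

def minimum_spanning_tree(D):
    """Prim's algorithm on a dense distance matrix D.
    Returns a list of edges (from_node, to_node, distance)."""
    n = len(D)
    in_tree = np.zeros(n, dtype=bool)
    in_tree[0] = True
    best_dist = D[0].copy()             # cheapest known link from the tree to each node
    best_from = np.zeros(n, dtype=int)  # tree node that offers that cheapest link
    edges = []
    for _ in range(n - 1):
        candidates = np.where(in_tree, np.inf, best_dist)
        j = int(np.argmin(candidates))  # closest node not yet in the tree
        edges.append((int(best_from[j]), j, float(best_dist[j])))
        in_tree[j] = True
        closer = (D[j] < best_dist) & ~in_tree
        best_dist[closer] = D[j][closer]
        best_from[closer] = j
    return edges
```

For n loans the tree always has n - 1 edges, which is what makes the visualization readable compared with the fully connected network.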
In the next step, we define an explanations distance measure (expDis).
where:
-
n are the top 10 features provided by the .
-
x_i and x_j are the different pairs of loan feature vectors
-
SHAP are the specific SHAP contributions
The explanation difference formula takes the top n features of two loan contracts, adds up the squared difference of the contributions of each feature the two loan contracts have in common, and for each feature that is not common, adds up the square of the contribution. Finally, we take the square root of the sum to obtain the explanation distance between two loan contracts.
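The verbal description above translates into a short function. This is our reading of the description rather than the authors' code: each loan's SHAP contributions are assumed to be given as a dictionary from feature name to contribution, and the "top n" features are assumed to be ranked by absolute contribution.

```python
import math

def top_features(shap_row, n=10):
    """Names of the n features with the largest absolute SHAP contribution."""
    ranked = sorted(shap_row, key=lambda f: abs(shap_row[f]), reverse=True)
    return set(ranked[:n])

def exp_dis(shap_i, shap_j, n=10):
    """Explanation distance between two loan contracts."""
    top_i, top_j = top_features(shap_i, n), top_features(shap_j, n)
    total = 0.0
    for f in top_i & top_j:              # features both loans have in common
        total += (shap_i[f] - shap_j[f]) ** 2
    for f in top_i - top_j:              # features only in loan i's top n
        total += shap_i[f] ** 2
    for f in top_j - top_i:              # features only in loan j's top n
        total += shap_j[f] ** 2
    return math.sqrt(total)

# Hypothetical contributions for two loans, using the top 2 features only.
loan_a = {"a": 0.5, "b": -0.3, "c": 0.01}
loan_b = {"a": 0.2, "b": 0.0, "c": 0.4}
distance = exp_dis(loan_a, loan_b, n=2)
```

Two loans with identical top-n contributions have a distance of zero, and the distance grows both when shared features contribute differently and when the sets of top features diverge.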
In both cases, the results suggest that loan contracts that are closer to each other share the same top explanatory feature.
In the next step, we investigate the codependence between the explanation difference and the spatial difference that emerges between loan contracts. In order to assess how explanations differ for the set of neighbouring nodes, in addition to the explanation distance (expDis), we also calculate the spatial difference between points. Here, we use a modified set of features to calculate the spatial distance: only features that were considered for the explanation difference are included in the spatial calculation as well.
Here are the results from a randomly selected loan contract and its 100 closest neighbours.
What we can observe when we look at the relationship between the explanatory and spatial difference is that there exists an almost linear association, meaning that points that are closer together have a smaller explanatory difference, and as the distance between the points increases, so does the explanatory difference.
Finally, the research team also set out to investigate whether explanations across different XAI methods are consistent for the same data points. An example of the comparison made is presented below.
For the LIME plots, blue means that the feature contributes to the loan being classified as "active" and orange means the loan is more likely to default. For consistency, we have reclassified the dependent variable for the SHAP calculation as well, so the plots below display the probability of the loan being paid. Thus, the red features contribute to a higher probability of the loan being paid, while the blue features are those that push the prediction towards default.
What we can see is that the top 5 features are the same for both LIME and SHAP, although the magnitude of the contributions differs between the two methods. For example, the number of revolving trades opened in the past 12 months has the highest contribution to the "active" class in SHAP, while the same feature has a small contribution to "default" in LIME. This could be because LIME looks at the local area, where this feature has a small correlation with the default status.