Machine Learning Workshop
With computational resources growing and datasets becoming larger, the application of machine learning (ML) in finance has gained traction. Financial institutions are interested in how and where ML models can add value to their business model.
During the past two years, Zanders has been conducting research into the applicability of ML in the areas of asset and liability management (ALM) and credit risk. One of these research studies covered probability of default (PD) estimation for corporate and retail exposures. The results seemed relevant and insightful, and therefore we decided to share them with our clients in the form of a Machine Learning Workshop. In this article we share the main results.
We have organized this Machine Learning Workshop multiple times in the Swiss market to illustrate how ML can add value within financial institutions. The workshop took two hours and was aimed at the Model Development and Model Validation functions. Prior knowledge of credit risk or ML was not required. The workshop started with the introduction of a classification problem and an extensive explanation of the statistical models and ML models that can be used to address it. We then switched to PD estimation (which is a classification problem) and performed the estimation with statistical models and ML models. In the final section, we compared the results and drew conclusions on, among other things, performance, suitability, and interpretability.
Classification is the problem of identifying to which of a set of categories a new observation belongs, based on a training set of data containing observations whose category membership is known. An example of a classification problem is classifying an e-mail as spam based on the words that are used in the e-mail. Other examples are classifying an Iris flower as Setosa, Virginica or Versicolor based on its sepal width and sepal length, and classifying a corporate as defaulted based on its financial ratios.
To solve classification problems, it is common to use traditional statistical models, which are robust and interpretable. The statistical models in scope of the workshop are Linear Discriminant Analysis (LDA), Logistic Regression, LASSO (least absolute shrinkage and selection operator), Ridge Regression and Elastic Net. LDA assumes that the input data come from a multivariate Gaussian distribution. A class prediction is made by estimating the parameters of the distribution and determining the linear discriminant functions; the predicted class is the one whose discriminant function gives the highest value for the input variables. Logistic Regression, the most commonly used binary outcome model, is fitted by maximum likelihood estimation (MLE), which means selecting the coefficient values under which the observed data are most likely. MLE provides asymptotically unbiased estimators but does not penalize model size, so the fitted model sometimes includes many explanatory variables. Allowing some bias by shrinking coefficients or limiting the number of explanatory variables might lead to higher predictive accuracy and better interpretability. Ridge Regression, LASSO and Elastic Net do exactly this by penalizing large coefficients, so that variables that contribute little are shrunk towards zero (Ridge Regression) or dropped from the model altogether (LASSO and Elastic Net). Details and properties of the statistical models are provided in the workshop.
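To make the difference between plain MLE and a penalized fit concrete, the following is a minimal sketch in plain Python, not the implementation used in the workshop. It fits a logistic regression by gradient ascent on the log-likelihood and optionally adds a ridge-style penalty on the coefficients. The function names, the two feature columns and the ten toy observations are assumptions made up for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, l2_penalty=0.0, learning_rate=0.1, n_iter=5000):
    """Gradient ascent on the average log-likelihood.

    With l2_penalty > 0 a ridge-style penalty shrinks the coefficients
    towards zero; the intercept is left unpenalized.
    """
    n, n_features = len(X), len(X[0])
    intercept, coefs = 0.0, [0.0] * n_features
    for _ in range(n_iter):
        grad_intercept = 0.0
        grad_coefs = [0.0] * n_features
        for row, target in zip(X, y):
            p = sigmoid(intercept + sum(c * v for c, v in zip(coefs, row)))
            grad_intercept += target - p
            for j, v in enumerate(row):
                grad_coefs[j] += (target - p) * v
        intercept += learning_rate * grad_intercept / n
        coefs = [c + learning_rate * (g / n - l2_penalty * c)
                 for c, g in zip(coefs, grad_coefs)]
    return intercept, coefs

# Hypothetical toy data: two made-up financial ratios per borrower,
# label 1 = defaulted, 0 = performing. Not taken from the study.
X = [[0.20, 0.5], [0.40, 0.1], [0.50, 0.9], [0.60, 0.3], [0.70, 0.7],
     [0.90, 0.4], [0.30, 0.8], [0.80, 0.2], [0.85, 0.6], [0.25, 0.3]]
y = [0, 0, 1, 0, 1, 1, 0, 1, 0, 1]

mle_intercept, mle_coefs = fit_logistic(X, y)
ridge_intercept, ridge_coefs = fit_logistic(X, y, l2_penalty=1.0)

print("MLE coefficients:  ", [round(c, 3) for c in mle_coefs])
print("Ridge coefficients:", [round(c, 3) for c in ridge_coefs])
```

The penalized coefficients come out smaller in magnitude than the unpenalized ones, which is the bias-for-stability trade-off described above. A LASSO or Elastic Net penalty would need a different update rule, because the absolute-value penalty is not differentiable at zero.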
ML models in scope
Besides statistical models, ML models can be used to address classification problems. The ML models in scope of the workshop are Decision Tree, Random Forest, Gradient Boosting, Extreme Gradient Boosting and Neural Networks. A Decision Tree, as its name suggests, is a tree of decisions and branches that partitions the data into different categories. Please refer to Figure 1 for an illustrative decision tree example of a simplified PD classifier model based on only two risk factors: age and income.
Figure 1 Decision tree example of a simplified PD classifier with two risk factors: age and income. The first decision is whether (drawing on experience from a large sample of historical observations) a potential future client is classified as having a low or high probability of default (PD) on their loan obligations based on an age cut-off (in this case 75 years). An applicant older than 75 (right tree branch) is classified as having a high PD, while an applicant under 75 (left tree branch) moves on to the next decision node. Here an applicant under 75 without a stable income (left tree branch) is again classified as having a high PD, while an applicant under 75 with a stable income is finally classified as unlikely to default on the loan (low PD).
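The two splits described for Figure 1 can be written out as a short function. This is only an illustrative sketch: the function name and arguments are hypothetical, and the treatment of an applicant aged exactly 75 (sent to the income check here) is an assumption, since the caption does not specify it.

```python
def classify_pd(age, has_stable_income, age_cutoff=75):
    """Follow the two decision nodes of the simplified tree in Figure 1."""
    # First node: age cut-off. Applicants older than the cut-off are high PD.
    if age > age_cutoff:
        return "high PD"
    # Second node: income stability for applicants at or below the cut-off
    # (assumption: age exactly equal to the cut-off falls in this branch).
    if not has_stable_income:
        return "high PD"
    return "low PD"

for age, stable in [(80, True), (40, False), (40, True)]:
    print(age, stable, classify_pd(age, stable))
```

A fitted decision tree is a learned version of this kind of nested rule, with the split variables and cut-offs chosen from the training data rather than written by hand.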
Random Forest builds a large collection of de-correlated decision trees and then averages the results, whereas Gradient Boosting builds decision trees sequentially, each one fitted to reduce the remaining prediction error of the trees before it*. Finally, we elaborate on Neural Networks. Inspired by the neural networks that we are familiar with from biology, the assumption is that the data can be modeled with an input layer, an output layer and one or more hidden layers. The algorithm is trained on a subset of the data to establish the relationships (weights) between these layers as well as possible. These relationships are then used to predict the output layer for a given input layer. During the workshop we elaborate on the mathematical properties of the different ML models.
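The idea behind Gradient Boosting can be sketched in a few lines of plain Python. The example below is a deliberately simplified variant under stated assumptions: it uses squared-error loss, one-split trees (stumps) on a single feature, and made-up data. It is not the Extreme Gradient Boosting setup used in the study, and all names are hypothetical.

```python
def fit_stump(xs, residuals):
    """Find the single split on xs that best fits the residuals (least squares)."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # keep the right side non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1], best[2], best[3]

def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean

def boost(xs, ys, n_rounds=20, learning_rate=0.5):
    """Add stumps one by one, each fitted to the current residuals."""
    base = sum(ys) / len(ys)
    preds = [base] * len(xs)
    stumps, history = [], []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * predict_stump(stump, x)
                 for p, x in zip(preds, xs)]
        # Track the training mean squared error after each round.
        history.append(sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys))
    return base, stumps, history

# Hypothetical toy data: one risk factor and a 0/1 default flag.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1, 1, 0, 0, 1, 0, 0, 0]
base, stumps, history = boost(xs, ys)
print("training MSE after first and last round:", history[0], history[-1])
```

The training error falls with every added stump, which also illustrates why an unregularized boosted model can end up fitting the training sample too closely.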
PD estimation for corporate and retail exposures
In order to capture a wider variety of portfolio structures, we applied the above-mentioned models to two distinct portfolios. The first one consists of 80,000 corporate loan-level observations, spanning eight European countries. It contains balance sheet and profit and loss information, with an overall share of defaults of under 1% of all loan-level observations.
This share is preserved when splitting the sample into a training sample with 80% of the observations and a testing sample with 20%. The advantage of such a portfolio is that external ratings are available to benchmark and cluster, thereby allowing for a comparison of predictive power within countries and rating classes. The second portfolio consists of over 600,000 mortgage loan-level observations, covering 50,000 unique credit lines with exposure from fifty US states. Indicative of composition differences across types of portfolios, the mortgage dataset has a much higher share of defaulted exposures at almost 2.5%.
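Preserving the default share across the 80/20 split is usually called a stratified split. A minimal sketch in plain Python is given below; the function name, the random seed and the synthetic label vector (1,000 observations with a 1% default share) are assumptions for illustration and do not reproduce the portfolios described above.

```python
import random

def stratified_split(labels, test_fraction=0.2, seed=0):
    """Split indices so that each class keeps the same share in both samples."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, label in enumerate(labels) if label == cls]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_fraction)
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Synthetic labels: 10 defaults (1) among 1,000 observations.
labels = [1] * 10 + [0] * 990
train_idx, test_idx = stratified_split(labels)

train_share = sum(labels[i] for i in train_idx) / len(train_idx)
test_share = sum(labels[i] for i in test_idx) / len(test_idx)
print(len(train_idx), len(test_idx), train_share, test_share)
```

With a default share this low, a purely random split could leave the testing sample with very few or even no defaults, which would make any performance comparison unreliable.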
The results of the analysis allowed us to draw conclusions on the performance, interpretability, and suitability of the different models.
Comparing the two sets of PD models across the two portfolios (corporate loans and retail mortgages), we observe that selected ML models such as Extreme Gradient Boosting outperform all statistical models in terms of predictive power, including the popular Logistic Regression (see Figure 2). Moreover, we found that the pattern of model performance is consistent across the corporate loan and mortgage loan portfolios. Finally, we observed that the results of the various statistical models are close to each other, whereas the outcomes of the different ML models vary materially.
Figure 2 Comparison of model performance between Logistic Regression and Extreme Gradient Boosting, as measured by the area under the receiver operating characteristic curve (AUC).
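For readers less familiar with the AUC: it can be read as the probability that a randomly chosen defaulted exposure receives a higher predicted PD than a randomly chosen non-defaulted one, with ties counted as one half. The sketch below computes it directly from that definition in plain Python. The function name, scores and labels are made up for illustration and are not results from the study.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of scores.

    Counts, over all (defaulted, non-defaulted) pairs, how often the
    defaulted observation gets the higher score; ties count as 0.5.
    """
    positives = [s for s, label in zip(scores, labels) if label == 1]
    negatives = [s for s, label in zip(scores, labels) if label == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Hypothetical predicted PDs from two models for six exposures.
labels = [0, 0, 1, 0, 1, 1]
scores_model_a = [0.10, 0.40, 0.35, 0.20, 0.80, 0.70]
scores_model_b = [0.10, 0.20, 0.90, 0.30, 0.80, 0.70]

print("Model A AUC:", auc(scores_model_a, labels))
print("Model B AUC:", auc(scores_model_b, labels))
```

An AUC of 0.5 corresponds to a model that ranks no better than chance, and 1.0 to a model that ranks every defaulted exposure above every non-defaulted one. The pairwise loop is quadratic in the sample size, so for portfolios of the size discussed here a rank-based implementation would be preferable.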
Using the same example as above, comparing Logistic Regression with Extreme Gradient Boosting, we observe that the distribution of the predicted probabilities of default differs materially between the two models. This discrepancy can have an impact on the overall portfolio rating (see Figure 3).
Figure 3 Comparison of the predicted distribution of Probability of Default (PD) between Logistic Regression and Extreme Gradient Boosting.
The best-performing ML models clearly outperformed the statistical models in terms of predictive power. However, we found that the dependency structure between the input variables and the target variable was difficult to identify with ML models. For statistical models we were able to determine how much predictive power each individual input variable had; this was not straightforward with all ML models.
The limited interpretability of the ML models poses a challenge for their implementation; explaining the results to a regulator and/or a validator is difficult (if feasible at all). We therefore recommend implementing the ML models as challenger models for statistical models to:
- Get more understanding of the relationship between the dependent variable and explanatory variables;
- Identify potential data and/or model issues;
- Make a good trade-off between the gain in predictive power and the cost of switching from a statistical model to an ML model.
The workshop triggered interesting discussions on model development, model validation, regulatory requirements and ML models. After sharing experiences and knowledge, the bank decided to implement some of the ML models to possibly use them as challenger models.
The outcomes of the analysis fueled various research initiatives within Zanders. They made us aware that mitigating the limited interpretability of ML models can help tip the decision towards wider use of these models. Therefore, our next steps are to enhance interpretability through newer techniques such as hybrid models or game-theoretic methods that explain the output of ML models (e.g. Shapley Additive Explanations). Additionally, we aim to host more workshops as platforms to explain our challenger model features and to further customize our support to model development and model validation functions.
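To give a flavor of the game-theoretic approach, the sketch below computes exact Shapley values for a single prediction of a toy scoring function by averaging each feature's marginal contribution over all feature orderings. Everything in it is an assumption made for illustration: the scoring function, the feature names, and the baseline are hypothetical, and practical tools such as Shapley Additive Explanations rely on approximations, because the exact computation grows factorially with the number of features.

```python
from itertools import permutations

def shapley_values(model, instance, baseline):
    """Exact Shapley values: average marginal contribution over all orderings.

    Features are switched one by one from their baseline value to the
    value of the explained instance.
    """
    names = list(instance)
    totals = {name: 0.0 for name in names}
    orderings = list(permutations(names))
    for ordering in orderings:
        current = dict(baseline)
        previous_output = model(current)
        for name in ordering:
            current[name] = instance[name]
            output = model(current)
            totals[name] += output - previous_output
            previous_output = output
    return {name: total / len(orderings) for name, total in totals.items()}

def toy_pd_score(f):
    # Hypothetical score with an interaction term; not a fitted model.
    return (0.02 + 0.10 * f["leverage"]
            + 0.05 * f["leverage"] * f["late_payments"]
            - 0.03 * f["stable_income"])

instance = {"leverage": 0.8, "late_payments": 2, "stable_income": 1}
baseline = {"leverage": 0.4, "late_payments": 0, "stable_income": 1}

values = shapley_values(toy_pd_score, instance, baseline)
print(values)
print("difference explained:", toy_pd_score(instance) - toy_pd_score(baseline))
```

The contributions add up to the difference between the prediction for the instance and the prediction for the baseline, which is the property that makes this kind of attribution attractive when results have to be explained to a validator or regulator.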
*) Given the granular power of such a model, the predictions often fit the training data too closely (so-called overfitting), which limits the use of the model on other datasets. To control this unwanted effect, we also implement Extreme Gradient Boosting, which adds regularization (i.e. it penalizes fluctuations that generate extreme values).