Introduction
After periods of hype followed by several AI winters during the past half-century, we are experiencing an AI summer that might be here to stay. Machine learning now drives many real-world applications in the financial sector, ranging from fraud detection to credit scoring.
Innovation through AI promises considerable benefits for businesses and economies through its contributions to productivity and efficiency. At the same time, the potential challenges to adoption cannot be ignored. In the present paper, we focus on these from a model risk perspective. To make the discussion as concrete as possible, we will illustrate many aspects in the context of credit models. However, most of the conclusions can be straightforwardly translated to other applications that are relevant for the banking sector.
Validating AI
A typical validation process starts by determining the scope of the model as well as how it relates to other algorithms. After these introductory steps, a model validator analyzes various aspects of how model input data is managed and continues by measuring model performance. Before concluding, a proper validation process also studies the actual implementation as well as the level of documentation.
In a recent speech by the Fed,6 it was suggested that both regulators and financial institutions should address the problem of validating AI applications by starting from what already exists. This is exactly what we will do in the present chapter.
Scope
A model is always used within a certain scope. For example, we build credit models to deal with credit risk in mortgages for the Belgian market, or we create an interest rate curve generator to value EUR collateralized derivatives. When the scope changes, most model risk frameworks require a new validation.
In the context of machine learning applications, the scope of the model can be quite different from what validators are used to. As an example, when validating a chatbot that is used to assist clients in finding the best possible loan, the scope is rather wide. Hence, to make validation manageable, we suggest dissecting the algorithm into its individual components and validating them separately. For example, in order to give proper advice to the client regarding the loan, the chatbot algorithm probably performs several analyses:
- Natural language processing to understand the user’s questions
- Data validation and cleaning to interpret the client’s input
- Forecasting (cash flows, creditworthiness, …) to be able to compute the best possible loan
Each of these components is a model on its own and requires a separate analysis.
This scoping exercise also implies that the validator must be able to determine all possible model output values, which can be a complicated task (e.g. in the context of chatbots7).
Because of this complexity, it is mandatory that an AI algorithm comes with a monitoring platform that is able to determine whether the AI is operating within its proper scope. Such a framework would ensure that the chatbot's answers are acceptable and would simultaneously verify that the client's questions remain within scope (e.g. the chatbot should deflect a question on an insurance product when it is designed to deal with loans only).
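As a purely illustrative sketch of such a scope check, the snippet below trains a tiny intent classifier and deflects questions that are either predicted to be out of scope or classified with low confidence. The topics, phrases and threshold are toy assumptions, not part of any production system.

```python
# Toy sketch of a scope-monitoring filter for a loan chatbot.
# All topics, thresholds and training phrases are illustrative assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny toy corpus: in-scope (loan) vs. out-of-scope (insurance) questions.
questions = [
    "What is the interest rate on a 20-year mortgage?",
    "Can I repay my loan early?",
    "How much can I borrow for a house?",
    "Does my home insurance cover water damage?",
    "How do I file an insurance claim?",
]
labels = ["loan", "loan", "loan", "insurance", "insurance"]

intent_model = make_pipeline(TfidfVectorizer(), LogisticRegression())
intent_model.fit(questions, labels)

IN_SCOPE = {"loan"}
CONFIDENCE_THRESHOLD = 0.6  # illustrative cut-off; tune on real data

def within_scope(question: str) -> bool:
    """Accept only questions whose predicted intent is in scope and confident."""
    proba = intent_model.predict_proba([question])[0]
    label = intent_model.classes_[proba.argmax()]
    return label in IN_SCOPE and proba.max() >= CONFIDENCE_THRESHOLD

for q in ["Can I get a loan for a new car?",
          "Is hail damage covered by my policy?"]:
    print(q, "->", "answer" if within_scope(q) else "deflect")
```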
6 See https://www.federalreserve.gov/newsevents/speech/brainard20181113a.htm
Dependent models
In quantitative finance, we often encounter cases where multiple models depend on each other. One of the most straightforward examples is a market risk model (such as VaR or ES) which uses PnL vectors as inputs. Those vectors themselves are being computed by valuation models and the latter depends on market data generation algorithms like curve construction and volatility surface calibration routines.
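As a reminder of what the last link in such a dependency chain looks like, the sketch below computes a one-day historical VaR directly from a PnL vector; in a real setup that vector would itself be produced by valuation models fed by market data generators. The numbers are random placeholders.

```python
# Historical VaR computed from a PnL vector (random placeholder data).
import numpy as np

rng = np.random.default_rng(seed=42)
pnl_vector = rng.normal(loc=0.0, scale=1_000_000, size=250)  # 250 daily PnLs

def historical_var(pnl: np.ndarray, confidence: float = 0.99) -> float:
    """One-day historical VaR: the loss quantile at the given confidence level."""
    return -np.quantile(pnl, 1.0 - confidence)

print(f"99% one-day VaR: {historical_var(pnl_vector):,.0f}")
```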
The main difference when dealing with ML is that the model will very often have a circular dependency. Indeed, many AI models are trained in a dynamic fashion, as opposed to the more traditional credit or valuation models that are calibrated in an off-line regime. In dynamic training, the algorithm fits the data in small batches. This has the obvious advantage that the model dynamically adapts to changing environments. At the same time, it can make testing harder, because the state in which the trained model ends up depends on the sequence in which the training data was fed to the algorithm. To ensure reproducibility, a good approach is to implement the training procedure as a stateless operation. Concretely, this could be realized via a calibration routine that takes as input the previously calibrated model as well as the new training data and outputs the updated model, as sketched below. Such an implementation would allow the validator to test various conditions and edge cases more easily. It would also help the developer to reproduce issues in the training.
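A minimal sketch of such a stateless training step, assuming an incrementally trainable scikit-learn model (SGDClassifier is just one possible choice): the routine takes the previously calibrated model and a new batch and returns an updated copy, so any sequence of batches can be replayed exactly.

```python
# Stateless training step: the input model is left untouched and a new,
# updated model is returned, which makes batch sequences reproducible.
import copy
import numpy as np
from sklearn.linear_model import SGDClassifier

def train_step(previous_model, X_batch, y_batch, classes):
    """Return a new model updated with one batch; the input model is untouched."""
    updated = copy.deepcopy(previous_model)
    updated.partial_fit(X_batch, y_batch, classes=classes)
    return updated

# Synthetic data split into five batches of 20 observations each.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

model = SGDClassifier(random_state=0)
for start in range(0, 100, 20):
    model = train_step(model, X[start:start + 20], y[start:start + 20],
                       classes=np.array([0, 1]))
print(model.coef_)  # replaying the same batches in the same order gives the same state
```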
7 See https://en.wikipedia.org/wiki/Tay_(bot) for an example of a chatbot experiment that spiralled out of control.
Documentation
Another key aspect of a validation exercise is to verify the level and quality of the documentation. Many open-source libraries for machine learning are fairly well documented. However, when building an actual algorithm, a developer has to make a myriad of micro-decisions, especially when determining the layout of the algorithm. A very powerful way of documenting those is to use diagrams. There is no unique solution yet for visualizing AI layouts, but in the context of neural networks a few promising approaches already exist. One such example is TensorBoard's graph visualization approach.8
Another approach, often used in the context of Convolutional Neural Networks, is the so-called Krizhevsky diagram. Such diagrams depict the size of the various layers together with the convolution operators that map one layer onto the next.
Apart from the layout of the algorithm, the data itself is also often more intricate than in more traditional modelling approaches. Developers should therefore leverage dynamic inspection tools as a source of documentation that allows validators to analyze the data in greater detail. One open-source example is Google's Facets.9
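The snippet below is not Facets itself, only a lightweight illustration of the kind of data summary that can be produced and stored alongside the model documentation; the file and column names are hypothetical.

```python
# Lightweight data summary as documentation (hypothetical input file).
import pandas as pd

df = pd.read_csv("loan_applications.csv")   # hypothetical dataset

summary = pd.DataFrame({
    "dtype": df.dtypes.astype(str),   # data type per column
    "missing": df.isna().sum(),       # number of missing values
    "unique": df.nunique(),           # number of distinct values
})
print(summary)
print(df.describe(include="all").transpose())  # basic distributional statistics
```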
Model design and performance testing
Conceptual soundness
In the case of ML, conceptual soundness refers to three different aspects. First of all, the various algorithms should be properly implemented. Given the extensive use of existing open-source libraries with very large and active communities, this aspect can be considered to be managed by the community, although many validation teams will want to perform an independent check as well (a minimal sketch is given below).
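As an illustration of such an independent check, the sketch below fits an open-source estimator on synthetic data with known coefficients and verifies that they are recovered; the tolerances and data sizes are arbitrary choices.

```python
# Independent check: verify that the library recovers known coefficients
# on synthetic data with little noise.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
true_coef = np.array([2.0, -1.0, 0.5])
X = rng.normal(size=(10_000, 3))
y = X @ true_coef + 3.0 + rng.normal(scale=0.01, size=10_000)

model = LinearRegression().fit(X, y)
assert np.allclose(model.coef_, true_coef, atol=1e-2)
assert abs(model.intercept_ - 3.0) < 1e-2
print("independent check passed")
```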
The second component is the methodology by which input features are engineered and selected. This includes the fact that input data should be properly normalized, discrete categories should be correctly encoded (i.e. mapped to numerical attributes in a logical fashion), and features should not display large amounts of correlation. When engineering features, one also needs to be mindful of the dimensionality (i.e. the units) of the input data. The good advice from more traditional engineering approaches to prefer dimensionless quantities still holds in the era of AI. A final measure of conceptual soundness is the care the developer displayed when splitting the data into train, validation and test sets. The train set is used to calibrate the model. The validation set allows for an independent evaluation of the model while tuning hyperparameters (the parameters determining the layout, the fitting algorithm, etc.). Finally, the test set is used to evaluate the overall performance of the final model. Because of the huge number of parameters, the risk of overfitting is often perceived to be larger than with more classical modelling approaches, hence the need for a proper separation of the datasets.
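A minimal sketch of these preprocessing and splitting steps, assuming a hypothetical loan dataset with illustrative column names: numerical features are scaled, categorical ones are one-hot encoded, and separate validation and test sets are held out.

```python
# Preprocessing and train/validation/test split (hypothetical dataset and columns).
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.read_csv("loan_data.csv")            # hypothetical dataset
X, y = df.drop(columns="default"), df["default"]

preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["income", "loan_amount"]),                     # normalize numerical features
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["region", "product_type"]),  # encode categories
])

# 60% train / 20% validation / 20% test.
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

# Fit the preprocessing on the training data only, to avoid information leakage.
X_train_t = preprocess.fit_transform(X_train)
X_val_t, X_test_t = preprocess.transform(X_val), preprocess.transform(X_test)
```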
Model selection
As with any model, when building a machine learning application, choosing the right algorithm is crucial. Compared to traditional valuation models, where the selection of the right diffusion process can be verified by analyzing historical time series of market data, selecting the right ML algorithm often depends on a set of heuristics. One approach for a validator to develop an opinion on which algorithm to select is to analyze many models. Using open-source libraries like TensorFlow11 makes this easy once the data is brought into a standardized format. An alternative approach is to borrow ideas on selecting ML algorithms from e.g. scikit-learn (or to build your own selection process as part of an ML validation framework).12
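The sketch below illustrates the "analyze many models" idea: once the data is in a standardized (X, y) format, several candidate algorithms can be compared through the same API. The candidate list, the synthetic data and the scoring metric are illustrative choices only.

```python
# Compare several candidate algorithms through the same standardized API.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2_000, n_features=20, random_state=0)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1_000),
    "decision_tree": DecisionTreeClassifier(max_depth=5),
    "random_forest": RandomForestClassifier(n_estimators=100),
}

for name, estimator in candidates.items():
    scores = cross_val_score(estimator, X, y, cv=5, scoring="roc_auc")
    print(f"{name:20s} AUC = {scores.mean():.3f} +/- {scores.std():.3f}")
```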
Once the actual algorithm has been selected, the next step in the development process is to choose the right parameterization/layout. Indeed, as explained before, many ML algorithms have plenty of parameters that govern the training and the configuration of the model itself. As an example, when building a neural network one has to specify the size, layout and connection topology of the neurons, the type of activation functions, the method for updating the weights, etc. Sometimes, these parameters are chosen through a process called hyperparameter tuning, in which the developer trains the algorithm in different configurations in order to find the best possible one. For a model validator, verifying the hyperparameter tuning procedure is equally necessary in order to understand the stability of the resulting model, the details of the optimization function, the topology of the hyperparameter manifold, etc.
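As a sketch of what such a tuning exercise can look like, the snippet below runs an exhaustive grid search over a small neural network; the grid itself is an illustrative assumption, and real searches are often larger and may use random or Bayesian strategies instead.

```python
# Basic hyperparameter grid search for a small neural network (illustrative grid).
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=2_000, n_features=20, random_state=0)

param_grid = {
    "hidden_layer_sizes": [(16,), (32,), (32, 16)],  # layout of the network
    "activation": ["relu", "tanh"],                  # type of activation function
    "alpha": [1e-4, 1e-2],                           # L2 regularization strength
}

search = GridSearchCV(MLPClassifier(max_iter=500, random_state=0),
                      param_grid, cv=3, scoring="roc_auc")
search.fit(X, y)
print(search.best_params_, search.best_score_)
```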
11 See https://www.tensorflow.org/
12 See https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html
Backtesting
Analyzing model performance on previously unseen data is obviously crucial. As we already highlighted in the section on conceptual soundness, a proper separation between train, validation and test data is therefore important. To validate this in detail, one needs to gauge the implicit uncertainty caused by any particular choice of data separation. Hence, a good approach is to consider several ways of separating the data, both randomly and in a structured fashion.
A backtest should then measure the variability in the training as well as the change in algorithm behaviour when scoring. The traditional measures used in outcome analysis can be used for this (see the next section). In addition, a proper backtest also compares model performance to one or more alternative models (benchmarks, see the section on benchmarking).
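A minimal sketch of such an exercise on synthetic data: the same model is scored across random folds and time-ordered folds, and the spread of a traditional metric (AUC) across folds gives a feel for the uncertainty introduced by the choice of split.

```python
# Gauge the uncertainty introduced by the choice of data separation.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, TimeSeriesSplit, cross_val_score

X, y = make_classification(n_samples=2_000, n_features=20, random_state=0)
model = LogisticRegression(max_iter=1_000)

splitters = {
    "random folds": KFold(n_splits=5, shuffle=True, random_state=0),  # random separation
    "time-ordered folds": TimeSeriesSplit(n_splits=5),                # structured separation
}

for name, cv in splitters.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
    print(f"{name:20s} AUC = {scores.mean():.3f} (spread {scores.std():.3f})")
```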
Outcome analysis
Due to the large amounts of data and the complexity of ML models, simple summary metrics such as AUC and MSE are not sufficient. This is why various interactive performance analysis tools have been developed to help validators (and developers) increase their understanding of model behaviour.13
13 See S. Liu, "Towards better analysis of machine learning models: A visual analytics perspective", Visual Informatics, Vol. 1, Issue 1 (2017), https://www.sciencedirect.com/science/article/pii/S2468502X17300086
Squares14 is one such example, allowing machine learning experts to troubleshoot classification algorithms efficiently. The tool shows prediction score distributions at multiple levels of detail. When expanded, the classes are displayed as boxes, each box representing a training or test sample. The colour of a box encodes the class label of the corresponding sample and the texture indicates whether the sample is classified correctly (solid fill) or not (striped fill). The classes shown with the least amount of detail are displayed as stacks.
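The snippet below is not the Squares tool itself, merely a minimal illustration of the underlying idea of inspecting prediction-score distributions per class; the data and model are synthetic.

```python
# Inspect prediction-score distributions per true class (synthetic example).
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2_000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1_000).fit(X_train, y_train)
scores = clf.predict_proba(X_test)[:, 1]

fig, axes = plt.subplots(1, 2, sharex=True, figsize=(8, 3))
for ax, label in zip(axes, (0, 1)):
    ax.hist(scores[y_test == label], bins=25)   # score distribution per true class
    ax.set_title(f"true class {label}")
    ax.set_xlabel("predicted probability of class 1")
plt.tight_layout()
plt.show()
```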
Benchmarking
Benchmarking is a useful tool to understand the impact of changing assumptions as well as to perceive the relative strengths and weaknesses of an approach.
There are two broad families of ML benchmarks. First of all, a validator should change the hyperparameters to verify the impact on the outcome as one moves through parameter space, as well as what happens in edge cases where the model approaches the boundary of the parameter space. This analysis is very much like studying the impact of changing the number of factors when validating a Hull-White (HW) model for pricing Bermudan swaptions.
Secondly, when creating benchmarks it is also helpful to change the actual algorithm. Since many ML libraries have a standardized API to train models, creating benchmarks that are fundamentally different often does not require a large time investment. A good candidate benchmark model is a classification or regression tree, because such a model is very fast and, thanks to its simplicity, guarantees interpretability of the results.
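As a sketch of this second family of benchmarks, the snippet below compares a more complex candidate model against a shallow decision tree through the same fit/predict API, and prints the tree's rules to illustrate its interpretability. The gradient-boosting model merely stands in for whatever model is being validated.

```python
# Benchmark a more complex model against a simple, interpretable decision tree.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = make_classification(n_samples=2_000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

candidate = GradientBoostingClassifier().fit(X_train, y_train)
benchmark = DecisionTreeClassifier(max_depth=3).fit(X_train, y_train)

for name, model in [("candidate", candidate), ("tree benchmark", benchmark)]:
    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    print(f"{name:15s} AUC = {auc:.3f}")

# The benchmark tree is fully interpretable and can be printed as rules.
print(export_text(benchmark))
```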