Prioritizing model validation tasks

Large banks have over 3,000 models in production, and all of these algorithms require periodic review (re-validation). Given that an average validation takes around 4 to 6 weeks, this is a huge undertaking. Cycling through all models sequentially is therefore not the best strategy, because different models carry different amounts of model risk. A better strategy is so-called risk-based prioritization.

In this approach, a validation team organizes the (periodic) review by taking into account the risk associated with each model. This is most often implemented through so-called model risk tiering. The tier of a model is determined through qualitative and quantitative measures such as materiality, risk exposure and regulatory impact. Qualitative assessments such as model complexity are typically fairly constant over time. Quantitative metrics such as the amount of risk that is managed through the model, however, can change quickly, since those quantities depend on the context in which the model is used. This is the main driver for using quantitative methodologies as part of model risk tiering.

[Image: Frank H. Knight, The University of Chicago Centennial Catalogues]

When it comes to quantification, there are a few important concepts that have been introduced long ago by Knight. Let’s assume we have a quantity that we would like to model (such as the NPV of a swaption). This quantity has a (finite or infinite) number of potential outcomes. We talk about uncertainty when we do not know the probability of each of the outcomes, while we use the term risk when we do have such knowledge.

When it comes to building models, let us assume that we have a set of candidate models that can be used to describe the behavior of the quantity of interest. When we know the likelihood that a given model is the right one, we talk about model risk; otherwise we use the concept of model uncertainty. We speak about model ambiguity whenever there are multiple candidate probability measures on the set of models.

When we know the likelihood of the different outcomes, we can apply standard management practices such as diversification to manage the risk. When the probability distribution is not known, however, a typical approach is to work from the worst case and set aside a certain amount of capital as an insurance premium.

In the case of model risk quantification, a fairly standard approach is to use Bayesian analysis. In that case, we start by introducing for every model a prior distribution over the model parameters indicating the likelihood that any particular set of model parameters is correct. As we moreover have multiple models (each with their own set of parameters), we also have to assign a prior distribution indicating how likely it is that a given model is the correct one. This procedure allows for expert input, since a model validator can express through the prior which model, according to his or her experience, is most likely the right one.

As we then observe different outcomes of the quantity that is being modeled, we can compute the posterior probability of each candidate model, i.e. the likelihood that the model is true after having observed the outcomes. This allows us to update our knowledge about the various models as we collect more observations. Using these Bayesian probability distributions, we can then compute model-dependent quantities by taking the expectation over all available models and integrating over the distribution of the parameters. Concretely, by applying this approach we can compute the expected value of an option as it would be computed through a Black-Scholes formula, a Heston model and a local volatility model. The same approach also allows us to compute uncertainty measures on this expected value, such as the standard deviation of the NPV of an option, which could serve as a measure of model risk.
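A minimal sketch of this Bayesian updating, with entirely hypothetical ingredients: two Gaussian return models that differ only in their (fixed) volatility parameter, a uniform expert prior, and a handful of observed returns.

```python
import math

# two hypothetical candidate models for daily returns: zero-mean Gaussians
# that differ only in volatility (parameters are held fixed for brevity)
models = {"low_vol": 0.01, "high_vol": 0.03}
prior = {"low_vol": 0.5, "high_vol": 0.5}  # expert prior over the models

def likelihood(x, sigma):
    """Gaussian density of an observed return x under volatility sigma."""
    return math.exp(-x * x / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

observations = [0.002, -0.004, 0.001, 0.003]  # made-up observed returns

# sequential Bayesian update of the model probabilities
posterior = dict(prior)
for x in observations:
    unnorm = {m: posterior[m] * likelihood(x, s) for m, s in models.items()}
    total = sum(unnorm.values())
    posterior = {m: p / total for m, p in unnorm.items()}

print(posterior)  # small observed moves favour the low-volatility model
```

With a full implementation one would also keep a prior over each model's parameters; here the parameters are frozen so that only the model-level posterior is illustrated.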

When there are no probability distributions available, we can quantify model uncertainty using the so-called max-min principle. Assume that an agent has several actions at his disposal. Under the max-min principle, he will choose the action that maximizes his expected utility, where the expected utility is computed under the worst case over all available models (the minimal outcome). Concretely, if we are long an option, and we have Heston, local volatility and Black-Scholes available, each with say three possible parameter choices, then according to the above, we should value the option by applying the formula (and parameters) that lead to the lowest possible NPV (let us call this NPV[w]). If moreover we have the possibility to sell the option at a price P, then using the max-min principle, we will sell whenever P > NPV[w].
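The worst-case rule above is easy to sketch in code; all NPV figures and parameter labels below are made up for illustration.

```python
# hypothetical NPVs of the same long option position under different
# model/parameter combinations (all figures are invented)
npvs = {
    ("black_scholes", "params_1"): 10.2,
    ("black_scholes", "params_2"): 10.4,
    ("heston", "params_1"): 9.8,
    ("heston", "params_2"): 10.1,
    ("local_vol", "params_1"): 10.5,
}

# NPV[w]: the worst case over all available models and parameters
npv_w = min(npvs.values())

def should_sell(price):
    """Max-min rule: sell whenever the offered price beats the worst case."""
    return price > npv_w

print(npv_w, should_sell(10.0), should_sell(9.5))
```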

In conclusion, the need to organize models by their riskiness is driven by the fact that banks want to optimally allocate resources to the right models. Although such model tiering is driven both by qualitative and quantitative features, we have focused on the latter since these aspects tend to vary in a more dynamic fashion. We have highlighted two well-known methods. The Bayesian approach leads to more stable outcomes (due to model averaging) but is hard to implement (since assigning prior distributions is a difficult task). The max-min (i.e. worst-case) approach is often encountered in the industry but due to its nature may lead to more volatile outcomes.

Jos Gheerardyn

Generative adversarial networks

Generating realistic data is a challenge that is often encountered in model development, testing and validation. There are many relevant examples. In valuation modelling, we need market data such as interest rate curves and volatility surfaces. These objects have an intricate structure and strong constraints from no-arbitrage conditions. Hence, generating them randomly in a naive fashion is bound to fail. Random curves and surfaces could indeed expose issues with a model, but since such configurations are extremely unlikely to occur in practice, the added value of a test like this is rather low.

Another scenario where generation is needed is the case of sparse data. When studying e.g. low default portfolios, by definition there is not a lot of data to train the models on. Hence, proper data generation techniques are extremely important.

One traditional approach, especially in the context of time series, is to perform a principal component analysis on a set of data samples (such as IR curves). Once the principal components are determined, we can then construct the distribution of the strength of every component. Generating a new sample then simply means that we sample from these distributions and reconstruct the data using the components. Such an approach works fine for data sets that are not too large. However, for big datasets, or in case the data does not have a time dimension, we need other techniques.
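A stylized version of this PCA-based generation, in pure Python on a toy set of three-tenor interest rate curves. The data, the single retained component and the power-iteration shortcut are all purely illustrative.

```python
import math
import random

# toy "historical" IR curves over three tenors (made-up numbers)
curves = [
    [1.0, 1.5, 2.0],
    [1.2, 1.7, 2.2],
    [0.9, 1.4, 1.9],
    [1.1, 1.6, 2.1],
]

n, d = len(curves), len(curves[0])
mean = [sum(c[j] for c in curves) / n for j in range(d)]
centered = [[c[j] - mean[j] for j in range(d)] for c in curves]

# sample covariance matrix of the curve moves
cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
       for i in range(d)]

# leading principal component via power iteration
v = [1.0] * d
for _ in range(200):
    w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
    norm = math.sqrt(sum(x * x for x in w))
    v = [x / norm for x in w]

# distribution of the strength (score) of the component
scores = [sum(r[j] * v[j] for j in range(d)) for r in centered]
mu = sum(scores) / n
sigma = math.sqrt(sum((s - mu) ** 2 for s in scores) / (n - 1))

# generate a new curve: sample a score, reconstruct from the component
random.seed(0)
z = random.gauss(mu, sigma)
new_curve = [mean[j] + z * v[j] for j in range(d)]
print(new_curve)
```

In practice one would retain several components and use a proper eigensolver; the point is only the generate-by-sampling-scores mechanic.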

Generative Adversarial Networks (see Goodfellow et al.) are a very powerful technique to build extremely realistic data sets. The algorithm consists of two neural networks. The first network, the generator, creates candidates, while the discriminator attempts to identify whether the candidate originated from the real dataset or whether the generator created a synthetic sample. By repeating this procedure, the generator becomes more and more accurate at creating realistic-looking candidates, while the discriminator becomes better at identifying deviations from the real dataset. Once the system is trained, one can use the generator to create very realistic samples.

Apart from generating realistic datasets, GANs have many more applications in finance. To end with one interesting use case: Hadad et al. use GANs to decompose stock price time series into a market and an idiosyncratic component.

Sparse data challenges in IFRS 9

Building mathematical models requires data. Validating algorithms needs an even larger sample set because one has to determine if a model works well when used in real life (on data that the model has never encountered before).

For institutions building or validating credit models, this is often challenging on many fronts, first of all because a credit model requires long historical data spanning multiple economic cycles. Operationally, long timeframes imply that the data has been stored in different systems, with different schemas and conventions. To use the data, one therefore needs proper data governance.

The very nature of credit data also leads to many intricate mathematical problems. A typical example is related to building a model for a low default portfolio. Such a model needs actual defaults to train on, and these are by definition sparse, which leads to various problems such as class imbalance. To illustrate class imbalance, let us look at a credit model as a simple classifier, i.e. a model that determines whether or not somebody is going to default in the next quarter. Suppose that our data contains 0.01% defaults. In that case, a model that simply predicts that nobody at all will go bankrupt already has a 99.99% accuracy.
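The accuracy trap is easy to demonstrate with toy numbers: 10,000 loans, exactly one default.

```python
# 10,000 borrowers with a 0.01% default rate -> exactly one default
labels = [1] + [0] * 9_999

# the trivial model: predict that nobody defaults
predictions = [0] * len(labels)

accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)

# recall on the default class: how many actual defaults did we catch?
recall = sum(p == 1 and y == 1 for p, y in zip(predictions, labels)) / sum(labels)

print(accuracy, recall)  # -> 0.9999 0.0
```

This is why class-sensitive metrics (recall, precision, AUC) matter far more than raw accuracy on imbalanced credit data.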

In IFRS 9, the situation is even more challenging. As discussed in a previous post, IFRS 9 recognizes two types of performing credit exposure. Stage 1 exposures have experienced no significant change in credit quality since origination and impairments are based on a one-year expected credit loss (ECL). Stage 2 exposures have experienced significant deterioration and impairments will be based on lifetime ECL—that is, the probability of defaulting during the whole life of the exposure, taking into account current and future macroeconomic conditions. This lifetime ECL requires long datasets. As an example, assume that in order to calibrate our credit model, we would need 10Y of data. If we were building a regression model for a 20Y forward default, then naively we would need at least 30Y of data. This is often hard to come by.

There are several approaches that modelers can take to overcome this sparse data challenge and we list a few below.


Extrapolation

The first and most natural approach is to extrapolate to longer time scales. This can be done by e.g. assuming time-homogeneous behavior past the point for which we have enough data (the data horizon). This could mean keeping the default rate constant or using a constant Markov chain to govern rating migration. When validating such assumptions, time homogeneity should be verified statistically.
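Under the time-homogeneity assumption, a constant one-year migration matrix can simply be iterated out to long horizons. A sketch with a hypothetical three-state matrix (investment grade, sub-investment grade, default):

```python
# hypothetical one-year migration matrix: IG, sub-IG, default (absorbing)
P = [
    [0.95, 0.04, 0.01],
    [0.10, 0.85, 0.05],
    [0.00, 0.00, 1.00],
]

def step(dist):
    """One year of migration: multiply the state distribution by P."""
    return [sum(dist[i] * P[i][j] for i in range(3)) for j in range(3)]

dist = [1.0, 0.0, 0.0]          # start in investment grade
cum_pd = {}
for year in range(1, 21):
    dist = step(dist)
    cum_pd[year] = dist[2]      # probability of having defaulted by `year`

print(round(cum_pd[10], 4), round(cum_pd[20], 4))
```

The validation question is precisely whether applying the same P beyond the data horizon is statistically defensible.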

Data augmentation

Another approach to yield more data is so-called data augmentation, a technique to build more training data by applying transformations to the existing dataset.

Data augmentation is often used in image processing, where one applies e.g. mirroring, rotation, scaling and cropping to generate many more (labelled) samples. When dealing with credit data, similar transformations can be applied. As an example, if the dataset includes geographical information, we can e.g. shift the address to generate a new sample. In more general terms, if we discover (or assume) a symmetry in the dataset, we can leverage it to generate many more samples by applying that symmetry transformation to the data.

Another data augmentation technique is to add various levels of noise. This is an interesting technique in its own right to avoid overfitting.
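A minimal noise-augmentation sketch on a hypothetical credit record; the feature names and the noise level are made up, and note that the label is deliberately left untouched.

```python
import random

random.seed(1)

# one hypothetical credit record (features + label)
sample = {"income": 52_000.0, "debt_ratio": 0.35, "defaulted": 0}

def augment(record, rel_noise=0.02, n=5):
    """Create n noisy copies: perturb the features, never the label."""
    copies = []
    for _ in range(n):
        c = dict(record)
        c["income"] *= 1 + random.gauss(0, rel_noise)
        c["debt_ratio"] *= 1 + random.gauss(0, rel_noise)
        copies.append(c)
    return copies

augmented = augment(sample)
print(len(augmented))
```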

When validating models that rely on data augmentation, we have to verify the assumptions that underpin the data generation itself.

Synthetic data generation

We can generalize the above further by using mathematical models to actually generate the data. This idea is closely related to so-called generative adversarial networks, a topic of active research.

This approach means that in order to test whether our credit model can forecast defaults, we verify that it would work on data that was generated by another model. The generator itself will be parameterized and we can sample various parameter configurations to make sure that the credit model works well on a wide range of datasets.
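A toy parameterized generator along these lines: a logistic default model with hypothetical coefficients, sampled over several parameter configurations to produce a range of synthetic credit datasets.

```python
import math
import random

def make_generator(beta0, beta1):
    """Hypothetical default-data generator: PD is logistic in one feature."""
    def generate(n, rng):
        data = []
        for _ in range(n):
            x = rng.gauss(0, 1)                    # borrower feature
            pd = 1 / (1 + math.exp(-(beta0 + beta1 * x)))
            y = 1 if rng.random() < pd else 0      # default indicator
            data.append((x, y))
        return data
    return generate

rng = random.Random(0)

# sample several parameter configurations to stress the credit model
datasets = [make_generator(-3.0, b)(1_000, rng) for b in (0.5, 1.0, 2.0)]
default_rates = [sum(y for _, y in d) / len(d) for d in datasets]
print(default_rates)
```

A credit model that performs well across all such generated datasets is more likely to be robust; validating the generator then means checking that the real data is consistent with it.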

In order to validate such models, we have to verify that our real data is consistent with the generator.


In this post, we have reviewed a few techniques to deal with sparse data. Every approach has some very specific underlying assumptions that have to be validated carefully in order to guarantee that the model itself can be trusted.

The consequences of IFRS9 on Model Risk

IFRS 9 is an international financial reporting standard that was published in July 2014 by the IASB and went live this year. In the present post, we discuss a few key aspects of the framework that are particularly relevant from a model risk management point of view.

Fair value through profit or loss

The new standard has a large impact on the stability of the reported profit and therefore on capital consumption and profitability. This PnL volatility is caused in part by the fact that IFRS 9 relies more heavily on valuation models. Under IFRS 9, a financial instrument has to meet two conditions to be classified at amortized cost: the business model must be “held to collect” contractual cash flows until maturity, and those cash flows must be solely payments of principal and interest (the SPPI criterion). Financial instruments that do not meet these conditions are classified at fair value, with gains and losses recognized either in other comprehensive income (FVOCI) or in profit or loss (FVTPL). Hence, for those financial instruments, the value will be constantly adjusted to the current market value. In order to generate stable PnL, financial institutions need to put checks in place to guarantee that the market data feeding the valuation models is of high quality. Moreover, the pricing models themselves have to be rigorously tested in order to avoid unwanted volatility due to e.g. unstable calibration or poor numerical convergence.
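The classification logic above can be summarized in a few lines. This is a simplified sketch; the actual standard contains further nuances (such as the fair value option) that are ignored here.

```python
def ifrs9_category(business_model, passes_sppi):
    """Simplified IFRS 9 classification of a debt instrument.

    business_model: "held_to_collect", "held_to_collect_and_sell", or other
    passes_sppi: whether cash flows are solely payments of principal and interest
    """
    if passes_sppi and business_model == "held_to_collect":
        return "amortized_cost"
    if passes_sppi and business_model == "held_to_collect_and_sell":
        return "FVOCI"
    return "FVTPL"   # everything else goes through profit or loss

print(ifrs9_category("held_to_collect", True),
      ifrs9_category("held_to_collect", False),
      ifrs9_category("trading", True))
```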

The second reason for increased variability of the reported profit is more related to changes in the impairment model. IFRS 9 recognizes two types of performing credit exposure. Stage 1 exposures have experienced no significant change in credit quality since origination and impairments are based on a one-year expected credit loss (ECL). Stage 2 exposures have experienced significant deterioration and impairments will be based on lifetime ECL—that is, the probability of defaulting during the whole life of the exposure, taking into account current and future macroeconomic conditions. Although stage 2 obviously implies a higher default risk and therefore a shorter expected lifetime of the exposure, the transition from stage 1 to stage 2 typically introduces a discontinuous jump in the expected credit loss.

Due to larger credit provisions caused by lifetime expected credit loss for exposures in stage 2, the profitability of transactions for clients with a higher risk of migrating to stage 2 is under pressure. This is why some banks are developing asset-light "originate to distribute" business models, where these products are originated for distribution to third-party investors. This however requires financial institutions to be able to compute the fair value of each individual corporate loan in real time, in order to respond to market opportunities quickly. The valuation models that can be used for this type of pricing require large amounts of data and fairly complicated pricing routines. In order to guarantee the sustained commercial viability of such a business model, it is therefore important to continuously monitor both data and models to avoid mis-pricing.

Pricing at origination

Financial instruments that migrate to stage 2 require higher credit loss provisions and therefore consume more capital. In order to sustain profitability, banks must take this into account when pricing new transactions. Such a valuation strategy has a few key challenges. First of all, one needs to be able to compute the probability of migrating between the various stages. This can e.g. be done through a Monte Carlo computation in which one simulates migration over the lifetime of the portfolio and continuously computes the resulting ECL. In the simulation, this leads to provisions (and hence an opportunity cost) through time, which is used to compute a cost of origination. This approach leads to smooth pricing behaviour since the simulation weighs all possible scenarios in proportion to their likelihood in a continuous fashion.
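A toy version of such a simulation: the annual migration probability, the provision levels and the horizon below are all hypothetical, and migration is treated as one-way for brevity.

```python
import random

random.seed(42)

P_MIGRATE = 0.05                 # hypothetical annual prob of stage 1 -> stage 2
ECL_1Y, ECL_LIFE = 0.002, 0.03   # hypothetical provisions (fraction of exposure)
YEARS, PATHS = 10, 20_000

total_cost = 0.0
for _ in range(PATHS):
    stage = 1
    cost = 0.0
    for _ in range(YEARS):
        if stage == 1 and random.random() < P_MIGRATE:
            stage = 2            # one-way migration for simplicity
        cost += ECL_1Y if stage == 1 else ECL_LIFE
    total_cost += cost

# expected provision cost over the lifetime, to feed into pricing at origination
expected_provision_cost = total_cost / PATHS
print(round(expected_provision_cost, 4))
```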

Another challenge related to pricing is the fact that banks have to take into account the current economic environment for the ECL computation. This means that we need to use so-called point-in-time (PIT) estimates for the probability of default, as opposed to the more standard through-the-cycle (TTC) estimates that are used in the context of Basel. Creating PIT from the long-range TTC estimates requires the introduction of a cyclical component that can be noisy to estimate. It is however important for banks to fully understand the uncertainty in the PIT estimate, because an inaccuracy will lead to unanticipated volatility in the credit loss provisions, which will pressure the long-term profitability of the client base.
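One common way to link the two estimates is a one-factor (Vasicek-style) adjustment, where a single systematic factor shifts the default threshold. A sketch; the asset correlation rho and the economic state z are hypothetical inputs, not calibrated values.

```python
import math
from statistics import NormalDist

N = NormalDist()

def pit_pd(ttc_pd, z, rho=0.10):
    """PIT PD from a TTC PD under a one-factor model.

    z: state of the economy in standard deviations (negative = downturn).
    rho: hypothetical asset correlation.
    """
    threshold = N.inv_cdf(ttc_pd)
    return N.cdf((threshold - math.sqrt(rho) * z) / math.sqrt(1 - rho))

# a downturn (z = -2) should raise the PD, a boom (z = +2) should lower it
print(pit_pd(0.02, -2), pit_pd(0.02, 0), pit_pd(0.02, 2))
```

The noisy part in practice is estimating z (and rho) from current macro data; the formula itself is the easy bit.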

Early warning systems

Apart from introducing flexible pricing that would allow banks to incorporate the cost of increased PnL volatility, institutions might also introduce early warning systems that detect exposures at high risk of migrating to stage 2. Such a system however requires a sophisticated algorithm that can not only compute this migration probability accurately but can also evaluate the impact of remediating actions with a sufficient degree of certainty. Building such models therefore requires a large historical dataset with high-quality data.


IFRS 9 has prompted a flurry of activity in mathematical modelling, first of all because impairments are accounted for differently. However, financial institutions have quickly realized that this new standard impacts profitability, paving the way for new business models, workflows and practices. Many of these new initiatives rely heavily on sophisticated mathematical models. Properly managing the risk associated with these models will give banks a sustainable competitive edge.

Fairness and AI

More and more companies are automating processes with the help of ML. This has tremendous advantages, since an algorithm scales in an almost unlimited fashion and, when rich datasets are available, can often detect patterns in data that humans cannot discover. As a well-known illustration, think about how Deepmind used an ML algorithm to optimize the energy consumption of Google's data centers. When algorithms are used to make important decisions that impact people's lives, such as deciding on medical treatment, granting a loan, or performing risk assessments in parole hearings, it is of paramount importance that the algorithm is fair. Because the models become ever more complicated, this is however not easy to assess. The public, legislators and regulators are all aware of this issue; see e.g. the report on algorithmic systems, opportunities and civil rights by the Obama administration and (in Europe) recital 71 of the GDPR.

In order to detect bias/unfairness, it is important to come up with a proper (mathematical) definition, to make sure that we can measure deviations. Below we describe a few alternatives.

Let us assume we are building a model to determine whether somebody may receive a loan. Typically such a model will use information (so-called attributes) such as your credit history, marital status, education and profession in order to estimate the probability that you will be able to pay off your debts. The dataset will also include historical data on defaults (i.e. people who were not able to repay their loan). Now, given the above, the financial institution wants to make sure that the algorithm is fair with respect to gender, skin colour, etc. These are called protected attributes.

A traditional solution is to simply remove these protected attributes from the dataset altogether. Of course, when testing such a model, simply changing the value of any of the protected attributes is not going to impact the output of the model. However, through so-called redundant encodings an algorithm might be able to guess the value of the protected attributes from other information. As an example, let us assume we train our credit model on a dataset representing the people of Verona. The dataset has a protected attribute called ancestor that can take two values (Capulet and Montague). If people whose ancestor attribute equals Capulet live primarily in the Eastern part of the city while the Montagues live in the Western half, then our algorithm can infer the ancestor attribute (statistically) by looking at the address. Hence, simply removing the ancestor attribute from the dataset does not help.

Another approach is so-called demographic parity. In that case one requires that the membership of a protected attribute is uncorrelated with the output of the algorithm. Let us assume that granting a loan is indicated by a target binary variable Y = 1 (the ground truth) and that the protected (binary) attribute is called A. The forecast of the model is called Z. Demographic parity then means that

Pr[Z=1 | A=0] = Pr[Z=1 | A=1]

This notion however is also flawed. A key issue with the approach is that if the ground truth (i.e. default in our example) does depend on the protected attribute, the perfect predictor (i.e. Y=Z) cannot be reached, and therefore the utility and predictive power of the model are reduced. Moreover, by requiring demographic parity, the model has to yield (on average) the same outcome for the different values of the protected attribute. In our example, demographic parity would imply that the model would have to refuse good candidates from one category and accept bad candidates from the other category in order to reach the same average level. Concretely, assuming that the Montagues have historically defaulted on 3% of their loans while the Capulets only on 1%, demographic parity would typically work to the disadvantage of the Capulets.
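Demographic parity is straightforward to check empirically once the acceptance rates per group are in place; the decisions below are toy data.

```python
# toy decisions: (A, Z) = (protected attribute, model output)
decisions = [(0, 1), (0, 1), (0, 0), (0, 0),
             (1, 1), (1, 0), (1, 0), (1, 0)]

def acceptance_rate(data, a):
    """Pr[Z=1 | A=a] estimated from the sample."""
    group = [z for ai, z in data if ai == a]
    return sum(group) / len(group)

# demographic parity holds when this gap is (close to) zero
parity_gap = abs(acceptance_rate(decisions, 0) - acceptance_rate(decisions, 1))
print(parity_gap)   # 0.5 - 0.25 = 0.25 here, so parity is violated
```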

A more subtle suggestion for fairness was proposed by Hardt et al:

Pr[Z=1 | Y=y, A=0] = Pr[Z=1 | Y=y, A=1]

for y=0,1. In other words, relatively speaking, the model has to be right (or wrong) equally often for either value of the protected attribute. This definition incentivises more accurate models, and the property is called oblivious as it depends only on the joint distribution of A, Y and Z.
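The criterion of Hardt et al. is just as easy to test empirically; here on toy triples that happen to satisfy it exactly.

```python
# toy triples: (A, Y, Z) = (protected attribute, ground truth, model output)
data = [
    (0, 1, 1), (0, 1, 0), (0, 0, 0), (0, 0, 0),
    (1, 1, 1), (1, 1, 0), (1, 0, 0), (1, 0, 0),
]

def cond_rate(a, y):
    """Pr[Z=1 | Y=y, A=a] estimated from the sample."""
    group = [z for ai, yi, z in data if ai == a and yi == y]
    return sum(group) / len(group)

# the criterion holds when both conditional gaps vanish:
# equal true-positive rates (y=1) and equal false-positive rates (y=0)
tpr_gap = abs(cond_rate(0, 1) - cond_rate(1, 1))
fpr_gap = abs(cond_rate(0, 0) - cond_rate(1, 0))
print(tpr_gap, fpr_gap)   # both 0.0 for this dataset
```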

In this short post, we have reviewed a few different notions of fairness, focussing mostly on the simplest case where the protected attribute and the ground truth are both binary variables. The oblivious property introduced above is an interesting approach that can be easily computed. The simple (binary) set-up we discussed can be extended to the multinomial and continuous case and as such can serve as an accurate test to detect fairness in algorithms.

AI and model validation

Model validation is a labor-intensive profession that requires specialists who understand both quantitative finance as well as business practices. Validating a single valuation model typically requires between 3 and 6 weeks of hard work focusing on many different topics. In this first article, I would like to highlight a few ideas on how to automate such analyses.

(See the Global Model Practice Survey 2014 by Deloitte.)


Models need data. Hence, a large portion of the time in model validation is spent on managing data. To give a few examples, the validator has to make sure that the input data is of good quality, that the test data is sufficiently rich, and that there are processes in place to deal with data issues. Broadly speaking, input data comes in two flavours: time series, and multi-dimensional data that is not indexed primarily by time.

Time series

Time series are ubiquitous in finance. Valuation models typically use quotes (standardized prices of standardized contracts) as input data. Since the market changes, time series of quotes are abundant. To determine the quality of quotes one can run anomaly detection models on historical data (i.e. on time series). Many such algorithms exist. Simple ones use time series decomposition such as STL to subtract the predictable part and study the distribution of the resulting residual to determine the likelihood of an outlier. More modern alternatives use machine learning algorithms. In one approach, a model is trained to detect anomalies using labeled datasets (such as the ODDS database). Another family of solutions uses an ML model to forecast the time series. In that method, anomalies are flagged when the prediction differs significantly from the realization.
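A minimal residual-based detector in this spirit: a moving-window mean as a crude stand-in for an STL trend, with a 3-sigma rule on the residuals. The quote series is made up.

```python
# made-up quote series with one injected spike at index 5
series = [10.0, 10.2, 9.9, 10.1, 10.0, 14.5, 10.1, 9.8, 10.2, 10.0]

window = 3          # look at up to 3 neighbours on each side
anomalies = []
for i in range(len(series)):
    lo, hi = max(0, i - window), min(len(series), i + window + 1)
    neighbors = [series[j] for j in range(lo, hi) if j != i]
    mean = sum(neighbors) / len(neighbors)
    sd = (sum((x - mean) ** 2 for x in neighbors) / (len(neighbors) - 1)) ** 0.5
    # flag the point if it sits more than 3 standard deviations from its peers
    if sd > 0 and abs(series[i] - mean) / sd > 3:
        anomalies.append(i)

print(anomalies)  # -> [5]
```

A production system would use a proper seasonal/trend decomposition, but the detect-on-residuals mechanic is the same.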

Data representativeness

In some cases, the time component is less (or not) important. When building regression or classification models, one often uses a large collection of samples (not indexed by time) with many features. In that case, due to the curse of dimensionality, one would need an astronomical amount of data to cover every possibility. Most often, the data is not spread homogeneously over feature space but instead is clustered into similar-looking parts. When building a model, it is important to make sure that the data on which the model is going to be used is similar to the data on which it has been trained. Hence, one needs to understand whether a new data point is close to the existing data. There are many clustering algorithms (like e.g. DBSCAN) that can perform this analysis generically.
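A minimal coverage check in this spirit (not DBSCAN itself): compare the distance from a new point to its nearest training sample against the typical nearest-neighbour distance inside the training set. The points and the factor of 2 are arbitrary choices for illustration.

```python
# toy 2-D training set with two clusters
train = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (1.0, 1.0), (1.1, 1.0)]

def dist(p, q):
    return ((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2) ** 0.5

def nn_dist(p, points):
    """Distance from p to its nearest (other) point in `points`."""
    return min(dist(p, q) for q in points if q != p)

# typical spacing inside the training data
typical = max(nn_dist(p, train) for p in train)

def in_coverage(p, factor=2.0):
    """Is the new point close enough to the training data to trust the model?"""
    return nn_dist(p, train) <= factor * typical

print(in_coverage((0.05, 0.05)), in_coverage((5.0, 5.0)))
```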

Model dependencies

A model inventory is a necessary tool in a model governance process. On top of keeping track of all the models used, it is extremely valuable to also store the dependencies between models. For instance, when computing VaR one needs the historical PnL of all portfolios. If these portfolios contain derivatives, we need models to compute their net present value. All the models used within an enterprise can be represented as a graph in which the nodes are the models and the edges represent the data flowing between them. Understanding this topology makes it easier to trace back model issues.

[Figure: model dependency graph]
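Tracing issues through such a graph can be sketched in a few lines; the model names below are made up.

```python
# minimal model dependency graph: each model lists the models whose
# output it consumes (hypothetical names)
deps = {
    "VaR": ["historical_PnL"],
    "historical_PnL": ["swaption_pricer", "swap_pricer"],
    "swaption_pricer": [],
    "swap_pricer": [],
}

def downstream(model, deps):
    """All models that (transitively) depend on `model`."""
    hit = set()
    changed = True
    while changed:
        changed = False
        for m, uses in deps.items():
            if m not in hit and (model in uses or hit & set(uses)):
                hit.add(m)
                changed = True
    return hit

# an issue in the swaption pricer propagates to PnL and hence to VaR
print(downstream("swaption_pricer", deps))
```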


ML can help benchmark models. First of all, in the case of forecasting models, we can easily train alternative algorithms on the realizations. This helps to understand the impact of changing the underlying assumptions of the model.

However, even in case such realizations are not available, we can still train a model to mimic the algorithm that is used in production. We can use such surrogate models (see e.g. this paper from our colleagues at IMEC) to detect changes in behaviour. As an example, suppose we are monitoring an XVA model. We can train an ML algorithm to predict the changes in the XVA amounts of a portfolio when the market data changes. Such a surrogate can be used to detect e.g. instabilities in the XVA computation. An additional benefit of this approach is that one can estimate sensitivities from the calibrated surrogate.
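A stylized version of this idea: fit a linear surrogate to a (hypothetical, here perfectly linear) production pricer, then flag inputs where the production output drifts away from the surrogate's expectation. Real XVA engines are of course far from linear; the monitoring mechanic is the point.

```python
def production_model(x):
    """Stand-in for the real pricing engine (hypothetical, linear)."""
    return 2.5 * x + 1.0

# collect input/output pairs from the production model
xs = [i / 10 for i in range(20)]
ys = [production_model(x) for x in xs]

# least-squares fit of the surrogate y = a*x + b
n = len(xs)
sx, sy = sum(xs), sum(ys)
sxx = sum(x * x for x in xs)
sxy = sum(x * y for x, y in zip(xs, ys))
a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
b = (sy - a * sx) / n

def flag_instability(x, y_prod, tol=0.1):
    """Flag production outputs that deviate from the surrogate's prediction."""
    return abs(y_prod - (a * x + b)) > tol

print(flag_instability(0.5, production_model(0.5)))        # consistent output
print(flag_instability(0.5, production_model(0.5) + 1.0))  # suspicious jump
```

The fitted slope `a` also directly estimates the sensitivity of the production output to the input, which is the side benefit mentioned above.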

Jos Gheerardyn

Welcome to our blog

Welcome to the new Yields blog! Our blog gives us a unique opportunity to share news and updates, while also offering a place for us to interact with our model validation community.
