Why we are disrupting model risk management

Two years ago, we embarked on a fantastic adventure to build a platform for better model risk management. Although this is often considered a niche topic, our determination was rooted in the firm belief that we need better tools to build robust models. The journey since then has not only confirmed but also consolidated that view. With the advent of machine learning and AI applications, the lack of such tools has become even more acute: business-critical applications such as health care, self-driving cars or finance need high-availability algorithms with a probability of failure below 0.001%.

This contribution is written for the model risk executive who is looking for information on where this field is headed in order to build a future-proof strategy. Our vision will clarify why disruption is needed to allow people to manage the risk of advanced analytics.

George Box

All models are wrong

This quote from George Box, taken from his 1976 paper on Science and Statistics, clearly points to the inevitability of model risk. As a direct consequence, model risk management is concerned with managing exactly this risk that models will inevitably produce false results. Embedding this certainty in a company’s approach to analytics is exactly what model risk management is about.

Since model failure is an indivisible part of modeling, the scope of model risk management is extremely wide and is absolutely not restricted to regulatory models. A proper model risk management framework therefore covers all analytics (such as credit & liquidity risk, scorecards, decision models, fraud & AML detection, valuation and market risk, chatbots and marketing analytics to name a few). In addition, managing this risk efficiently requires sound governance that impacts the entire organization.

This is why proper risk management is governed through the so-called three lines of defence. The first line is responsible for delivering an exhaustively tested and documented model. The second line assesses this independently and challenges the first line in case of doubt. The third line verifies both qualitatively and quantitatively that the first and second lines work together correctly, according to company-wide standards, both in design and in actual operations. In other words, model risk is everybody’s responsibility, from the model user and the quant to the manager and the board member.

Although the above principles are well established, on the ground we notice that model risk management is sometimes narrowed down to a set of monotonous tasks: ticking regulatory boxes, performing repetitive jobs such as reimplementing a model somebody else has created, and generating massive amounts of documentation that no human can ever consume in its entirety. Because of this aura of boredom, many talented people prefer to move into model development, which is considered the place where the action is and where budgets are allocated. At the same time, model risk teams face massive challenges finding and retaining good people.

At the managerial level too, we often see that businesses consider building new analytics - ideally incorporating AI - as the primary route to competitive advantage. In model building, budgets are large and can be allocated fast. Model risk, and especially the second or third line of defence, is considered to be a mandatory cost center.

Over the last 15 years I have worked in institutions both large and small, developing models that managed some of the most complicated derivatives portfolios and engineering algorithms that automatically controlled hundreds of MW of industrial power consumption. Looking back at this, and comparing it with the tools that are available in 2019, I would like to argue that this view is flawed.

First of all, thanks to the large efforts of the open source community, there is nowadays an abundance of high-quality analytics. This is especially true in machine learning, where all important technology firms have open sourced considerable parts of their algorithmic frameworks (see e.g. TensorFlow, Microsoft Cognitive Toolkit, etc.). Even the more classical fields such as valuation models now have a fairly rich set of libraries (such as QuantLib and ORE). All of this means that using sophisticated analytics will very soon stop being a competitive advantage. This transformation is accelerating due to the advent of auto-ML. These frameworks, both commercial and open-source, allow virtually everybody who has a dataset to train hundreds of sophisticated machine learning algorithms (such as neural networks) and deploy these models instantaneously. A modeling team that is ready to leverage those tools and that has the data can build advanced models in days.

However, there is a vast gap between using an (ML) algorithm in a one-off proof-of-concept mode and running that algorithm stably in production. The former simply requires one to collect the data and perform a few trial-and-error iterations. Serving algorithms in production, on the other hand, means managing the dataflow, guarding data quality, defining fall-back strategies, monitoring model performance continuously and retraining/recalibrating the model whenever needed. This is a daunting task because even simply guaranteeing reproducibility of a model - which is a prerequisite to more subtle issues like bias and explainability - is often very hard to realize. This is why the field of machine learning is currently contributing to a reproducibility crisis in science.

To highlight the challenges of robust analytics even more, I would like to point out that many front office quant teams (i.e. first line of defence) in banks often become an inseparable part of their own analytics. These teams are constantly needed to fix issues as they appear, to fine-tune calibrations and to make small modifications to deal with additional edge cases in a continuous fashion. When the quants are gone, the models have to shut down. We call this the hybrid human-algo approach.

Transitioning to the AI era

This trivialization of model development is contributing to the high growth rate of the number of models found in financial institutions. In a recent study by McKinsey, the yearly growth rate was estimated to be approximately 20%. This implies that the hybrid human-algo approach cannot be maintained and that we need to transition to highly integrated model risk management. In other words, an institution that endeavours to capture the full potential of machine learning will have to put model risk management first.

Let me detail how this would work in practice. At the beginning of the model development cycle, a project team is assembled. In the first design stage, this team studies the potential introduction of a new model to solve a concrete business problem. When requirements are gathered the team should immediately take into account the fact that this model will at some point fail. In order to manage that risk, the design of the model should focus on risk management, studying data quality, quantifying model risk and determining the feasibility of monitoring. Overlaying those risks and challenges with the estimated benefits and the risk appetite of the bank will allow the team to decide quickly what solution (a complex model, a simple one or no model at all) will be fit for purpose. This exercise at the beginning of the cycle will yield a design that allows for models that can be deployed in a robust fashion with clearly defined limits that can be monitored and managed in a completely automated fashion.

Thanks to the abundance of open source analytics and auto-ML solutions, the subsequent implementation stage will again be mostly concerned with model risk related topics. Key points to address here are the quality and representativeness of the data, the explainability of the model, and the level of testing and documentation that is feasible. In other words, at this stage the team can build new models in days, but the challenge is to determine which model fits best within the risk framework as defined in the first stage of the project. By putting model risk management first, more people will find their way into the second and third line since it will suddenly be clear that even more ingenuity is needed to understand the risks of models breaking down and to detect and explain model failure.

This vision can only be realized when an institution has the technology to manage its model life cycle correctly. A platform that integrates model development with validation and monitoring, that allows for agile workflows and close interaction between first, second and even third line. A platform that takes away the monotonous, repetitive tasks and allows risk managers to deep-dive into models, dissecting algorithm failure, explaining decisions to clients, and detecting issues in real-time.

Such a platform should also provide a more interactive view into models. If we want to manage the certainty that our models will fail, we have to replace static documents with dashboards showing real-time model health, interactive views into complex data pipelines and visualizations of model limits. As the world is slowly discovering that building a mathematical model is trivial, the industry is going to shift towards agile technology platforms that give institutions the freedom to manage algorithms the same way that technology giants currently deploy code continuously.

The benefits

Recent advances in both technology and algorithms have shown previously unimaginable results, ranging from discovering new chess strategies to generating text that reads like it was written by a human. This full potential of AI can only be unlocked through a model risk centric approach. Model risk management makes the risks clear, and allows us to think about mitigation strategies. The added value of AI is often incremental - we build better credit risk models, detect more fraud, or price derivatives faster. Capturing that incremental value sustainably over time means that we avoid model failure that would annihilate that value instantaneously.

Showing consistent behavior will also generate trust in AI, which is another barrier to its wide adoption. People have to trust a machine, and this is only thinkable when ML behavior can be explained and when it shows consistent performance over a long period of time. When we board an aeroplane, we put our faith in the hands of the engineers who have designed the machine by accurately controlling the risks involved. When we build AI to perform surgery, we need a similar mindset that puts the risks first.

This is our vision. This is why we have created Yields.io.

Jos Gheerardyn, May 28 2019

Good scenario generation for better model risk management

In this post I would like to share a few thoughts about scenario generation, as I believe it is crucial when analyzing mathematical models.

Property based testing

First of all, as with any piece of software, an algorithm requires extensive testing. One particularly useful approach is so-called property-based testing. The epitome of a library implementing this line of thinking is QuickCheck:

QuickCheck is a library for random testing of program properties. The programmer provides a specification of the program, in the form of properties which functions should satisfy, and QuickCheck then tests that the properties hold in a large number of randomly generated cases. Specifications are expressed in Haskell, using combinators provided by QuickCheck. QuickCheck provides combinators to define properties, observe the distribution of test data, and define test data generators.

The challenges of realistic testing

Generating the actual test cases is an important feature of a property based testing approach. Traditionally testers use existing (i.e. historic) data or alternatively a generator to create synthetic cases.

These ideas are directly relevant to financial engineering. When building e.g. an FX option volatility surface calibration, it makes sense to verify that the fitting algorithm works equally well on a Monday as it does on a Friday. This is a test that can be run easily using historical data. On the other hand, when we want to verify that our new ML-based recommendation engine works correctly on a noisier dataset, this can be tested using synthetic data generators.

Libraries like QuickCheck or ScalaCheck can be used to generate scenarios for testing mathematical algorithms. However, the off-the-shelf generators in these open source initiatives are still fairly basic and do not allow for more subtle quantitative analysis so custom work is still needed.
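To make the idea concrete, here is a minimal hand-rolled sketch of a property-based test in Python, using only the standard library rather than QuickCheck or ScalaCheck. The property checked is put-call parity for a Black-Scholes pricer. The function names, the scenario ranges and the tolerance are all illustrative assumptions, not part of any of the libraries mentioned above.

```python
import math
import random


def norm_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def bs_price(spot, strike, vol, rate, maturity, is_call):
    """Black-Scholes price of a European option on a non-dividend-paying asset."""
    sqrt_t = math.sqrt(maturity)
    d1 = (math.log(spot / strike) + (rate + 0.5 * vol ** 2) * maturity) / (vol * sqrt_t)
    d2 = d1 - vol * sqrt_t
    discounted_strike = strike * math.exp(-rate * maturity)
    if is_call:
        return spot * norm_cdf(d1) - discounted_strike * norm_cdf(d2)
    return discounted_strike * norm_cdf(-d2) - spot * norm_cdf(-d1)


def random_scenario(rng):
    """Hypothetical generator: the ranges are illustrative, not calibrated to any market."""
    return {
        "spot": rng.uniform(10.0, 200.0),
        "strike": rng.uniform(10.0, 200.0),
        "vol": rng.uniform(0.05, 0.8),
        "rate": rng.uniform(-0.01, 0.1),
        "maturity": rng.uniform(0.05, 10.0),
    }


def parity_holds(scenario, tol=1e-8):
    """Property: call - put equals spot minus the discounted strike."""
    call = bs_price(is_call=True, **scenario)
    put = bs_price(is_call=False, **scenario)
    forward_value = scenario["spot"] - scenario["strike"] * math.exp(
        -scenario["rate"] * scenario["maturity"]
    )
    return abs((call - put) - forward_value) <= tol * max(1.0, scenario["spot"])


def check_property(prop, n_cases=1000, seed=42):
    """Return the first scenario that violates the property, or None if all pass."""
    rng = random.Random(seed)
    for _ in range(n_cases):
        scenario = random_scenario(rng)
        if not prop(scenario):
            return scenario
    return None


counterexample = check_property(parity_holds)
```

A uniform generator like this one is exactly the kind of basic building block referred to above: it ignores correlations between inputs and says nothing about how plausible a scenario is, which is where the custom work comes in.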

Risk management

There is however another reason why scenario generation is important. When making informed decisions, analyzing the outcome of plausible scenarios is extremely helpful and one of the basic concepts of risk management. Hence, when managing model risk it is crucial to understand how mathematical models behave under various scenarios. Rather than using scenarios only to detect bugs in the algorithms, we can also use them to understand the behaviour of the models we are testing. Below we give a few considerations to illustrate the added value of scenario generation in the context of risk management.

  • When validating mathematical models, it is important to determine under what conditions they will break. This was illustrated during the great financial crisis when many banks were unable to use their derivatives pricing models because the market was in a dislocated condition which was an unreachable state by the underlying diffusion models. Determining this so-called region of validity requires efficient generation of test scenarios.
  • Regulators analyse capital requirements of financial institutions through stress tests. In order to do this accurately, it is important to create a sufficiently complete set of scenarios to sample possible future outcomes as accurately as possible.
  • Sensitivity analysis of a model helps to understand which parameters play an important role. However, more often than not, sensitivity analysis is only performed in the neighbourhood of the current operating point of the model. In that case, the reaction of the algorithm on more global changes is hard to predict. Scenario generation on the other hand can be used for global sensitivity analysis.
  • From a model risk management point of view, it is important to understand the impact of changing the assumptions of the model which is why we create benchmark models. Once created, it is crucial to understand where the models diverge from each other. Sampling this precisely requires generating a dense enough set of scenarios.
  • Finally, in model development, it may occur that there is not enough data to accurately train an algorithm. In that case, data augmentation techniques can be used to generate alternative datasets.

For all of these reasons, scenario generation is a prerequisite for proper model risk management.
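The difference between local and global sensitivity analysis mentioned in the list above can be illustrated with a small sketch. The toy model, the parameter bounds and the helper names below are hypothetical. The scan is deliberately crude: it only reports the range of finite-difference sensitivities over randomly generated scenarios and is not a variance-based method.

```python
import random


def toy_model(x, y):
    """Hypothetical model: almost flat in x near the origin, steep further away."""
    return x ** 3 + 0.5 * y


def sensitivity(f, point, index, bump=1e-4):
    """Central finite-difference sensitivity of f to input `index` at `point`."""
    up = list(point)
    down = list(point)
    up[index] += bump
    down[index] -= bump
    return (f(*up) - f(*down)) / (2.0 * bump)


def global_sensitivity_range(f, bounds, index, n_scenarios=500, seed=0):
    """Smallest and largest absolute sensitivity over uniformly sampled scenarios."""
    rng = random.Random(seed)
    values = []
    for _ in range(n_scenarios):
        point = [rng.uniform(low, high) for low, high in bounds]
        values.append(abs(sensitivity(f, point, index)))
    return min(values), max(values)


# Local analysis around an assumed operating point suggests x hardly matters ...
operating_point = [0.0, 1.0]
local_x = sensitivity(toy_model, operating_point, 0)

# ... while a scan over generated scenarios shows it can dominate elsewhere.
low_x, high_x = global_sensitivity_range(toy_model, [(-2.0, 2.0), (0.0, 2.0)], 0)
```

In this toy setting the local sensitivity to x is close to zero at the operating point, whereas the scenario scan finds regions where it is large.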


Being able to test models on various scenarios, both historical and synthetic, is an important aspect of validation. We see interesting projects in this space but a lot of work is still needed to make efficient scenario generation easily accessible to both model developers and validators.

Jos Gheerardyn

Model Risk, AI and Machine Learning Events of 2019

2019 is here and various events in the area of Risk, AI and Banking are waiting for us! Yields.io created a list of some of these conferences, so you can start preparing for this year!

Conference AI Masters, 24-25 January, 2019 | Berlin, Germany 

AAAI 2019 : AAAI Conference on Artificial Intelligence, 27 January - 1 February, 2019 | Hawaii, USA

13th Edition Model Risk, 28-30 January, 2019 | San Francisco, USA

Paris Fintech Forum, 29-30 January, 2019 | Paris, France (Yields.io was on stage)

Banking Operational Risk Management Summit, 12-14 February, 2019 | Vienna, Austria

Finastra Universe Paris, 20-21 February, 2019 | Paris, France (Yields.io was on stage)

Machine Learning Prague 2019, 22-24 February, 2019 | Prague, Czech Republic

GARP Risk Convention, 25-27 February, 2019 | New York City, USA

Model Risk Model Management, 27-28 February, 2019 | London, UK

Quant Summit Europe, 5-8 March, 2019 | London, UK

5th Annual New Generation Operational Risk: Europe, 12-13 March, 2019 | London, UK

The 8th XVA Conference, 13-15 March, 2019 | London, UK (Yields.io was on stage)

6th XVA, Risk, Clearing and Collateral Congress, 14 March, 2019 | Hanover, Germany (Yields.io was on stage)

The 3rd Machine Learning & AI in Quantitative Finance Conference, 20-22 March, 2019 | London, UK (Yields.io was on stage)

Model Risk Management: Pricing and Non-Pricing Models, 28-29 March, 2019 | Toronto, Canada (Yields.io was on stage)

FinTech Horizons, 3 April, 2019 | San Francisco, USA (Yields.io was on stage)

Applied Machine Learning Conference, 11 April, 2019 | Charlottesville, USA

Model Risk Management in Banking, 24-26 April, 2019 | London, UK

QuantMinds International, 13-17 May, 2019 | Vienna, Austria (Yields.io was on stage)

Rise of AI Conference, 16 May, 2019 | Berlin, Germany

ICML 2019: International Conference on Machine Learning, 10-15 June, 2019 | Long Beach, USA

8th Annual Risk EMEA Summit, 11-12 June, 2019 | London, UK

Delivering AI and Big Data for a Smarter Future, 19-20 June, 2019 | Amsterdam, Netherlands

Model Risk Management Europe, 20-21 June, 2019 | London, UK (Yields.io will be on stage)

12th Edition Model Risk, 25-27 June, 2019 | New York, USA

Model Risk Management Summit, 26 June, 2019 | London, UK (Yields.io will be on stage)

ECMLPKDD 2019: European Conference on Machine learning and knowledge discovery in databases, 16-20 September, 2019 | Würzburg, Germany

6th World Machine Learning and Deep Learning Congress, 14-15 October, 2019 | Helsinki, Finland

The 15th Quantitative Finance Conference, 14-15 October, 2019 | Rome, Italy (Yields.io will be on stage)

Women in Quantitative Finance Conference (WQF), 17 October, 2019 | London, UK

WebSummit, 4-7 November 2019 | Lisbon, Portugal

2019 Canadian Risk Forum, 11-13 November, 2019 | Montreal, Canada

6th Edition IFRS 9 Forum, 2019 | London, UK

Follow us on Twitter for real-time updates on the events that we will be attending.

Feel free to email us with more Model Risk, AI and Machine Learning events!

Joana Barata

Prioritizing model validation tasks

Large banks have over 3000 models in production, and all of these algorithms require periodic review (re-validation). Given that an average validation typically takes around 4 – 6 weeks, this is a huge undertaking. Cycling through all models in a sequential manner might therefore not be the best solution because different models carry a different amount of model risk. A better strategy is so-called risk-based prioritization.

In this approach, a validation team organizes the (periodic) review by taking into account the risk associated with each given model. This is most often implemented by using so-called model risk tiering. The tier of a model is determined through qualitative and quantitative measures such as materiality, risk exposure and regulatory impact. Typically, qualitative assessments such as model complexity are fairly constant over time. Quantitative metrics such as the amount of risk that is managed through the model can change quickly since those quantities are dependent on the context in which the model is used. This is the main driver for using quantitative methodologies as part of the model risk tiering.

Frank H. Knight, the University of Chicago Centennial catalogues

When it comes to quantification, there are a few important concepts that have been introduced long ago by Knight. Let’s assume we have a quantity that we would like to model (such as the NPV of a swaption). This quantity has a (finite or infinite) number of potential outcomes. We talk about uncertainty when we do not know the probability of each of the outcomes, while we use the term risk when we do have such knowledge.

When it comes to building models, let us assume that we have a set of candidate models that can be used to describe the behavior of the quantity of interest. When we know the likelihood that a given model is the right one, we talk about model risk, while otherwise we use the concept of model uncertainty. We speak about model ambiguity whenever there are multiple probability measures available on the set of models P.

In case we know the likelihood of different outcomes, we can apply standard management practices like diversification to manage the risk. When a probability distribution however is not known, a typical approach is to work off the worst case and put aside a certain amount of capital as an insurance premium.

In the case of model risk quantification, a fairly standard approach is to use Bayesian analysis. In that case, we start by introducing for every model a prior distribution over the model parameters indicating the likelihood that any particular set of model parameters is correct. As we moreover have multiple models (each with their own set of parameters) we also have to assign a prior distribution indicating how likely it is that a given model is the correct one. This procedure allows for expert input since a model validator can quantify which model, according to her or his experience, is the right one.

As we then observe different outcomes of the quantity that is being modeled, we can compute the posterior probability on each candidate model which is the likelihood that the model is true, after having observed the outcomes x. This allows us to update our knowledge about the various models as we collect more observations. Using these Bayesian probability distributions, we can then compute model dependent quantities by taking the expectation over all available models and integrating over the distribution of the parameters. Concretely, by applying this approach we can compute the expected value of an option as it would be computed through a Black-Scholes formula, Heston model and a local volatility model. The same approach also allows us to compute uncertainty measures on this expected value such as the standard deviation on the NPV of an option which could serve as a measure of model risk.
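The Bayesian procedure described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions. Each candidate model has fixed parameters, so the integration over parameter distributions is left out. Each model predicts the observed quantity with a Gaussian error. All priors, predictions, NPVs and observations are made-up numbers.

```python
import math


def gaussian_likelihood(observations, mean, std):
    """Likelihood of independent observations under a normal distribution."""
    likelihood = 1.0
    for x in observations:
        z = (x - mean) / std
        likelihood *= math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))
    return likelihood


def posterior_weights(priors, likelihoods):
    """Bayes' rule over a discrete set of candidate models."""
    unnormalised = [p * l for p, l in zip(priors, likelihoods)]
    total = sum(unnormalised)
    return [u / total for u in unnormalised]


def model_average(values, weights):
    """Posterior-weighted mean and standard deviation of a model-dependent quantity."""
    mean = sum(w * v for w, v in zip(weights, values))
    variance = sum(w * (v - mean) ** 2 for w, v in zip(weights, values))
    return mean, math.sqrt(variance)


# Hypothetical candidates: (name, prior, predicted mean, predicted std, option NPV).
candidates = [
    ("Black-Scholes", 0.5, 0.20, 0.02, 10.0),
    ("Heston", 0.3, 0.22, 0.02, 10.8),
    ("Local volatility", 0.2, 0.25, 0.02, 11.5),
]
observations = [0.21, 0.23, 0.22]  # made-up outcomes of the modelled quantity

priors = [c[1] for c in candidates]
likelihoods = [gaussian_likelihood(observations, c[2], c[3]) for c in candidates]
posteriors = posterior_weights(priors, likelihoods)

npvs = [c[4] for c in candidates]
bma_npv, npv_std = model_average(npvs, posteriors)
```

Here `bma_npv` plays the role of the model-averaged expected value and `npv_std` that of the uncertainty measure that could serve as a model risk metric. The prior column is where a validator's expert judgement would enter.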

When there are no probability distributions available, we can quantify model uncertainty using the so-called max-min principle. Assume that an agent has several scenarios at his disposal. With the max-min principle, he will choose the scenario that maximizes his expected utility. For the expected utility, we use the worst case over all available models (the minimal outcome). Concretely, if we are long an option, and we have Heston, local volatility and Black-Scholes available, each with say three possible parameter choices, then according to the above, we should value the option by applying the formula (and parameters) that lead to the lowest possible NPV (let us call this NPV[w]). If moreover we have the possibility to sell the option at a price P, then using the max-min principle, we will sell whenever P>NPV[w].
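As a minimal sketch of this decision rule, assume we are long the option and that each of the three models has been evaluated for three parameter choices. The NPVs below are invented for illustration and the function names are hypothetical.

```python
def worst_case_npv(npv_by_model):
    """NPV[w]: the lowest NPV over all models and parameter choices."""
    return min(npv for npvs in npv_by_model.values() for npv in npvs)


def max_min_decision(offered_price, npv_by_model):
    """Pick the action whose worst-case utility is highest."""
    utilities = {
        "hold": worst_case_npv(npv_by_model),  # worst case if we keep the option
        "sell": offered_price,  # selling gives a model-independent outcome
    }
    # On a tie, max returns the first key, so we only sell when P > NPV[w].
    return max(utilities, key=utilities.get)


# Hypothetical NPVs, three parameter choices per model.
npv_by_model = {
    "Black-Scholes": [10.2, 10.5, 10.9],
    "Heston": [9.8, 10.4, 11.1],
    "Local volatility": [10.0, 10.6, 11.3],
}

npv_w = worst_case_npv(npv_by_model)
decision_at_10 = max_min_decision(10.0, npv_by_model)
decision_at_9_5 = max_min_decision(9.5, npv_by_model)
```

Because the result depends on a single extreme value, adding or removing one candidate model or parameter set can change the outcome.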

In conclusion, the need to organize models by their riskiness is driven by the fact that banks want to optimally allocate resources to the right models. Although such model tiering is driven both by qualitative and quantitative features, we have focused on the latter since these aspects tend to vary in a more dynamic fashion. We have highlighted two well-known methods. The Bayesian approach leads to more stable outcomes (due to model averaging) but is hard to implement (since assigning prior distributions is a difficult task). The max-min (i.e. worst case) approach is often encountered in the industry but due to its nature may lead to more volatile outcomes.

Jos Gheerardyn

Generative adversarial networks

Generating realistic data is a challenge that is often encountered in model development, testing and validation. There are many relevant examples. In valuation modelling, we need market data such as interest rate curves and volatility surfaces. These objects have an intricate structure and strong constraints from no-arbitrage conditions. Hence, generating them randomly in a naive fashion is bound to fail. Random curves and surfaces could indeed expose issues with a model but since such configurations are extremely unlikely to occur, the added value of a test like this is rather low.

Another scenario where generation is needed is the case of sparse data. When e.g. studying low default portfolios, by definition there is not a lot of data to train the models on. Hence, proper data generation techniques are extremely important.

One traditional approach, especially in the context of time series, is to perform a principal component analysis on a set of data samples (such as IR curves). Once the principal components are determined, we can then construct the distribution of the strength of every component. Generating a new sample then simply means that we sample from these distributions and reconstruct the data using the components. Such an approach works fine for data sets that are not too large. However, for big datasets, or in case the data does not have a time dimension, we need other techniques.
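A minimal sketch of this procedure, in pure Python and under strong simplifying assumptions: only the first principal component is kept, it is found by power iteration, and the distribution of its strength is assumed to be normal. The "historical" curves are themselves simulated here, purely to have something to fit. All names are illustrative.

```python
import math
import random


def mean_curve(curves):
    n = len(curves)
    return [sum(c[j] for c in curves) / n for j in range(len(curves[0]))]


def covariance(centered):
    """Sample covariance matrix of already-centered curves."""
    n, dim = len(centered), len(centered[0])
    return [
        [sum(row[i] * row[j] for row in centered) / (n - 1) for j in range(dim)]
        for i in range(dim)
    ]


def leading_component(cov, n_iter=200):
    """First principal component via power iteration (unit length)."""
    dim = len(cov)
    v = [1.0 / math.sqrt(dim)] * dim
    for _ in range(n_iter):
        w = [sum(cov[i][j] * v[j] for j in range(dim)) for i in range(dim)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


def fit_one_factor(curves):
    """Estimate the mean curve, the first component and the std of its scores."""
    mu = mean_curve(curves)
    centered = [[c[j] - mu[j] for j in range(len(mu))] for c in curves]
    component = leading_component(covariance(centered))
    scores = [sum(r[j] * component[j] for j in range(len(mu))) for r in centered]
    score_std = math.sqrt(sum(s * s for s in scores) / (len(scores) - 1))
    return mu, component, score_std


def generate_curve(mu, component, score_std, rng):
    """Sample a component strength (assumed normal) and rebuild a curve."""
    score = rng.gauss(0.0, score_std)
    return [m + score * c for m, c in zip(mu, component)]


# Simulated stand-in for historical IR curves: parallel shifts plus small noise.
rng = random.Random(7)
base = [0.010, 0.015, 0.020, 0.025]
history = []
for _ in range(250):
    shift = rng.gauss(0.0, 0.002)
    history.append([b + shift + rng.gauss(0.0, 0.0001) for b in base])

mu, component, score_std = fit_one_factor(history)
new_curve = generate_curve(mu, component, score_std, rng)
```

In practice one would keep several components and use the empirical distribution of each score rather than a fitted normal. The single-factor version only reproduces level shifts.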

Generative Adversarial Networks (see Goodfellow I. et al) are a very powerful technique to build extremely realistic data sets. The algorithm consists of two neural networks. A first network, the generator, creates candidates while the second, the discriminator, attempts to identify whether a candidate originated from the real dataset or was created synthetically by the generator. By repeating this procedure, the generator becomes more and more accurate in creating realistic-looking candidates while the discriminator becomes better at identifying deviations from the real dataset. Once the system is trained, one can use the generator to create very realistic samples.

Apart from generating realistic datasets, GANs have many more applications in finance. To end with one interesting use case, in this paper Hadad et al. use GANs to decompose stock price time series into a market and an idiosyncratic component.

Jos Gheerardyn

Sparse data challenges in IFRS 9

Building mathematical models requires data. Validating algorithms needs an even larger sample set because one has to determine if a model works well when used in real life (on data that the model has never encountered before).

For institutions building or validating credit models, this is often challenging on many fronts, first of all because a credit model requires long historical data spanning multiple economic cycles. Operationally, long timeframes imply that the data has been stored in different systems, with different schemas and conventions. To use the data, one therefore needs proper data governance.

The very nature of credit data also leads to many intricate mathematical problems. A typical example is related to building a model for a low default portfolio. Such a model needs actual defaults to train, and these are by definition sparse, which leads to various problems such as class imbalance. To illustrate class imbalance, let us look at a credit model as a simple classifier, i.e. a model that determines whether or not somebody is going to default in the next quarter. Suppose that our data contains 0.01% defaults. In that case, a model that simply predicts that nobody at all will go bankrupt already has a 99.99% accuracy.
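The arithmetic behind this example is easy to reproduce. The sketch below builds a hypothetical portfolio of 10,000 obligors with a single default (0.01%) and scores the trivial "nobody defaults" classifier. Recall on the default class is added as an assumption of what a more informative metric might look like.

```python
def accuracy(y_true, y_pred):
    """Fraction of correct predictions."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Fraction of actual positives (defaults) that the model identifies."""
    actual = sum(1 for t in y_true if t == positive)
    found = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    return found / actual


n_obligors = 10_000  # hypothetical portfolio size
labels = [1] + [0] * (n_obligors - 1)  # one default, i.e. 0.01%
predictions = [0] * n_obligors  # trivial model: nobody defaults

trivial_accuracy = accuracy(labels, predictions)
trivial_recall = recall(labels, predictions)
```

The trivial model reaches 99.99% accuracy while detecting none of the defaults, which is why accuracy alone says little about a model trained on such data.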

In IFRS 9, the situation is even more challenging. As discussed in a previous post, IFRS 9 recognizes two types of performing credit exposure. Stage 1 exposures have experienced no significant change in credit quality since origination and impairments are based on a one-year expected credit loss (ECL). Stage 2 exposures have experienced significant deterioration and impairments will be based on lifetime ECL, that is, the probability of defaulting during the whole life of the exposure, taking into account current and future macroeconomic conditions. This lifetime ECL requires long datasets. As an example, assume that in order to calibrate our credit model, we would need 10Y of data. If we were building a regression model for a 20Y forward default, then naively we would need at least 30Y of data. This is often hard to come by.

There are several approaches that modelers can take to overcome this sparse data challenge and we list a few below.


The first and most natural approach is to extrapolate to longer time scales. This can be done by e.g. assuming time-homogeneous behavior past the point for which we have enough data (the data horizon). This could lead to keeping the default rate constant or to having a constant Markov chain model to govern rating migration. When validating such assumptions, time homogeneity should be verified statistically.
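As an illustration of the time-homogeneous Markov chain assumption, the sketch below extrapolates a one-year rating migration matrix to longer horizons by repeated multiplication. The three-state matrix (two ratings plus an absorbing default state) is invented for the example.

```python
def mat_mul(a, b):
    """Product of two square matrices stored as lists of rows."""
    size = len(a)
    return [
        [sum(a[i][k] * b[k][j] for k in range(size)) for j in range(size)]
        for i in range(size)
    ]


def mat_pow(matrix, n):
    """n-th power of a square matrix, for a non-negative integer n."""
    size = len(matrix)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, matrix)
    return result


def cumulative_pd(matrix, start_state, years):
    """Probability of having defaulted within `years`, default being the last state."""
    return mat_pow(matrix, years)[start_state][-1]


# Hypothetical one-year migration matrix: rating A, rating B, default (absorbing).
one_year = [
    [0.90, 0.09, 0.01],
    [0.10, 0.80, 0.10],
    [0.00, 0.00, 1.00],
]

# Extrapolate well beyond the one-year horizon for an exposure starting in rating A.
pd_curve = [cumulative_pd(one_year, 0, t) for t in range(1, 31)]
```

The whole 30-year curve follows from a single one-year matrix, which is what makes the time homogeneity assumption so consequential.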

Data augmentation

Another approach to yield more data is so-called data augmentation which is a technique to build more training data by applying transformations on the existing dataset.

Data augmentation is often used in image processing where one applies e.g. mirroring, rotation, scaling and cropping to generate many more (labelled) samples. When dealing with credit data, similar transformations can be applied. As an example, if the dataset would include geographical information, we can e.g. shift the address to generate a new sample. In more general terms, if we discover (or assume) a symmetry in the dataset, we can leverage it to generate many more samples by applying that symmetry transformation on the data.

Another data augmentation technique is to add various levels of noise. This is an interesting technique in its own right to avoid overfitting.

When validating models that rely on data augmentation, we have to verify the assumptions that underpin the data generation itself.
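The noise-based variant can be sketched as follows. The feature names, the noise level and the choice of multiplicative Gaussian noise are assumptions made for the example. Whether the label really is invariant under such perturbations is precisely the kind of assumption that has to be validated.

```python
import random


def augment_with_noise(samples, noise_std, copies, seed=0):
    """Return the original samples plus `copies` noisy variants of each one.

    Each feature x is perturbed with Gaussian noise of standard deviation
    noise_std * |x|; the label is assumed to be unaffected by the perturbation.
    """
    rng = random.Random(seed)
    augmented = []
    for features, label in samples:
        augmented.append((list(features), label))
        for _ in range(copies):
            noisy = [x + rng.gauss(0.0, noise_std * abs(x)) for x in features]
            augmented.append((noisy, label))
    return augmented


# Hypothetical credit samples: ([debt-to-income, loan-to-value], default flag).
samples = [
    ([0.35, 0.80], 0),
    ([0.62, 0.95], 1),
    ([0.20, 0.60], 0),
]

augmented = augment_with_noise(samples, noise_std=0.02, copies=4)
```

With four noisy copies per sample, the three original observations become fifteen training samples carrying the same labels.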

Synthetic data generation

We can generalize the above further by using mathematical models to actually generate the data. This idea is closely related to so-called generative adversarial networks, a topic of active research.

This approach means that in order to test whether our credit model can forecast defaults, we verify that it would work on data that was generated by another model. The generator itself will be parameterized and we can sample various parameter configurations to make sure that the credit model works well on a wide range of datasets.

In order to validate such models, we have to verify that our real data is consistent with the generator.


In this post, we have reviewed a few techniques to deal with sparse data. Every approach has some very specific underlying assumptions that have to be validated carefully in order to guarantee that the model itself can be trusted.

Jos Gheerardyn

The consequences of IFRS9 on Model Risk

IFRS 9 is an international financial reporting standard that was published in July 2014 by the IASB and went live this year. In the present post, we discuss a few key aspects of the framework that are particularly relevant from a model risk management point of view.

Fair value through profit or loss

The new standard has a large impact on the stability of the reported profit and therefore on capital consumption and profitability. This PnL volatility is caused in part by the fact that IFRS 9 relies more heavily on valuation models. Under IFRS 9, a financial instrument has to meet two conditions to be classified as amortized cost: the business model must be “held to collect” contractual cash flows until maturity, and those cash flows must be solely payments of principal and interest (the SPPI criterion). Financial instruments that do not meet both conditions will be classified at fair value, with gains and losses treated as other comprehensive income (FVOCI) or through profit or loss (FVTPL). Hence, for those financial instruments, the value will be constantly adjusted to the current market value. In order to generate stable PnL, financial institutions need to put in place checks to guarantee that the market data feeding the valuation models is of high quality. Moreover, the pricing models themselves have to be rigorously tested in order to avoid unwanted volatility due to e.g. unstable calibration or poor numerical convergence.

The second reason for increased variability of the reported profit relates to changes in the impairment model. IFRS 9 recognizes two types of performing credit exposure. Stage 1 exposures have experienced no significant change in credit quality since origination and impairments are based on a one-year expected credit loss (ECL). Stage 2 exposures have experienced significant deterioration and impairments will be based on lifetime ECL, that is, the loss expected from a default at any time during the whole life of the exposure, taking into account current and future macroeconomic conditions. Although stage 2 obviously implies a higher default risk and therefore a shorter expected lifetime of the exposure, the transition from stage 1 to stage 2 typically introduces a discontinuous jump in the expected credit loss.

Due to the larger credit provisions caused by lifetime expected credit loss for exposures in stage 2, the profitability of transactions for clients with a higher risk of migrating to stage 2 is under pressure. This is why some banks are developing asset-light "originate to distribute" business models, where these products are originated for distribution to third-party investors. This however requires financial institutions to be able to compute the fair value of each individual corporate loan in real time, in order to respond to market opportunities quickly. The valuation models that can be used for this type of pricing require large amounts of data and fairly complicated pricing routines. In order to guarantee the sustained commercial viability of such a business model, it is therefore important to continuously monitor both data and models to avoid mispricing.

Pricing at origination

Financial instruments that migrate to stage 2 require higher credit loss provisions and therefore consume more capital. In order to sustain profitability, banks must take this into account when pricing new transactions. Such a valuation strategy has a few key challenges. First of all, one needs to be able to compute the probability of migrating between the various stages. This can e.g. be done through a Monte Carlo computation where one simulates migration over the lifetime of the portfolio and computes the resulting ECL continuously. In the simulation, this then leads to provisions (and hence an opportunity cost) through time, which are used to compute a cost of origination. This approach leads to smooth pricing behaviour since the simulation weighs all possible scenarios in proportion to their likelihood in a continuous fashion.
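To make the Monte Carlo idea concrete, here is a minimal sketch under strong simplifying assumptions. The annual transition probabilities, the loss given default (`LGD`), the constant exposure (`EAD`) and the `funding_rate` are all hypothetical numbers chosen for illustration, and migration is modelled as a simple Markov chain; a production implementation would be considerably richer.

```python
import random

# Hypothetical annual transition probabilities between stage 1, stage 2 and default.
TRANSITIONS = {
    "stage1": {"stage1": 0.90, "stage2": 0.09, "default": 0.01},
    "stage2": {"stage1": 0.20, "stage2": 0.70, "default": 0.10},
    "default": {"default": 1.0},
}
LGD = 0.4    # loss given default (assumed)
EAD = 100.0  # exposure at default (assumed constant)


def default_probability(stage, horizon):
    """Probability of reaching default within `horizon` years, starting from `stage`."""
    dist = {stage: 1.0}
    for _ in range(horizon):
        nxt = {}
        for state, p in dist.items():
            for target, q in TRANSITIONS[state].items():
                nxt[target] = nxt.get(target, 0.0) + p * q
        dist = nxt
    return dist.get("default", 0.0)


def provision(stage, years_left):
    """One-year ECL in stage 1, lifetime ECL in stage 2 (no provision after default here)."""
    if stage == "default" or years_left == 0:
        return 0.0
    horizon = 1 if stage == "stage1" else years_left
    return default_probability(stage, horizon) * LGD * EAD


def simulate_provisions(maturity, n_paths, seed=0):
    """Average provision per year over simulated migration paths."""
    rng = random.Random(seed)
    totals = [0.0] * maturity
    for _ in range(n_paths):
        stage = "stage1"
        for year in range(maturity):
            totals[year] += provision(stage, maturity - year)
            states = list(TRANSITIONS[stage])
            weights = list(TRANSITIONS[stage].values())
            stage = rng.choices(states, weights=weights)[0]
    return [total / n_paths for total in totals]


def origination_cost(avg_provisions, funding_rate=0.05):
    """Opportunity cost of holding the expected provisions (undiscounted)."""
    return sum(funding_rate * p for p in avg_provisions)


profile = simulate_provisions(maturity=5, n_paths=2000)
cost = origination_cost(profile)
```

Because every path is weighted by its likelihood, the resulting cost varies smoothly with the inputs even though the provision of an individual loan jumps when it moves from stage 1 to stage 2.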

Another challenge related to pricing is the fact that banks have to take into account the current economic environment for the ECL computation. This means that we need to use so-called point-in-time (PIT) estimates for the probability of default as opposed to the more standard through-the-cycle (TTC) estimates that are used in the context of Basel. Creating PIT estimates from the long-range TTC estimates requires the introduction of a seasonal component that can be noisy to estimate. It is however important for banks to fully understand the uncertainty in the PIT estimate because an inaccuracy will lead to unanticipated volatility in the credit loss provisions, which will put pressure on the long-term profitability of the client base.

Early warning systems

Apart from introducing flexible pricing that would allow banks to incorporate the cost of increased PnL volatility, institutions might also introduce early warning systems that detect exposures at a high risk of migrating to stage 2. Such a system however requires a sophisticated algorithm that can not only compute this migration probability accurately but also evaluate the impact of remediating actions with a sufficient degree of certainty. Building such models therefore requires a large historical dataset of high quality.


IFRS 9 has prompted a flurry of activity in mathematical modelling, first of all because impairments are accounted for differently. However, financial institutions have quickly realized that this new standard impacts profitability, paving the way to new business models, workflows and practices. Many of these new initiatives rely heavily on sophisticated mathematical models. Properly managing the risk associated with these models will give banks a sustainable competitive edge.

Jos Gheerardyn

Fairness and AI

More and more companies are automating processes with the help of ML. This has tremendous advantages since an algorithm can scale in an almost unlimited fashion. Moreover, in case rich datasets are available, it can often detect patterns that humans cannot discover. As a well-known illustration, think about how DeepMind used an ML algorithm to optimize the energy consumption of Google's data centers. When algorithms are used to make important decisions that impact people's lives, such as deciding on medical treatment, granting a loan, or performing risk assessments in parole hearings, it is of paramount importance that the algorithm is fair. Because the models become ever more complicated, this is not easy to assess. As a consequence, the public, legislators and regulators are all aware of this issue; see e.g. a report on algorithmic systems, opportunities and civil rights by the Obama administration and (in Europe) recital 71 of GDPR.

In order to detect bias or unfairness, it is important to come up with a proper (mathematical) definition so that we can measure deviations. Below we describe a few alternatives.

Let us assume we are building a model to determine whether somebody may receive a loan. Typically such a model will use information (so-called attributes) such as your credit history, marital status, education, profession, etc. in order to estimate the probability that you will be able to pay off your debts. The dataset will also include historical data on defaults (i.e. people who were not able to repay their loan). Now, given the above, the financial institution wants to make sure that the algorithm is fair with respect to gender, skin colour, etc. These are called protected attributes.

A traditional solution is to simply remove these protected attributes from the dataset altogether. Of course, when testing such a model, simply changing the value of any of the protected attributes is not going to impact the output of the model. However, through so-called redundant encodings an algorithm might be able to guess the value of the protected attributes from other information. As an example, let us assume we train our credit model on a dataset representing the people of Belgium. The dataset has a protected attribute called language that can take two values (French or Dutch). People whose language attribute equals Dutch live primarily in the Northern part of the country while the French-speaking citizens live in the Southern part. Hence our algorithm can infer the language attribute (statistically) by looking at the address. Therefore, simply removing this attribute from the dataset does not help.
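The effect of a redundant encoding is easy to reproduce on synthetic data. The sketch below assumes a made-up population in which 90% of people live in the region associated with their language; the numbers and the field names `region` and `language` are purely illustrative.

```python
import random
from collections import Counter, defaultdict

rng = random.Random(42)

# Synthetic population (hypothetical numbers): language is the protected attribute,
# region is an ordinary feature that happens to correlate strongly with it.
people = []
for _ in range(10_000):
    language = rng.choice(["Dutch", "French"])
    home, other = ("North", "South") if language == "Dutch" else ("South", "North")
    region = home if rng.random() < 0.9 else other
    people.append({"region": region, "language": language})

# "Remove" the protected attribute, then try to guess it back from the region alone.
counts = defaultdict(Counter)
for person in people:
    counts[person["region"]][person["language"]] += 1
guess = {region: c.most_common(1)[0][0] for region, c in counts.items()}

accuracy = sum(guess[p["region"]] == p["language"] for p in people) / len(people)
print(f"language recovered from region with accuracy {accuracy:.2f}")
```

Even this trivial majority-vote rule recovers the removed attribute for the large majority of people, so a flexible ML model can certainly do the same implicitly.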

Another approach is so-called demographic parity. In that case one requires that the membership of a protected attribute is uncorrelated with the output of the algorithm. Let us assume that granting a loan is indicated by a target binary variable Y = 1 (the ground truth) and that the protected (binary) attribute is called A. The forecast of the model is called Z. Demographic parity then means that

Pr[Z=1 | A=0] = Pr[Z=1 | A=1]

This notion however is also flawed. A key issue with the approach is that if the ground truth (i.e. default in our example) does depend on the protected attribute, the perfect predictor (i.e. Y=Z) cannot be reached and therefore the utility and predictive power of the model are reduced. Moreover, by requiring demographic parity, the model has to yield (on average) the same outcome for the different values of the protected attribute. In our example, demographic parity would imply that the model would have to refuse good candidates from one category and accept bad candidates from the other category in order to reach the same average level. Concretely, assuming that the Dutch-speaking people historically have defaulted on 3% of their loans while the French speakers only on 1%, demographic parity would typically be to the disadvantage of the Walloons.
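Measuring the deviation from demographic parity on a sample is straightforward. The following sketch estimates both conditional probabilities from paired lists of forecasts Z and protected attributes A; the function names and the toy data are our own choices for illustration.

```python
def positive_rate(z, a, group):
    """Estimate Pr[Z = 1 | A = group] from paired lists of forecasts and attributes."""
    selected = [zi for zi, ai in zip(z, a) if ai == group]
    return sum(selected) / len(selected)


def demographic_parity_gap(z, a):
    """Absolute difference between the positive rates of the two groups (0 means parity)."""
    return abs(positive_rate(z, a, 0) - positive_rate(z, a, 1))


# Toy forecasts Z for eight applicants, four in each group A.
z = [1, 1, 1, 0, 1, 0, 0, 0]
a = [0, 0, 0, 0, 1, 1, 1, 1]
gap = demographic_parity_gap(z, a)  # 3/4 of group 0 accepted versus 1/4 of group 1
```

A gap of zero corresponds to exact demographic parity; in practice one would set a tolerance that accounts for sampling noise.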

A more subtle suggestion for fairness was proposed by Hardt et al:

Pr[Z=1 | Y=y, A=0] = Pr[Z=1 | Y=y, A=1]

for y=0,1. In other words, relatively speaking the model has to be right (or wrong) as often for either value of the protected attribute. This definition incentivizes more accurate models and the property is called oblivious as it depends only on the joint distribution of A, Y and Z.
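This criterion can be checked on a labelled sample by comparing, for each outcome y, the rate of positive forecasts in both groups. The sketch below uses our own function names and a toy dataset; it is an illustration of the definition, not a reference implementation.

```python
def conditional_rate(z, y, a, outcome, group):
    """Estimate Pr[Z = 1 | Y = outcome, A = group]."""
    selected = [zi for zi, yi, ai in zip(z, y, a) if yi == outcome and ai == group]
    return sum(selected) / len(selected)


def equalized_odds_gaps(z, y, a):
    """Gap between the two groups for y = 0 (false positives) and y = 1 (true positives)."""
    return {
        outcome: abs(
            conditional_rate(z, y, a, outcome, 0) - conditional_rate(z, y, a, outcome, 1)
        )
        for outcome in (0, 1)
    }


# Toy sample: the model is perfect for group 0 and no better than a coin flip for group 1.
y = [1, 1, 0, 0, 1, 1, 0, 0]
a = [0, 0, 0, 0, 1, 1, 1, 1]
z = [1, 1, 0, 0, 1, 0, 1, 0]
gaps = equalized_odds_gaps(z, y, a)
```

Both gaps vanish for a perfect predictor (Z = Y), which is why this definition, unlike demographic parity, does not penalize accuracy.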

As can be seen from the above, mastering fairness in ML/AI is crucial. This is why many tech giants such as Google, Facebook, Microsoft and IBM have created initiatives to tackle this problem. However, given that bias in algorithms is just one instance of model risk, we believe that the best approach is to separate concerns and hand off bias detection to entities that are independent from the model developers. In other words, we suggest leveraging the separation between the first and second line of defence to tackle bias efficiently.

Jos Gheerardyn

AI and model validation

Model validation is a labor-intensive profession that requires specialists who understand both quantitative finance and business practices. Validating a single valuation model typically requires between 3 and 6 weeks of hard work focusing on many different topics. In this first article, I would like to highlight a few ideas on how to automate such analyses.

see Global Model Practice Survey 2014 by Deloitte


Models need data. Hence, a large portion of time in model validation is spent on managing data. To give a few examples, the validator has to make sure that the input data is of good quality, that the test data is sufficiently rich and that there are processes in place to deal with data issues. Broadly speaking, input data exists in two flavours: time series and multi-dimensional data that is not indexed primarily by time.

Time series

Time series are omnipresent in finance. Valuation models typically use quotes (standardized prices of standardized contracts) as input data. Since the market changes, time series of quotes are abundant. To determine the quality of quotes one can run anomaly detection models on historical data (i.e. on time series). Many such algorithms exist. Simple ones use a time series decomposition like STL to subtract the predictable part and study the distribution of the resulting residual to determine the likelihood of an outlier. More modern alternatives use machine learning algorithms. In one approach, a model is trained to detect anomalies by using labeled datasets (such as the ODDS database). Another family of solutions uses an ML model to forecast the time series. In that method, an anomaly is flagged when the prediction differs significantly from the realization.
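The decomposition approach can be sketched in a few lines. Note that the example below does not implement STL: as a simplified stand-in, it uses a centered rolling median as the predictable part and a robust scale estimate (the median absolute deviation) for the residual. The window size, the threshold and the synthetic quotes are assumptions made for illustration.

```python
import random
import statistics


def rolling_median(series, window):
    """Centered rolling median, used here as a crude estimate of the predictable part."""
    half = window // 2
    return [
        statistics.median(series[max(0, i - half): i + half + 1])
        for i in range(len(series))
    ]


def find_anomalies(series, window=11, threshold=5.0):
    """Indices whose residual lies more than `threshold` robust deviations from the center."""
    trend = rolling_median(series, window)
    residuals = [x - t for x, t in zip(series, trend)]
    center = statistics.median(residuals)
    mad = statistics.median([abs(r - center) for r in residuals])
    scale = 1.4826 * mad  # makes the MAD comparable to a standard deviation for Gaussian noise
    return [i for i, r in enumerate(residuals) if abs(r - center) > threshold * scale]


# Synthetic quote history: slow drift plus noise, with one bad quote injected at index 120.
rng = random.Random(1)
quotes = [100 + 0.05 * t + rng.gauss(0, 0.2) for t in range(200)]
quotes[120] += 5.0
anomalies = find_anomalies(quotes)
```

The injected bad quote at index 120 is among the flagged points. For real quotes with seasonality, the rolling median would be replaced by a proper decomposition such as STL.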

Data representativeness

In some cases, the time component is less (or not) important. When building regression or classification models, one often uses a large collection of samples (not indexed by time) with many features. In that case, due to the curse of dimensionality, one would need an astronomical amount of data to cover every possibility. Most often, the data is not spread homogeneously over feature space but instead is grouped into clusters of similar-looking samples. When building a model, it is important to make sure that the data for which the model is going to be used is similar to the data on which the model has been trained. Hence, one needs to understand whether a new data point is close to existing data. There are many clustering algorithms (e.g. DBSCAN) that can perform this analysis generically.
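A minimal version of such a closeness check is shown below. It is not a full DBSCAN implementation; it only borrows DBSCAN's density criterion (at least `min_samples` training points within a radius `eps`) to decide whether a new point lies in a region covered by the training data. The parameter values and the two synthetic clusters are assumptions for the example.

```python
import math
import random


def neighbours_within(point, data, eps):
    """Number of training samples within Euclidean distance `eps` of `point`."""
    return sum(math.dist(point, sample) <= eps for sample in data)


def is_represented(point, data, eps=0.5, min_samples=5):
    """True if `point` sits in a region that the training data covers densely enough."""
    return neighbours_within(point, data, eps) >= min_samples


# Synthetic training data: two tight clusters in a two-dimensional feature space.
rng = random.Random(3)
training = [(rng.gauss(0, 0.3), rng.gauss(0, 0.3)) for _ in range(200)]
training += [(rng.gauss(5, 0.3), rng.gauss(5, 0.3)) for _ in range(200)]

inside = is_represented((0.1, -0.2), training)   # close to the first cluster
outside = is_represented((2.5, 2.5), training)   # in the empty region between clusters
```

A model evaluated at the second point would be extrapolating, which is exactly the situation a validator wants to detect before the forecast is used.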

Model dependencies

A model inventory is a necessary tool in a model governance process. On top of keeping track of all the models used, it is extremely valuable to also store the dependencies between models. For instance, when computing VaR one needs the historical PnL of all portfolios. If these portfolios contain derivatives, we need models to compute their net present value. All the models that are used within an enterprise can be represented as a graph where the nodes are the models while the edges represent data. Understanding the topology makes it easier to trace back model issues.
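Tracing issues through such a graph boils down to a standard graph traversal. The inventory below is hypothetical (the model names are invented for the VaR example), and the dependency structure is stored as a plain dictionary for simplicity.

```python
from collections import deque

# Hypothetical inventory: each model maps to the models whose output it consumes.
DEPENDS_ON = {
    "VaR": ["portfolio PnL"],
    "portfolio PnL": ["swap pricer", "option pricer"],
    "swap pricer": ["yield curve"],
    "option pricer": ["yield curve", "volatility surface"],
    "yield curve": [],
    "volatility surface": [],
}


def upstream(model, graph):
    """All models that `model` depends on, directly or indirectly (breadth-first)."""
    seen, queue = [], deque(graph[model])
    while queue:
        current = queue.popleft()
        if current not in seen:
            seen.append(current)
            queue.extend(graph[current])
    return seen


def downstream(model, graph):
    """All models that are affected when `model` has an issue."""
    return [m for m in graph if model in upstream(m, graph)]


suspects = upstream("VaR", DEPENDS_ON)           # where to look when VaR misbehaves
impacted = downstream("yield curve", DEPENDS_ON)  # what to re-check after a curve issue
```

The first query supports root-cause analysis, the second supports impact analysis when a model deep in the chain is found to be faulty.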

Figure: model dependency graph


ML can help with benchmarking models. First of all, in the case of forecasting models, we can easily train alternative algorithms on the realizations. This will help you to understand the impact of changing the underlying assumptions of the model.

However, even in case such realizations are not available, we can still train a model to mimic the algorithm that is used in production. We can use such surrogate models (see e.g. this paper from our colleagues at IMEC) to detect changes in the behaviour. As an example, suppose we are monitoring an XVA model. We can train an ML algorithm to predict the change in the XVA amounts of a portfolio when the market data changes, i.e. to forecast the change in PnL from the change in market data. Such a model can be used to detect e.g. instabilities in the XVA computation. An additional benefit of this approach is that one can estimate sensitivities from the calibrated model.
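The surrogate idea can be illustrated with the simplest possible learner, a linear regression on a single market factor. Everything below is a sketch under stated assumptions: `production_xva_change` is a made-up stand-in for the real (black box) XVA engine, the data is synthetic, and the tolerance of five residual standard deviations is an arbitrary choice.

```python
import random
import statistics

rng = random.Random(7)


def production_xva_change(rate_move):
    """Stand-in for the production XVA engine (hypothetical); in reality a black box."""
    return 2.0 * rate_move + rng.gauss(0, 0.05)


def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    return my - slope * mx, slope


# Training history: daily moves of one market factor and the resulting XVA changes.
moves = [rng.gauss(0, 1) for _ in range(250)]
xva_changes = [production_xva_change(m) for m in moves]

intercept, slope = fit_line(moves, xva_changes)
residuals = [y - (intercept + slope * x) for x, y in zip(moves, xva_changes)]
tolerance = 5 * statistics.stdev(residuals)


def is_suspicious(rate_move, xva_change):
    """Flag a production result that the surrogate cannot explain."""
    return abs(xva_change - (intercept + slope * rate_move)) > tolerance
```

Here the fitted `slope` doubles as an estimate of the sensitivity of XVA to the market factor, and `is_suspicious(0.5, 3.0)` flags a result that is far from what the surrogate expects for that market move.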

Jos Gheerardyn
