Model Risk and Data – A Symbiosis

The need for data validation

Data quality affects model output in complex and often deceptive ways. The proverb “garbage in, garbage out” summarizes this, albeit in a rather trivializing fashion. Wrong data may produce wrong results, but that is only the most straightforward case: poor data frequently produces incorrect results that are not obviously garbage, which makes its effects hard to detect. In addition, just as with ordinary software bugs, the effects of poor data often only materialize far from the source. Since models can be chained, with the output of model X becoming the input of model Y (e.g. in market risk models), these effects propagate and the root cause becomes very difficult to identify. Moreover, models often depend on their data in a highly non-linear fashion, so that even minor data errors can produce arbitrarily serious problems.

This is of course not new, and it is the main reason why model risk management frameworks put so much emphasis on the analysis of the data pipeline. Indeed, as part of any model validation, it is mandatory to validate the data that was used to develop the model, review the data that flows through the algorithm, and verify the controls that are in place to guarantee that the data is clean. More often than not, this is the most complicated part of a model validation. To take a common example, a credit model that forecasts the probability of default of a retail client is typically a logistic regression. The complexity is not in the regression algorithm itself but in the numerous table merges, data transformations, cleaning procedures and feature selection steps used to build the dataset that is fed into the final regression, as the sketch below illustrates.
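As a rough sketch (the file names, fields and pipeline steps below are invented for the purpose of the example), the estimator is a one-liner while the bulk of the logic, and therefore of the validation effort, sits in the data preparation:

```python
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

# Hypothetical source tables that must be merged before any modelling happens.
clients = pd.read_csv("clients.csv")      # one numeric row per client, incl. default_flag
payments = pd.read_csv("payments.csv")    # many rows per client

# Most of the effective "model", and of the validation effort, lives here.
features = (
    payments.groupby("client_id")
    .agg(n_late=("days_late", lambda s: (s > 30).sum()),
         avg_balance=("balance", "mean"))
    .reset_index()
    .merge(clients, on="client_id", how="left")
)
X = features.drop(columns=["client_id", "default_flag"])
y = features["default_flag"]

# The regression itself is comparatively simple.
pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),  # cleaning
    ("scale", StandardScaler()),                   # transformation
    ("select", SelectKBest(f_classif, k=5)),       # feature selection
    ("logit", LogisticRegression(max_iter=1000)),  # the final regression
])
pipeline.fit(X, y)
```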

In addition to having clean data, it is equally important to guarantee that the data used for development and training is similar to the data on which the model is actually applied. This is the representativeness problem, and it is challenging for two main reasons. First, with the advent of ML algorithms the data itself has become much more intricate, so the standard tests for representativeness are no longer sufficient. Second, when models are used continuously, the data has to be monitored continuously as well.
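A minimal sketch of such a representativeness check, assuming a single numerical feature and commonly used (but purely indicative) thresholds, compares the development distribution with the data currently being scored:

```python
import numpy as np
from scipy.stats import ks_2samp

def population_stability_index(expected, actual, bins=10):
    """PSI between a development sample and a production sample."""
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    e_counts = np.histogram(np.clip(expected, edges[0], edges[-1]), edges)[0]
    a_counts = np.histogram(np.clip(actual, edges[0], edges[-1]), edges)[0]
    e_frac = np.clip(e_counts / len(expected), 1e-6, None)
    a_frac = np.clip(a_counts / len(actual), 1e-6, None)
    return float(np.sum((a_frac - e_frac) * np.log(a_frac / e_frac)))

# Stand-in data: a feature as observed at development time vs. in production.
dev = np.random.default_rng(0).lognormal(10.0, 0.5, 50_000)
prod = np.random.default_rng(1).lognormal(10.2, 0.6, 5_000)

psi = population_stability_index(dev, prod)
ks = ks_2samp(dev, prod)
print(f"PSI = {psi:.3f} (values above 0.25 are commonly treated as a material shift)")
print(f"Kolmogorov-Smirnov p-value = {ks.pvalue:.2e}")
```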

All of the above problems are obviously not confined to credit models but are present throughout the full model portfolio. Valuation models need high-quality derived market data (such as interest rate curves and volatility surfaces), which in turn requires large collections of clean market data quotes (historically grouped into highly correlated time series). Market risk models are extremely sensitive to PnL vectors, since measures such as a 99% historical VaR are driven by the 1% worst returns, and so on.
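A toy illustration of that sensitivity, with invented PnL numbers, shows how a single corrupted scenario in the tail can move a 99% historical VaR figure:

```python
import numpy as np

rng = np.random.default_rng(42)
pnl = rng.normal(0.0, 1_000_000.0, 500)      # 500 hypothetical daily PnL scenarios

var_99 = -np.quantile(pnl, 0.01)             # 99% historical VaR: the 1% worst losses
pnl_bad = pnl.copy()
pnl_bad[0] = -25_000_000.0                   # a single corrupted quote feeding the tail
var_99_bad = -np.quantile(pnl_bad, 0.01)

print(f"99% VaR with clean data:         {var_99:,.0f}")
print(f"99% VaR with one bad data point: {var_99_bad:,.0f}")
```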

Data transformation

Most financial institutions are now engaged in data transformation programs. In addition to the clear need to solve the data quality issues mentioned in the introduction (for which there is also regulatory pressure¹), these programs are driven by the need to reshape business models by harnessing the immense potential of data.

As was noted in a recent paper by McKinsey²: “Successful data transformations can yield enormous benefits both through major cost savings totalling 100s of millions as well as by leading to additional revenue streams that often sum to a multiple of the cost savings. Yet many other organizations are struggling to capture real value from their data programs, with some seeing negative returns from investments totaling hundreds of millions of dollars.”

To maximize the chances of a successful outcome for these large projects, it is crucial to quantify the added value of the data from the start, since this is the most natural way to prioritize the multitude of use cases that such data transformation programs have to cover. The only way to estimate that impact accurately is to leverage the model risk management framework. First, a good framework requires validators to estimate the impact of data issues³, which allows managers to compute the cost of poor data. Second, a thorough model risk management framework requires businesses to document new modeling approaches from the start, which makes it possible to compute the added value of new data initiatives in a scientific way instead of relying on undocumented hand-waving arguments that lead to less accurate decision making.
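One simple way to put a number on such an impact is to reproduce the data issue artificially and re-measure the model. The sketch below does this on synthetic data; the dataset, the model and the 20% corruption rate are assumptions made purely for illustration:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Stand-in model and data.
X, y = make_classification(n_samples=20_000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
auc_clean = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])

# Reproduce the data issue: here, one field silently defaulted to zero in 20% of records.
X_bad = X_te.copy()
mask = np.random.default_rng(0).random(len(X_bad)) < 0.20
X_bad[mask, 3] = 0.0
auc_bad = roc_auc_score(y_te, model.predict_proba(X_bad)[:, 1])

print(f"AUC on clean data:       {auc_clean:.3f}")
print(f"AUC with the data issue: {auc_bad:.3f}")
# The performance gap can then be translated into expected losses or capital impact.
```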

Our solution

Data quality problems do not only stem from technological issues; very often, the majority of problems are caused by human error, such as creating multiple diverging versions of the same dataset. As a consequence, data transformation projects require workflow support (to minimize human error) as well as quantitative tools (to detect issues).

Our Yields platform was created with these challenges in mind. First of all, our data lake allows model validators to centralize all data relevant to their validation tasks. Users can add metadata such as a data dictionary to browse the datasets, attach technical as well as business schemas to describe and validate incoming data, version datasets, and so on.
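As an indicative example of what a technical schema check amounts to (the column names and rules below are made up, and the snippet is a generic illustration rather than the platform's actual API), incoming files can be screened automatically before they reach any model:

```python
import pandas as pd

# Illustrative technical schema for an incoming exposures file.
SCHEMA = {
    "client_id": {"dtype": "int64", "nullable": False},
    "exposure":  {"dtype": "float64", "nullable": False, "min": 0.0},
    "rating":    {"dtype": "object", "nullable": True,
                  "allowed": {"AAA", "AA", "A", "BBB", "BB", "B", "CCC"}},
}

def validate(df: pd.DataFrame) -> list:
    """Return a list of human-readable schema violations."""
    issues = []
    for col, rules in SCHEMA.items():
        if col not in df.columns:
            issues.append(f"missing column: {col}")
            continue
        if str(df[col].dtype) != rules["dtype"]:
            issues.append(f"{col}: expected {rules['dtype']}, got {df[col].dtype}")
        if not rules["nullable"] and df[col].isna().any():
            issues.append(f"{col}: contains nulls")
        if "min" in rules and (df[col].dropna() < rules["min"]).any():
            issues.append(f"{col}: values below {rules['min']}")
        if "allowed" in rules and not set(df[col].dropna()).issubset(rules["allowed"]):
            issues.append(f"{col}: unexpected categories")
    return issues

print(validate(pd.read_csv("incoming_exposures.csv")))
```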

In addition, we have developed various machine learning techniques to verify data quality, allowing analysts to automatically detect atypical data (such as anomalies) and to identify changes in the distributions of the data that is fed to the models. These techniques can run in an unsupervised manner, meaning that no labelled datasets (which would be very time-consuming to create and maintain) are needed.
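The sketch below shows the general idea behind such unsupervised checks, using an Isolation Forest on synthetic data; it is a generic illustration rather than the exact technique running in the platform:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
normal = rng.normal(loc=[100.0, 0.02], scale=[10.0, 0.005], size=(5_000, 2))
glitch = np.array([[100.0, 0.5]])            # e.g. a rate stored as 50% instead of 0.5%
data = np.vstack([normal, glitch])

detector = IsolationForest(contamination=0.001, random_state=0).fit(data)
labels = detector.predict(data)              # -1 marks suspected anomalies
print("flagged rows:", np.where(labels == -1)[0])
```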

Finally, these tools allow users to segment the data in various ways, for example to analyze how models perform on more extreme datasets. Together with our benchmarking capabilities, these techniques can be used to quantify the impact of poor data.
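As a simple illustration of segment-level analysis (with synthetic data and an arbitrary definition of the extreme slice), one can re-compute a performance metric on the largest exposures and compare it to the portfolio-wide figure:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
exposure = rng.lognormal(12.0, 1.5, 10_000)          # stand-in for exposure at default
y_true = rng.binomial(1, 0.05, 10_000)               # observed defaults
y_score = np.clip(0.05 + 0.30 * y_true + rng.normal(0.0, 0.15, 10_000), 0.0, 1.0)

tail = exposure > np.quantile(exposure, 0.95)        # the 5% largest exposures
print("AUC overall:         ", round(roc_auc_score(y_true, y_score), 3))
print("AUC on extreme slice:", round(roc_auc_score(y_true[tail], y_score[tail]), 3))
```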

1 The most well-known framework in this context is of course BCBS 239 – see https://www.bis.org/publ/bcbs239.pdf
2 See https://www.mckinsey.com/industries/financial-services/our-insights/designing-a-data-transformation-that-delivers-value-right-from-the-start
3 A concrete approach can be found in a recent study from SNS Bank – see https://essay.utwente.nl/69486/1/Thesis%20Rolf%20de%20Jong%20-%20Public%20version.pdf

Conclusion

Poor data is a massive source of risk. By addressing this problem from within the context of model risk, the added value can be captured immediately, and this early detection of value has been a key driver in many successful data transformation projects. As a consequence, such a strategy leads both to additional revenue (better models, faster time to market) and to cost reductions (faster validation and more efficient model development).

Data transformation projects that are grounded in a good model risk management process bring value from the start.

About the author

Jos Gheerardyn has built the first FinTech platform that uses AI for real-time model testing and validation on an enterprise-wide scale. A zealous proponent of model risk governance & strategy, Jos is on a mission to empower quants, risk managers and model validators with smarter tools to turn model risk into a business driver. Prior to his current role, he was active in quantitative finance both as a manager and as an analyst.
