The need for data validation
Data quality impacts model output in complex and deceptive ways. The common proverb “garbage in, garbage out” summarizes this, albeit in a rather trivializing fashion. Wrong data may produce wrong results, but this describes only the most straightforward case. Poor data can also produce incorrect results that are not obviously garbage, which makes its effects surprisingly hard to detect. In addition, just as with more common software bugs, the effects of poor data sometimes only materialize far from their source. Since models can be chained, with the output of model X becoming the input of model Y (e.g. in market risk models), these effects propagate, making the root cause very difficult to identify. Moreover, models often depend on their data in a highly non-linear fashion, so that even minor data errors can produce arbitrarily serious problems.
This is of course not new, and it is the main reason why model risk management frameworks put so much emphasis on the analysis of the data pipeline. Indeed, as part of any model validation, it is mandatory to validate the data that was used to develop the model, review the data that flows through the algorithm, and verify the controls that are in place to guarantee that the data is clean. Moreover, this part is more often than not the most complicated aspect of a model validation. To take a common example, a credit model forecasting the probability of default of a retail client is often a logistic regression. The complexity therefore lies not in the regression algorithm itself but in the numerous table merges, data transformations, cleaning procedures and feature selection algorithms that were used to build the dataset that is fed into the final regression procedure.
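As an illustration, a few of the checks such a pipeline needs (row-count reconciliation after a merge, null rates, plausibility ranges) can be sketched in plain Python. The column names, thresholds and row structure below are purely hypothetical, a minimal sketch rather than a prescribed control framework:

```python
def validate_merged_dataset(rows, required_columns, expected_n_rows):
    """Basic data-quality checks on a merged dataset before it is fed
    into a regression. Thresholds and column names are illustrative."""
    issues = []
    # A wrong row count often betrays duplicate keys or dropped rows in a merge.
    if len(rows) != expected_n_rows:
        issues.append(f"row count {len(rows)} != expected {expected_n_rows}")
    # Completeness: count missing values per required column.
    for col in required_columns:
        missing = sum(1 for r in rows if r.get(col) is None)
        if missing:
            issues.append(f"column '{col}': {missing} missing values")
    # Plausibility: a simple range check on a hypothetical 'age' field.
    for r in rows:
        age = r.get("age")
        if age is not None and not (18 <= age <= 120):
            issues.append(f"client {r.get('client_id')}: implausible age {age}")
    return issues

# Toy merged dataset with two deliberate defects.
clients = [
    {"client_id": 1, "age": 42,   "income": 55_000, "defaulted": 0},
    {"client_id": 2, "age": None, "income": 38_000, "defaulted": 1},
    {"client_id": 3, "age": 250,  "income": 61_000, "defaulted": 0},
]
for issue in validate_merged_dataset(clients, ["age", "income", "defaulted"], 3):
    print(issue)
```

In practice such checks would run automatically at every merge step of the pipeline, so that a defect is flagged where it arises rather than after the regression produces suspicious coefficients.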
In addition to having clean data, it is equally important to guarantee that the data on which the model was developed and trained is similar to the data on which the model is being used. This is called the representativeness problem, and it is challenging for two main reasons. First, with the advent of ML algorithms, the data itself has become much more intricate, which implies that the standard tests for representativeness are no longer sufficient. Second, when models are used continuously, the data has to be monitored in a continuous fashion too.
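One standard monitoring tool for this in credit risk practice (an example of the representativeness tests alluded to above, not a method the text prescribes) is the Population Stability Index, which compares the bucketed distribution of a variable in the training sample against the population the model currently scores. A minimal sketch with synthetic data and an illustrative quantile bucketing:

```python
import math
import random

def psi(expected, actual, n_bins=10):
    """Population Stability Index between a reference sample (e.g. training
    data) and a monitored sample (e.g. production data). Bin edges are the
    quantiles of the reference sample."""
    exp_sorted = sorted(expected)
    edges = [exp_sorted[int(len(exp_sorted) * i / n_bins)] for i in range(1, n_bins)]

    def bucket_fractions(sample):
        counts = [0] * n_bins
        for x in sample:
            counts[sum(1 for e in edges if x >= e)] += 1
        # Floor at a tiny fraction to avoid log(0) for empty buckets.
        return [max(c / len(sample), 1e-6) for c in counts]

    p = bucket_fractions(expected)
    q = bucket_fractions(actual)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))

random.seed(0)
train        = [random.gauss(0.0, 1.0) for _ in range(5000)]
prod_same    = [random.gauss(0.0, 1.0) for _ in range(5000)]
prod_shifted = [random.gauss(0.5, 1.0) for _ in range(5000)]

print(psi(train, prod_same))     # close to 0: population is stable
print(psi(train, prod_shifted))  # markedly larger: population has drifted
```

Run per feature at a regular frequency, such an index turns the continuous monitoring requirement into a simple dashboard of drift scores; the first challenge mentioned above remains, however, since univariate bucketing does not capture drift in the joint distribution of intricate ML inputs.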
All of the above problems are obviously not confined to credit models but are present throughout the full model portfolio. Indeed, valuation models need high-quality derived market data (such as interest rate curves and volatility surfaces), which in turn requires large collections of clean market data quotes (historically grouped in highly correlated time series). Market risk models are extremely sensitive to their PnL vectors (since they use the 1% worst returns), etc.
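The sensitivity of market risk models to the 1% worst returns can be made concrete with a toy historical-VaR calculation: the risk figure is read off a handful of tail observations, so a single erroneous PnL data point can move it materially. The function and data below are a simplified sketch, not a production methodology:

```python
import random

def historical_var(pnl, level=0.99):
    """Historical Value-at-Risk: the loss exceeded in only (1 - level) of the
    observed PnL scenarios, i.e. a figure driven by the worst tail returns."""
    losses = sorted(-x for x in pnl)           # convert PnL to losses, ascending
    idx = int(level * len(losses))             # cut-off into the worst 1% tail
    return losses[min(idx, len(losses) - 1)]

random.seed(1)
pnl = [random.gauss(0, 1_000_000) for _ in range(500)]  # 500 daily PnL scenarios

var_clean = historical_var(pnl)
# Corrupt a single scenario with an implausibly large loss:
pnl_dirty = pnl[:-1] + [-50_000_000]
var_dirty = historical_var(pnl_dirty)
print(var_clean, var_dirty)  # one bad data point shifts the tail cut-off
```

Because only the five or so worst scenarios out of 500 determine the result, data validation on PnL vectors has to pay particular attention to outliers, which ordinary summary statistics (means, variances) barely register.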