Building mathematical models requires data. Validating them requires an even larger sample set, because one has to determine whether a model works well when used in real life, that is, on data that the model has never encountered before.

For institutions building or validating credit models, this is often challenging on many fronts. First of all, a credit model requires long historical data series spanning multiple economic cycles. Operationally, long timeframes imply that the data has been stored in different systems, with different schemas and conventions. To use the data, one therefore needs proper data governance.

The very nature of credit data also leads to many intricate mathematical problems. A typical example is building a model for a low-default portfolio. Such a model needs actual defaults to train on, and these are by definition sparse, which leads to problems such as class imbalance. To illustrate class imbalance, let us look at a credit model as a simple classifier, i.e. a model that determines whether or not somebody is going to default in the next quarter. Suppose that our data contains 0.01% defaults. In that case, a model that simply predicts that nobody at all will default already has 99.99% accuracy, even though it never identifies a single default.
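The accuracy figure above can be reproduced in a few lines. This is a minimal sketch on a hypothetical portfolio of one million obligors (the portfolio size is an assumption made purely for illustration); it also reports recall, the share of actual defaults that the model catches, to show why accuracy alone is misleading here.

```python
# Hypothetical portfolio: 1,000,000 obligors with a 0.01% default rate.
n_obligors = 1_000_000
n_defaults = 100  # 0.01% of the portfolio

# Labels: 1 = default, 0 = no default.
labels = [1] * n_defaults + [0] * (n_obligors - n_defaults)

# A "model" that predicts that nobody defaults.
predictions = [0] * n_obligors

# Accuracy: share of obligors that are classified correctly.
correct = sum(p == y for p, y in zip(predictions, labels))
accuracy = correct / n_obligors

# Recall: share of actual defaults that the model identifies.
true_positives = sum(p == 1 and y == 1 for p, y in zip(predictions, labels))
recall = true_positives / n_defaults

print(f"accuracy = {accuracy:.4%}")  # accuracy = 99.9900%
print(f"recall   = {recall:.0%}")    # recall   = 0%
```

For this reason, imbalanced problems are usually assessed with metrics that focus on the rare class, such as recall or rank-based measures, rather than with raw accuracy.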

In IFRS 9, the situation is even more challenging. As discussed in a previous post, IFRS 9 recognizes two types of performing credit exposure. Stage 1 exposures have experienced no significant change in credit quality since origination, and impairments are based on a one-year expected credit loss (ECL). Stage 2 exposures have experienced significant deterioration, and impairments are based on lifetime ECL, which depends on the probability of defaulting during the whole life of the exposure, taking into account current and future macroeconomic conditions. Estimating this lifetime ECL requires long datasets. As an example, assume that we need 10 years of data to calibrate our credit model. If we were building a regression model for default over a 20-year horizon, then naively we would need at least 30 years of data. This is often hard to come by.

There are several approaches that modelers can take to overcome this sparse data challenge. We list a few below.

## Extrapolation

The first and most natural approach is to extrapolate to longer time scales. This can be done by, for example, assuming time-homogeneous behavior beyond the point up to which we have enough data (the data horizon). In practice, this could mean keeping the default rate constant or letting a constant Markov chain govern rating migration. When validating such assumptions, time homogeneity should be verified statistically.
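As an illustration of the Markov chain variant, the sketch below extrapolates a one-year migration matrix to longer horizons by taking matrix powers. The three rating states and all transition probabilities are hypothetical and chosen only for illustration; the key assumption, labeled in the code, is that the same one-year matrix applies in every future year.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(p, years):
    """Return p raised to the power `years` (repeated multiplication)."""
    n = len(p)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(years):
        result = mat_mul(result, p)
    return result


# Hypothetical one-year migration matrix. States: A, B, Default.
# Default is absorbing: once in default, an obligor stays there.
P = [
    [0.95, 0.04, 0.01],
    [0.10, 0.80, 0.10],
    [0.00, 0.00, 1.00],
]


def cumulative_pd(p, start_state, years):
    """Probability of having defaulted within `years`, ASSUMING the same
    one-year matrix applies every year (time homogeneity)."""
    return mat_pow(p, years)[start_state][-1]


for horizon in (1, 2, 5, 20):
    print(horizon, round(cumulative_pd(P, 0, horizon), 4))
```

With these numbers, the two-year default probability for an A-rated obligor is 0.95 × 0.01 + 0.04 × 0.10 + 0.01 × 1 = 0.0235. The 20-year figure rests entirely on the homogeneity assumption, which is exactly what a validator would need to challenge.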

## Data augmentation

Another approach to obtaining more data is data augmentation, a technique that builds more training data by applying transformations to the existing dataset.

Data augmentation is often used in image processing, where one applies transformations such as mirroring, rotation, scaling and cropping to generate many more (labeled) samples. When dealing with credit data, similar transformations can be applied. As an example, if the dataset includes geographical information, we can shift the address to generate a new sample. In more general terms, if we discover (or assume) a symmetry in the dataset, we can leverage it to generate many more samples by applying that symmetry transformation to the data.

Another data augmentation technique is to add various levels of noise. This is an interesting technique in its own right, as it helps to avoid overfitting.
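A minimal sketch of noise-based augmentation is shown below. The function name `augment_with_noise`, the two features and the noise level are all hypothetical. The sketch assumes that a small relative perturbation of the features does not change the default label, which is itself an assumption that needs to be checked.

```python
import random


def augment_with_noise(samples, n_copies, rel_noise, seed=0):
    """Return the original samples plus `n_copies` noisy copies of each.

    Each sample is a (features, label) pair. Gaussian noise with standard
    deviation rel_noise * |value| is added to every feature. Labels are
    copied unchanged (ASSUMPTION: small perturbations keep the label valid).
    """
    rng = random.Random(seed)
    augmented = list(samples)
    for features, label in samples:
        for _ in range(n_copies):
            noisy = [x + rng.gauss(0.0, rel_noise * abs(x)) for x in features]
            augmented.append((noisy, label))
    return augmented


# Hypothetical features: (loan-to-value ratio, debt-to-income ratio).
# Label 1 = default.
defaults = [([0.92, 0.55], 1), ([0.88, 0.61], 1)]

augmented = augment_with_noise(defaults, n_copies=5, rel_noise=0.02)
print(len(augmented))  # 2 originals + 2 * 5 noisy copies = 12
```

The noise level acts as a tuning parameter: too little noise adds near-duplicates, while too much produces samples whose labels may no longer be plausible.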

When validating models that rely on data augmentation, we have to verify the assumptions that underpin the data generation itself.

## Synthetic data generation

We can generalize the above further by using mathematical models to actually generate the data. This idea is closely related to generative adversarial networks, a topic of active research.

With this approach, we test whether our credit model can forecast defaults by verifying that it works on data generated by another model. The generator itself is parameterized, and we can sample various parameter configurations to make sure that the credit model works well on a wide range of datasets.
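The idea can be sketched as follows. Everything here is an illustrative assumption rather than a prescribed method: the generator is a simple logistic model with a single risk driver, the "credit model" is a stand-in that ranks obligors by that driver, and performance is measured by the AUC (the probability that a random defaulter is scored above a random non-defaulter).

```python
import math
import random


def generate_portfolio(n, intercept, slope, rng):
    """Hypothetical generator: one risk driver x ~ N(0, 1) and
    P(default | x) = sigmoid(intercept + slope * x)."""
    data = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        p_default = 1.0 / (1.0 + math.exp(-(intercept + slope * x)))
        data.append((x, 1 if rng.random() < p_default else 0))
    return data


def credit_model_score(x):
    """Stand-in for the credit model under test: higher score = riskier."""
    return x


def auc(data, score):
    """Share of (defaulter, non-defaulter) pairs ranked correctly,
    counting ties as half."""
    bad = [score(x) for x, y in data if y == 1]
    good = [score(x) for x, y in data if y == 0]
    if not bad or not good:
        return None  # undefined when one class is missing
    wins = sum((b > g) + 0.5 * (b == g) for b in bad for g in good)
    return wins / (len(bad) * len(good))


# Evaluate the model over a grid of generator parameter configurations.
rng = random.Random(42)
results = {}
for intercept in (-4.0, -3.0):
    for slope in (0.5, 1.0, 2.0):
        portfolio = generate_portfolio(2000, intercept, slope, rng)
        results[(intercept, slope)] = auc(portfolio, credit_model_score)

for config, value in results.items():
    print(config, value)
```

A model that performs well only for a narrow band of generator parameters is a warning sign, since the real data-generating process is never known exactly.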

In order to validate such models, we have to verify that our real data is consistent with the generator.
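One simple consistency check, among many possible ones, is to compare the number of defaults observed in the real data with what the generator would produce. The sketch below assumes that the generator implies independent defaults with a common default probability, so that the default count is binomial; the portfolio size, default rate and observed counts are hypothetical.

```python
import math


def binomial_log_pmf(n, k, p):
    """Log of the Binomial(n, p) probability of exactly k defaults."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log(1.0 - p))


def two_sided_p_value(n, k, p):
    """Exact two-sided p-value: total probability of all outcomes that are
    at most as likely as the observed count k under Binomial(n, p)."""
    probs = [math.exp(binomial_log_pmf(n, i, p)) for i in range(n + 1)]
    threshold = probs[k] * (1.0 + 1e-9)  # tolerance for floating-point ties
    return sum(q for q in probs if q <= threshold)


# Hypothetical setting: the generator implies a 1% default rate,
# and the real portfolio contains 2,000 obligors.
n_obligors, generator_pd = 2000, 0.01

for observed_defaults in (22, 35):
    p_value = two_sided_p_value(n_obligors, observed_defaults, generator_pd)
    print(observed_defaults, p_value)
```

A very small p-value suggests that the real data is unlikely to have come from the generator as configured. A check like this covers only one aspect of the data; in practice one would also compare the distributions of the risk drivers and their dependence structure.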

## Conclusion

In this post, we have reviewed a few techniques to deal with sparse data. Every approach rests on specific underlying assumptions that have to be validated carefully before the model itself can be trusted.
