Validating Prediction Models (Based on Prediction Errors)

Machine Learning (ML) and Artificial Intelligence (AI) are integral to most businesses nowadays. Decision-makers in many sectors, e.g., banking and finance, have employed ML algorithms to do their heavy lifting. Though it sounds like a smart move, it is imperative to make sure these models are indeed doing what is expected of them. Like the employees of a bank, models are prone to making mistakes, and there is always a price to pay when using ML models. Hence the need to continuously validate these models.

Model validation is a broad field spanning several notions. In this article, we focus on prediction models. A model may be considered valid if:

  1. it performs well, i.e., based on some mathematical metric such as a small misclassification error in classification models.
  2. it is fair, i.e., not racist, sexist, homophobic, xenophobic, etc. These are cultural or legal aspects of validity.
  3. it is interpretable. Some experts may argue that black-box models are invalid; in such cases, understanding how the model makes predictions is central.

The three validity aspects above are not exhaustive; model validation may mean different things depending on who is validating the model.

In this article, we will discuss model validation from the viewpoint of point 1. Most data scientists, when talking about model validation, will default to point 1. Hereunder, we give more details on model validation based on prediction errors.

Validating prediction models based on errors in prediction

Before making any progress, we introduce some notation:

Y: represents the outcome we want to predict, let’s say something like stock prices on a given day. We will denote the predicted Y with Ŷ.

x: represents the characteristics of the outcome — we will always know x at the time of prediction. For our stock example, x can be a date, open and closing prices for that date and so on.

m: represents a prediction model. In practice, this model will often contain parameters that have to be estimated. Once we estimate these parameters, we denote the model with estimated parameters by m̂; this differentiates m̂ from m, the model with the true parameters.

β: represents the parameters of the model m. As noted above, we estimate β and denote the estimate by β̂.

We calculate predictions as follows:

$$\hat Y(x) = \hat m (x) = x^t\hat \beta $$

and want the prediction error to be as small as possible. The prediction error for a prediction at predictor x is given by

$$\hat Y(x)-Y^{\star}$$

Here Y* is the outcome we want to predict that has x as characteristics. Since a prediction model is typically used to predict an outcome before it is observed, the outcome Y* is unknown at the time of prediction. Hence, the prediction error cannot be computed.
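As a rough illustration of the linear prediction rule Ŷ(x) = x^t β̂ introduced above, the sketch below fits β̂ by least squares on simulated data and predicts at a new x. The data, the "true" β and the helper name `m_hat` are assumptions made for this example only; they are not part of the article's setting.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated training data (hypothetical): n observations, intercept + 2 features.
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta_true = np.array([1.0, 2.0, -0.5])   # assumed for the simulation; unknown in practice
Y = X @ beta_true + rng.normal(scale=0.3, size=n)

# Estimate beta by least squares: beta_hat minimises ||Y - X beta||^2.
beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)

def m_hat(x):
    """Prediction Y_hat(x) = x^t beta_hat for one predictor vector x."""
    return x @ beta_hat

x_new = np.array([1.0, 0.2, -1.0])       # leading 1.0 is the intercept term
y_pred = m_hat(x_new)
print(beta_hat, y_pred)
```

Note that the code never touches an outcome at `x_new`: as in the text, the prediction is available before Y* is, so its error cannot be computed at prediction time.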

Recall that a prediction model is estimated by using data in the training data set (X, Y) and that Y* is an outcome at x which is assumed to be independent of the training data. The idea is that the prediction model is intended for use in predicting a future observation Y*, i.e. an observation that has yet to be realized/observed (otherwise prediction seems rather useless). Hence, Y* can never be part of the training data set.

Here we provide definitions and we show how the prediction performance of a prediction model can be evaluated from data.

Let T= (Y, X) denote the training data, from which the prediction model is built. This building process typically involves feature (characteristic) selection and parameter estimation. Below we define different types of errors used for model validation.

Test (Generalisation or Out-of-Sample) Error

The test or generalization error for a prediction model is given by

$$\text{Err}_T = \text{E}_{(Y^\star,X^\star)}\big\{(\hat m(X^\star)-Y^\star)^2|T\big\}$$

where (Y*, X*) is independent of the training data.

The test error is conditional on the training data T. Hence, the test error evaluates the performance of the single model built from the observed training data. This is the ultimate target of the model assessment because it is exactly this prediction model that will be used in practice and applied to future predictors X* to predict Y*. The test error is defined as an average over all such future observations (Y*, X*). The test error is the most interesting error for model validation according to point 1.
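In a simulation, where fresh data can be generated at will, the test error of one fitted model can be approximated by averaging squared errors over a large independent sample. The sketch below does this under an assumed linear data-generating model; `simulate`, `beta_true` and `sigma` are hypothetical names chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
beta_true = np.array([1.0, 2.0])          # hypothetical data-generating parameters
sigma = 0.5                               # assumed noise standard deviation

def simulate(n):
    """Draw n pairs (X, Y) from the assumed linear model with Gaussian noise."""
    X = np.column_stack([np.ones(n), rng.normal(size=n)])
    Y = X @ beta_true + rng.normal(scale=sigma, size=n)
    return X, Y

# One fixed training set T; the model is fitted once and then frozen.
X_train, Y_train = simulate(30)
beta_hat, *_ = np.linalg.lstsq(X_train, Y_train, rcond=None)

# Fresh (Y*, X*) pairs, independent of T, approximate the expectation in Err_T.
X_star, Y_star = simulate(100_000)
test_error = np.mean((X_star @ beta_hat - Y_star) ** 2)
print(test_error)
```

The key point is that `beta_hat` is never refitted: the average runs over future observations only, which is what conditioning on T means.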

Conditional Test Error in x

The conditional test error in x for a prediction model is given by

$$\text{Err}_T(x) = \text{E}_{Y^\star}\big\{(\hat m(x)-Y^\star)^2|T,x\big\}$$

where Y* is an outcome at predictor x, independent of the training data.

In-sample Error

The in-sample error for a prediction model is given by

$${ \text{Err}}_{inT} = \frac{1}{n}\sum_{i=1}^n{ \text{Err}}_{T}(x_i)$$

i.e., the in-sample error is the average of the conditional test errors evaluated at the n predictors of the training dataset.
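The two definitions above can be made concrete in a simulation, where new outcomes Y* can be drawn at a fixed x. The following sketch estimates the conditional test error at each training predictor by Monte Carlo and averages the results to obtain the in-sample error. The data-generating model and the function name `conditional_test_error` are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
beta_true = np.array([1.0, 2.0])   # hypothetical; unknown in real applications
sigma = 0.5                        # assumed noise standard deviation

n = 30
X = np.column_stack([np.ones(n), rng.normal(size=n)])
Y = X @ beta_true + rng.normal(scale=sigma, size=n)
beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)

def conditional_test_error(x, n_draws=20_000):
    """Monte Carlo estimate of Err_T(x): new outcomes Y* are drawn at fixed x."""
    y_star = x @ beta_true + rng.normal(scale=sigma, size=n_draws)
    return np.mean((x @ beta_hat - y_star) ** 2)

# In-sample error: average of Err_T(x_i) over the n training predictors.
in_sample_error = np.mean([conditional_test_error(x) for x in X])

# Error on the very outcomes used for fitting, printed for comparison.
training_error = np.mean((Y - X @ beta_hat) ** 2)
print(in_sample_error, training_error)
```

With real data the new outcomes at each x_i are not available, which is why the in-sample error has to be estimated rather than computed.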

Estimation of the in-sample error

We start by introducing the training error, which is closely related to the mean squared error in linear models.

Training error

The training error is given by

$$\bar{ \text{err}} = \frac{1}{n}\sum_{i=1}^n(Y_i - \hat m (x_i))^2$$

where the pairs (Y_i, x_i), i = 1, ..., n, form the training dataset, i.e., the same data that were used to fit the model.

  • The training error is an overly optimistic estimate of the test error
  • The training error never increases when the model becomes more complex, so it cannot be used directly as a model selection criterion

Model parameters are often estimated by minimising the training error (cf. mean squared error). Hence the fitted model adapts to the training data, and therefore the training error will be an overly optimistic estimate of the test error.
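The behaviour of the training error under growing model complexity can be checked in a small experiment. The sketch below fits nested polynomial models of increasing degree by least squares and records the training error alongside an error computed on a large independent sample. The quadratic signal, sample sizes and helper names are assumptions for this example.

```python
import numpy as np

rng = np.random.default_rng(3)

def simulate(n):
    """Hypothetical data: a quadratic signal plus Gaussian noise."""
    x = rng.uniform(-1, 1, size=n)
    y = 1.0 + 2.0 * x - 1.5 * x**2 + rng.normal(scale=0.3, size=n)
    return x, y

def design(x, degree):
    """Polynomial design matrix with columns 1, x, ..., x^degree."""
    return np.vander(x, degree + 1, increasing=True)

x_train, y_train = simulate(25)
x_test, y_test = simulate(10_000)      # independent of the training data

train_errors, test_errors = [], []
for degree in range(1, 9):
    beta_hat, *_ = np.linalg.lstsq(design(x_train, degree), y_train, rcond=None)
    train_errors.append(np.mean((y_train - design(x_train, degree) @ beta_hat) ** 2))
    test_errors.append(np.mean((y_test - design(x_test, degree) @ beta_hat) ** 2))

for degree, (tr, te) in enumerate(zip(train_errors, test_errors), start=1):
    print(f"degree {degree}: training error {tr:.4f}, test error {te:.4f}")
```

Because each polynomial model contains the previous one, the least-squares training error cannot go up as the degree grows; the error on the independent sample carries no such guarantee.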

Other estimators of the in-sample error are:

  • The Akaike information criterion (AIC)
  • Bayesian information criterion (BIC)
  • Mallows's Cp
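As an illustration only, the sketch below computes these three criteria for a sequence of polynomial least-squares fits, using one common set of formulas for Gaussian linear models: Cp as the training error plus 2·d·σ̂²/n, and AIC and BIC as n·log(RSS/n) plus a penalty of 2d or d·log(n). Conventions differ between textbooks (additive constants, scaling, whether the variance counts as a parameter), so these exact forms are an assumption, as are the simulated data and the choice to estimate σ̂² from the largest model.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 40
x = rng.uniform(-1, 1, size=n)
y = 1.0 + 2.0 * x + rng.normal(scale=0.3, size=n)   # hypothetical linear truth

def fit_rss(degree):
    """Residual sum of squares and parameter count for a polynomial fit."""
    X = np.vander(x, degree + 1, increasing=True)
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.sum((y - X @ beta_hat) ** 2), degree + 1

max_degree = 6
rss_full, d_full = fit_rss(max_degree)
sigma2_hat = rss_full / (n - d_full)   # noise variance estimated from the largest model

criteria = {}
for degree in range(1, max_degree + 1):
    rss, d = fit_rss(degree)
    criteria[degree] = {
        "Cp": rss / n + 2 * d * sigma2_hat / n,     # training error + complexity penalty
        "AIC": n * np.log(rss / n) + 2 * d,         # up to an additive constant
        "BIC": n * np.log(rss / n) + d * np.log(n), # heavier penalty once log(n) > 2
    }

best_by_bic = min(criteria, key=lambda deg: criteria[deg]["BIC"])
print(criteria, best_by_bic)
```

All three add a penalty that grows with the number of parameters d to a term driven by the training error, which is how they correct for its optimism.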

Expected Prediction Error and Cross-Validation

The test or generalization error was defined conditionally on the training data. By averaging over the distribution of training datasets, the expected test error arises.

Expected Test Error

\begin{align} \text{E}_T\text{Err}_T &= \text{E}_T\big\{\text{E}_{(Y^\star,X^\star)}\big\{(\hat m(X^\star)-Y^\star)^2|T\big\}\big\}\\ &= \text{E}_{T,Y^\star,X^\star}\big\{(\hat m(X^\star)-Y^\star)^2\big\} \end{align}

The expected test error may not be of direct interest when the goal is to assess the prediction performance of a single prediction model. The expected test error averages the test errors of all models that can be built from all training datasets, and hence this may be less relevant when the interest is in evaluating one particular model that resulted from a single observed training dataset.

Also, note that building a prediction model involves both parameter estimation and feature selection. Hence the expected test error also evaluates the feature selection procedure (on average). If the expected test error is small, it is an indication that the model building process gives good predictions for future observations (Y*, X*) on average.
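The difference between the test error of one model and the expected test error can be seen in a simulation that repeats the whole model-building step on many training sets. The sketch below assumes a simple linear data-generating model; `simulate` and `test_error_for_new_training_set` are hypothetical helpers, and the numbers of repetitions are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(5)
beta_true = np.array([1.0, 2.0])   # hypothetical data-generating parameters
sigma = 0.5                        # assumed noise standard deviation

def simulate(n):
    """Draw n pairs (X, Y) from the assumed linear model."""
    X = np.column_stack([np.ones(n), rng.normal(size=n)])
    Y = X @ beta_true + rng.normal(scale=sigma, size=n)
    return X, Y

# Large independent sample standing in for future observations (Y*, X*).
X_star, Y_star = simulate(20_000)

def test_error_for_new_training_set(n_train=30):
    """Draw a new training set T, refit the model, and approximate Err_T."""
    X, Y = simulate(n_train)
    beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return np.mean((X_star @ beta_hat - Y_star) ** 2)

# Each entry is the test error of a different model; their mean approximates E_T Err_T.
test_errors = np.array([test_error_for_new_training_set() for _ in range(500)])
expected_test_error = test_errors.mean()
print(expected_test_error, test_errors.min(), test_errors.max())
```

The spread between the smallest and largest entries of `test_errors` shows how much an individual model's test error can differ from the average over training sets.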

Estimation of the expected test error


Randomly divide the training dataset into k approximately equal subsets. Train the model on k-1 of them and compute the prediction error on the remaining subset. Repeat this k times, so that each subset is held out exactly once for computing the prediction error. Average the prediction errors from all k subsets. This is the cross-validation error.

The cross-validation procedure clearly mimics the expected test error rather than the test error: a new model is fitted in every fold, so the result averages over several training sets. Cross-validation therefore estimates the expected test error and not the test error in which we are actually interested.
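The k-fold procedure described above can be sketched in a few lines. This is a minimal version for a least-squares linear model, not a reference implementation; the function name `k_fold_cv_error`, the simulated data and the choice k = 5 are assumptions for the example.

```python
import numpy as np

def k_fold_cv_error(X, Y, k=5, seed=0):
    """k-fold cross-validation estimate of the squared prediction error (least squares)."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(Y))          # random division of the data
    folds = np.array_split(indices, k)         # k approximately equal subsets
    fold_errors = []
    for i, held_out in enumerate(folds):
        # Train on the other k-1 subsets, evaluate on the held-out one.
        train = np.concatenate([f for j, f in enumerate(folds) if j != i])
        beta_hat, *_ = np.linalg.lstsq(X[train], Y[train], rcond=None)
        fold_errors.append(np.mean((Y[held_out] - X[held_out] @ beta_hat) ** 2))
    return float(np.mean(fold_errors)), folds

# Hypothetical data for the usage example.
rng = np.random.default_rng(6)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=n)])
Y = X @ np.array([1.0, 2.0]) + rng.normal(scale=0.5, size=n)

cv_error, folds = k_fold_cv_error(X, Y, k=5)
print(cv_error)
```

Inside the loop `beta_hat` is re-estimated for every fold, so the returned average involves k different fitted models rather than the single model trained on all the data.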


Validating prediction models based on prediction errors can get complicated very quickly because of the different types of prediction error that exist. A data scientist who uses any of these errors during model validation should be aware of which one is being estimated. Ideally, the validation error of interest is the test error. Proxies such as the in-sample error and the expected test error are often used in practice; when they are, their shortcomings should be clearly stated and the reason for using them should be outlined.

If there is enough data, you could, in fact, partition your data into training, validation (for hyperparameter tuning) and perhaps multiple test sets. This way good estimates of the test error are possible, leading to proper model validation.
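A minimal sketch of such a partition is given below, assuming a 60/20/20 split and using the polynomial degree as a stand-in for a hyperparameter. The data, split proportions and helper names `fit` and `mse` are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 300
x = rng.uniform(-1, 1, size=n)
y = 1.0 + 2.0 * x - 1.5 * x**2 + rng.normal(scale=0.3, size=n)  # hypothetical data

# Random 60/20/20 split into training, validation and test indices.
perm = rng.permutation(n)
train, val, test = perm[:180], perm[180:240], perm[240:]

def fit(idx, degree):
    """Least-squares polynomial fit on the observations in idx."""
    X = np.vander(x[idx], degree + 1, increasing=True)
    beta_hat, *_ = np.linalg.lstsq(X, y[idx], rcond=None)
    return beta_hat

def mse(idx, beta_hat):
    """Mean squared prediction error of a fitted model on the observations in idx."""
    X = np.vander(x[idx], len(beta_hat), increasing=True)
    return np.mean((y[idx] - X @ beta_hat) ** 2)

# Choose the polynomial degree (the tuning parameter here) on the validation set.
val_errors = {d: mse(val, fit(train, d)) for d in range(1, 7)}
best_degree = min(val_errors, key=val_errors.get)

# The test set is used once, after all modelling decisions have been made.
final_model = fit(train, best_degree)
test_error = mse(test, final_model)
print(best_degree, test_error)
```

Because the test observations play no role in fitting or tuning, `test_error` estimates the test error of this one final model, which is the quantity the article argues for.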
