# Model training and evaluation

Among existing “machine learning as a service” tools, Graphext aims to be the easiest and most intuitive tool you can use to quickly train a prediction model on tabular data. While simply fitting an ML model to data is relatively straightforward, whether in Graphext or elsewhere, a *good* strategy for creating the *best* model, making the most of your data, should consider two important aspects:

- *Model evaluation*: how to estimate the model’s future performance on unseen data
- *Hyperparameter tuning*: how to best select certain parameters of the model that are not directly learned from data

Here we document Graphext’s strategy for model training, tuning and evaluation, which we hope is flexible enough for most use cases without being overly complex to understand.

For a more in-depth treatment of some of the topics mentioned here see the references in the final section, particularly “Model Evaluation, Model Selection, and Algorithm Selection in Machine Learning” Raschka (2018) for a more conceptual overview, or scikit-learn’s overview from a coding perspective.

The following sections are organised from most simple to most complete model training strategies. We strive to explain these in a way that is useful to any ML practitioner, while also showing how to configure them in Graphext in particular.

## Introduction

To begin with, let’s clarify the scope of this article. Firstly, we will mostly talk about supervised ML models here, i.e. models which, given some samples (each described by a set of features) and their corresponding labels, learn to predict unknown labels for samples they haven’t seen before. This could be a model learning from past bank customers’ financial behaviour to predict a person’s credit risk (numerical prediction / regression), or a model predicting whether or not an image contains hot dogs (classification).

Secondly, what we mean by *model evaluation* is estimating how good our model will be at predicting future, unseen data. To do this we need two things:

- A metric, assigning a numerical score to our model indicating how good its predictions are. Metrics are usually calculated by comparing some samples’ true labels with those predicted by the model. This could be something like *accuracy* (proportion of correctly predicted labels), or the mean squared error. The appropriate metric may depend on the use case (e.g. minimising the rate of false negatives may be more important than false positives in a medical diagnostic test, while the opposite may be true in other scenarios).
- Some data the model hasn’t seen during its training. If we evaluated the model using data it already “knows”, we may overestimate how well it will perform on truly new data. Nothing would prevent it from simply memorising the data it has been presented, instead of learning to generalise, i.e. to learn the patterns and relationships between features and labels.
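As a minimal sketch of what such metrics compute, the snippet below implements accuracy, the false negative rate and the mean squared error by hand. The function names and the tiny label lists are made up for illustration; they are not part of Graphext.

```python
def accuracy(y_true, y_pred):
    """Proportion of labels predicted correctly (classification)."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def false_negative_rate(y_true, y_pred, positive=1):
    """Share of truly positive samples that the model missed."""
    preds_for_positives = [p for t, p in zip(y_true, y_pred) if t == positive]
    return sum(p != positive for p in preds_for_positives) / len(preds_for_positives)


def mean_squared_error(y_true, y_pred):
    """Average squared difference between true and predicted values (regression)."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical true labels and predictions for six samples
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 0, 0]

print(accuracy(y_true, y_pred))             # 4 of 6 labels are correct
print(false_negative_rate(y_true, y_pred))  # 2 of 4 positives are missed
print(mean_squared_error([1.0, 2.0], [1.0, 4.0]))
```

Note how the same predictions look acceptable by accuracy but poor by false negative rate, which is why the choice of metric depends on the use case.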

How to make best use of our data to both train and evaluate a model, while making sure our estimate of its performance isn’t overly optimistic, is what we refer to as a *training strategy*. Let’s start with the simplest possible strategies and build up towards the more complicated cases step-by-step.

## Simple model training and evaluation

Let’s first consider training a model’s internal parameters only, while leaving its hyperparameters fixed, e.g. using its default values, or selecting them manually based on experience (we’ll talk more about hyperparameters in later sections).

### No evaluation

In principle, we could simply use all our data to train a model, without evaluating its performance. I.e. the simplest possible (but not advisable) training strategy is simply:

Training a model without evaluating it. Don't try this at home!

Here by “entire dataset” we mean a tabular dataset that contains N samples as rows, M features as columns, and a corresponding set of labels (one per sample). In code, these are often referred to as `X` (N×M samples) and `y` (N labels).

So in the simplest case, we simply pass our model all available samples and labels to learn from. We will have no idea whether its predictions are any good. Or not yet. The only imaginable use case for this would be if additional data for evaluation were to become available later, separately, so that at this point in time all we can do is fit our model blindly.

If we measure the “complexity” of our training strategy by the number of times a model is fit to data, this simplest strategy has a complexity of 1, since we fit the model exactly once using all data.

### Holdout method

If all we need is some unseen data to evaluate our model, the simplest possible strategy is to split the dataset into two parts. We use one part to train our model and the other to evaluate it:

Holdout method for model evaluation. This strategy has two steps. A/ Split the dataset into two parts, the train and the test split. Train the model using samples in the train split. Then evaluate the model using a metric of choice on samples and corresponding labels in the test split. B/ Make use of the entire dataset to train the final model.

Note that when we refer to *splits* of the data, each split is meant to contain both the samples and their features, as well as the corresponding labels. Here, in a first step, the model is fit using samples and labels in the **train** split. The fitted model is then used to make predictions for the samples in the **test** split. These predictions are compared with the test split’s true labels to calculate the estimated generalisation performance of the model.

Now that we have an “unbiased” performance measure for our model (i.e. one calculated using unseen data), we are free to use all available data to create the final model. So in a second step we fit the same model again using the *entire* dataset.

Note that since the final model has been trained with *more* data than the model we evaluated, it should, if anything, be slightly *better* than what we estimated in the first step. But this is a much better situation than potentially having a *worse* model than estimated, which could happen if we evaluated our model without first reserving some test data.

In terms of complexity, in this strategy the model needs to be fit twice (once for evaluation and once more for the final model), so we could say its complexity is 2.
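The two steps of the holdout method can be sketched in plain Python as follows. This is only an illustration under stated assumptions: `MeanRegressor` is a deliberately trivial stand-in model, `holdout_split` is a hypothetical helper, and the data is made up. In scikit-learn, `train_test_split` plays the role of the splitting helper.

```python
import random


class MeanRegressor:
    """Toy model: always predicts the mean label seen during training."""
    n_fits = 0  # class-level counter of how often any instance was fit

    def fit(self, X, y):
        MeanRegressor.n_fits += 1
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]


def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def holdout_split(n_samples, test_size, seed=0):
    """Shuffle the row indices once and cut them into train and test parts."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_size)
    return indices[n_test:], indices[:n_test]


# Made-up data: 20 samples with one feature each
X = [[float(i)] for i in range(20)]
y = [2.0 * i + 1.0 for i in range(20)]

# Step A: fit on the train split, score on the unseen test split
train_idx, test_idx = holdout_split(len(X), test_size=0.2)
model = MeanRegressor().fit([X[i] for i in train_idx], [y[i] for i in train_idx])
estimate = mse([y[i] for i in test_idx], model.predict([X[i] for i in test_idx]))

# Step B: refit the same kind of model on the entire dataset
final_model = MeanRegressor().fit(X, y)

print(f"estimated MSE: {estimate:.2f}, model fits: {MeanRegressor.n_fits}")
```

The fit counter ends at 2, matching the complexity stated above.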

### Cross-validation

When we have a lot of data, we may simply reserve a proportion of the data for testing and use the remaining data to fit the model’s parameters, as mentioned above. However, when a dataset is already small, this means fitting the model on an even smaller part of it, which may not be enough given the model’s complexity (usually, the more parameters a model has, the more data is needed to optimise it). In addition, evaluating the model on a single random (and small) proportion of the original dataset may result in an unreliable estimate (high variance), as it is not guaranteed that the distribution of data in the test part is similar to that in the training part (or to the greater “population” the samples come from).

To remedy this, a common method to *evaluate* a model’s performance on limited data, and the one used by default in Graphext, is to use *cross-validation (CV)*.

**K-fold cross-validation**

Perhaps the most common form of cross-validation is the K-fold CV. The idea here is to split the dataset into K *folds*, and then use K-1 folds for fitting the model and the remaining fold to evaluate the generalisation performance of the model on data it hasn’t seen before.

For example, in a 5-fold cross-validation we divide the dataset into 5 equal-sized, non-overlapping parts, each containing 20% of the samples. We then run 5 iterations and in each:

- select 4 parts of the dataset (80%) to fit the model
- select 1 part of the dataset (20%) to evaluate its performance

Cross-validation for model evaluation. The dataset is divided into 5 equal parts (folds). In each iteration we take 4 folds to train the model, and 1 fold to evaluate its performance using some error metric. The estimated generalisation performance then is the average of the metric over the test folds.

We then report the average of the model’s performance on the 5 test folds as the expected performance of the model on unseen data. The advantage of this method is that the model is guaranteed to get evaluated on all available samples.
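To make the fold bookkeeping concrete, here is a small sketch of how the (train, test) index sets of a K-fold split could be generated. The helper `k_fold_indices` is hypothetical and uses contiguous folds without shuffling for readability.

```python
def k_fold_indices(n_samples, k):
    """Yield (train, test) index lists for k contiguous, non-overlapping folds."""
    # Spread any remainder over the first folds so fold sizes differ by at most 1
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


folds = list(k_fold_indices(n_samples=10, k=5))
for i, (train, test) in enumerate(folds):
    print(f"iteration {i}: train={train} test={test}")

# Every sample ends up in exactly one test fold
tested = sorted(i for _, test in folds for i in test)
```

In practice the rows are usually shuffled before being assigned to folds; scikit-learn's `KFold` offers this via `shuffle=True`.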

A complete strategy for using k-fold cross-validation to train and evaluate our model then looks like this:

3-fold cross-validation strategy. The data is split 3 times into 3 equal parts. In each iteration 2 folds are used for training and 1 for evaluation. The final model is trained on the whole dataset.

We use k-fold cross-validation to estimate the model’s performance, and then fit it again using the whole dataset. The “complexity” of this strategy is thus K + 1, where K is the number of folds in the cross-validation.

To stress a point already made above, to make best use of all the data available, the final model will always be fitted again on the entire dataset. This means the estimated performance may be slightly pessimistic, as the model may not have reached its maximum capacity when fitted with only ⅘ of the dataset, for example (perhaps with more data the model would do better).
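The complete strategy (cross-validate for the estimate, then refit on everything) might look like the following sketch. `SlopeRegressor` is a toy stand-in model, the data is invented, and folds are assigned by interleaving row indices; none of this reflects Graphext's internals.

```python
class SlopeRegressor:
    """Toy model: least-squares line through the origin, y = w * x."""
    n_fits = 0  # counts every call to fit()

    def fit(self, x, y):
        SlopeRegressor.n_fits += 1
        self.w_ = sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)
        return self

    def predict(self, x):
        return [self.w_ * a for a in x]


def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def cross_validate(x, y, k):
    """Return the test-fold MSE of each of the k iterations."""
    scores = []
    for fold in range(k):
        # Sample i belongs to test fold i % k (interleaved folds)
        test = [i for i in range(len(x)) if i % k == fold]
        train = [i for i in range(len(x)) if i % k != fold]
        model = SlopeRegressor().fit([x[i] for i in train], [y[i] for i in train])
        preds = model.predict([x[i] for i in test])
        scores.append(mse([y[i] for i in test], preds))
    return scores


# Made-up data: a line with slope 3 plus a small alternating offset
x = [float(i) for i in range(1, 16)]
y = [3.0 * a + (0.5 if i % 2 else -0.5) for i, a in enumerate(x)]

K = 5
scores = cross_validate(x, y, K)
estimated_mse = sum(scores) / K           # reported generalisation estimate
final_model = SlopeRegressor().fit(x, y)  # the model actually kept, fit on all data

print(f"estimated MSE {estimated_mse:.3f}, fits: {SlopeRegressor.n_fits}")
```

None of the K models fit during cross-validation is kept; they exist only to produce the estimate.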

**Repeated holdout cross-validation**

An alternative to *k-fold* cross-validation is to independently split the dataset k times into two random train and test sets. E.g. 5 such *shuffle-split* iterations, maintaining a proportion of 80% training data and 20% test data, may look like this:

Repeated holdout (shuffle-split) for model evaluation. Instead of dividing the dataset into k equal parts, we simply split it k times into the desired training and testing proportions randomly.

Note that in this case the train/test proportion is independent from the number of splits, i.e. we have the flexibility to e.g. evaluate the model 50 times on random 75%–25% splits of the data. However, it is not guaranteed that the model sees all the data in the process, nor that the splits are different from each other. This method is also sometimes called *shuffle-split* or *Monte Carlo cross-validation*.
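A rough sketch of shuffle-split index generation shows both properties: the test proportion is independent of the number of splits, and samples are not guaranteed to be tested equally often. The helper `shuffle_split` is hypothetical; scikit-learn provides `ShuffleSplit` for the same purpose.

```python
import random


def shuffle_split(n_samples, n_splits, test_size, seed=0):
    """Yield n_splits independent random (train, test) index splits."""
    rng = random.Random(seed)
    n_test = round(n_samples * test_size)
    for _ in range(n_splits):
        indices = list(range(n_samples))
        rng.shuffle(indices)
        yield sorted(indices[n_test:]), sorted(indices[:n_test])


# 50 random 75%-25% splits of a 20-sample dataset
splits = list(shuffle_split(n_samples=20, n_splits=50, test_size=0.25))
print(len(splits), len(splits[0][0]), len(splits[0][1]))

# Unlike k-fold, nothing forces each sample to be tested equally often
times_tested = [sum(i in test for _, test in splits) for i in range(20)]
print(times_tested)
```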

Schematically, the complete strategy for using repeated holdout validation to train and evaluate our model then would be:

5-fold repeated holdout (shuffle-split) strategy. This strategy splits the dataset 5 times into desired training and testing proportions randomly. The final model is trained on the whole dataset.

The complexity of this strategy is also $K + 1$, as all that’s changed is how we split the data, not the number of times we do it.

## Model tuning and evaluation

Many ML models have so-called *hyperparameters* that determine *exactly how* the model learns from data. Basic regression models e.g. fit their coefficients to data such as to minimise a certain loss metric. A *regularised* regression additionally implements a penalty on the coefficients (e.g. to keep the coefficients small, or to use fewer coefficients if possible). The strength of this penalty is one such hyperparameter. The maximum allowed depth of a decision tree, or the number of decision trees in a random forest are other examples.

If you’re lucky, the model in question works well out of the box, with all hyperparameters at their default values. Or if you have a lot of experience training a specific kind of model, you may have some intuition about values that work best in specific scenarios. If neither is the case, or you feel your model could or should perform better than what you’re seeing with the default hyperparameters, you may want to *tune* them. *Tuning* here simply means finding their values automatically, and such that the performance of the model is optimised. In practice, this means using data to select the best from a number of *candidate* *models* having different hyperparameter values.

We may be tempted to simply use the same methodology as explained above to find the best hyperparameters and evaluate our model’s performance. We could e.g. fit 3 different model *candidates* using k-fold cross-validation and select the one that on average had the best performance. The question then arises what its estimated performance would be on *unseen data*. If we simply reported the average from our cross-validation, we would be cheating. The estimate would be biased, because we have used the same data to identify the best model (i.e. to select from our candidates and train it), and to estimate its performance. I.e. we haven’t reserved any data to stand in for future, *unseen* data.

The correct way to both *tune* a model’s hyperparameters *and estimate* its generalisation performance, is to use *nested* cross-validation. The general idea is to evaluate the complete training procedure (hyperparameter selection and model fitting) as we would do in a normal cross-validation, but in each iteration of the evaluation, we split the training set again using an inner cross-validation loop to pick the hyperparameters in a robust way.

But instead of directly jumping in to this rather complex strategy, let’s build towards it step-by-step starting from simpler strategies.

### No evaluation (don’t do this) ⛔

As we mentioned above, we don’t recommend fitting or tuning a model without evaluating its generalisation performance. Do this only if you plan to collect more data and evaluate the model later on. Having said that, we *can* tune a model, using the holdout method or cross-validation, and simply report the same performance we used to pick our hyperparameters as a (**bad**) estimate of future performance.

#### Holdout method for tuning without evaluation

The simplest method for tuning our hyperparameters would be to use a single holdout set to pick the best from a set of candidate hyperparameter settings:

Holdout tuning without model evaluation. We select a winning hyperparameter setting by training all candidates on a single **train** split and comparing their performance on a single **test** split. We report the winner’s performance (or the average across candidates) as our generalisation metric, and then use its hyperparameters to train the final model on all available data.

This is analogous to the simple holdout method mentioned above, but instead of evaluating a *single* model and reporting its performance on the test set, we evaluate *multiple* model candidates (hyperparameter combinations), and use the test set to pick the best among them.

We might perhaps be tempted then to report either the *average* performance on the test set, or the *best* model’s performance as our expected generalization ability. But this wouldn’t be a good idea. The estimated performance will likely be optimistic, as we used the same data to select between models and to evaluate their generalisation. I.e., we haven’t tested our model training strategy on *unseen* data.

In this strategy our model needs to be fit to data H + 1 times, where H is the number of different hyperparameter combinations to try (H candidates on the training split, and the final model on all data).
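The optimism of this estimate can be illustrated with a small simulation. It is a deliberately artificial sketch: the labels are coin flips unrelated to the feature, and the H "candidates" are arbitrary lookup rules rather than trained models, so that any score above 50% is due to selection alone. All names and numbers are made up.

```python
import random

rng = random.Random(42)


def make_data(n):
    """Made-up data: the label is a coin flip, unrelated to the feature."""
    features = [rng.randrange(10) for _ in range(n)]
    labels = [rng.randrange(2) for _ in range(n)]
    return features, labels


def accuracy(rule, X, y):
    return sum(rule[x] == label for x, label in zip(X, y)) / len(y)


# H candidate "models": arbitrary lookup rules from feature value to label.
# None of them can truly beat 50% accuracy on this data.
H = 50
candidates = [[rng.randrange(2) for _ in range(10)] for _ in range(H)]

# Pick the winner using the test split ...
X_test, y_test = make_data(40)
scores = [accuracy(rule, X_test, y_test) for rule in candidates]
best_score = max(scores)
winner = candidates[scores.index(best_score)]

# ... then score it again on data that played no part in the selection
X_new, y_new = make_data(10_000)
fresh_score = accuracy(winner, X_new, y_new)

print(f"mean candidate score on test split: {sum(scores) / H:.2f}")
print(f"winner's score on test split:       {best_score:.2f}")
print(f"winner's score on fresh data:       {fresh_score:.2f}")
```

The winner's score on the split used to select it is by construction at least as high as the average candidate's, while its score on fresh data falls back towards chance.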

#### Cross-validation for tuning without evaluation

We can also use cross-validation instead of the holdout method to select the model’s hyperparameters without (properly) evaluating its performance:

Cross-validation for tuning without model evaluation. We pick hyperparameters by comparing candidate models using cross-validation on the whole dataset, then train the final model with these hyperparameters using all data. “Estimated” performance is the average performance of the winning model from tuning stage.

This works the same as the holdout for model tuning, but instead of comparing different hyperparameter combinations on a single train/test split of the data, we compare them using the average over multiple folds of the data. Note, that since we still don’t test our procedure on unseen data, the same caveats regarding bias and overly optimistic metrics apply here as well.

In this strategy the model needs to be fit (K * H) + 1 times in total, for H different candidates and K folds in our cross-validation. A 5-fold cross-validation for exploring 4 different hyperparameter combinations, for example, would result in a total of 21 model fits.
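As a sketch of this strategy, the code below tunes the penalty strength of a toy regularised regression with 5-fold cross-validation and counts the model fits. `RidgeSlope`, the interleaved fold assignment and the data are illustrative assumptions only.

```python
class RidgeSlope:
    """Toy model: y = w * x with an L2 penalty of strength `alpha` on w."""
    n_fits = 0  # counts every call to fit()

    def __init__(self, alpha):
        self.alpha = alpha

    def fit(self, x, y):
        RidgeSlope.n_fits += 1
        sxy = sum(a * b for a, b in zip(x, y))
        sxx = sum(a * a for a in x)
        self.w_ = sxy / (sxx + self.alpha)
        return self

    def predict(self, x):
        return [self.w_ * a for a in x]


def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def cv_score(alpha, x, y, k):
    """Average test-fold MSE of one candidate over k interleaved folds."""
    total = 0.0
    for fold in range(k):
        train = [i for i in range(len(x)) if i % k != fold]
        test = [i for i in range(len(x)) if i % k == fold]
        model = RidgeSlope(alpha).fit([x[i] for i in train], [y[i] for i in train])
        total += mse([y[i] for i in test], model.predict([x[i] for i in test]))
    return total / k


# Made-up data: a line with slope 2 plus a small deterministic offset
x = [float(i) for i in range(1, 21)]
y = [2.0 * a + ((7 * i) % 11 - 5) * 0.2 for i, a in enumerate(x)]

K = 5
candidates = [0.0, 1.0, 10.0, 100.0]            # H = 4 penalty strengths
scores = {alpha: cv_score(alpha, x, y, K) for alpha in candidates}
best_alpha = min(scores, key=scores.get)        # lower MSE is better
final_model = RidgeSlope(best_alpha).fit(x, y)  # refit the winner on all data

print(f"best alpha: {best_alpha}, fits: {RidgeSlope.n_fits}")
```

Keep in mind that `scores[best_alpha]` was used to select the winner, so it is not an unbiased estimate of future performance.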

### Three-way holdout

The simplest strategy to tune *and* evaluate a model on unseen data is the *three-way holdout*. This is a simple extension of the holdout method mentioned in the beginning. It splits the dataset once into dedicated *training*, *validation* and *test* sets. This setup is often used in deep learning contexts, where fitting a single model is very expensive but datasets are huge:

Three-way holdout for model tuning and evaluation. A/ Train candidate models (hyperparameter combinations) on train set. Pick winner by evaluating on eval set. Measure performance of winner on test set. This is the estimated generalisation performance. B/ To train final model, first pick a winner again by fitting models to combined train and eval splits and select based on performance on test set. C/ Train the winner on entire dataset as final model.

We evaluate the performance of our model training strategy by:

- fitting our candidate models on the **training** set
- picking the best candidate by evaluating them using the **evaluation** set
- calculating the final score of the winning candidate on the **test** set after having re-fit it on the combined **training** and **evaluation** sets

Having an estimate of our model’s performance, we can then apply the same strategy of picking the best hyperparameters using a simple two-way holdout split (combining the train and eval sets), and finally train the model using the best hyperparameters on the whole dataset.

Another way to describe the same strategy, one more closely matching the implementation and configuration in Graphext, would be to say that we split the dataset once into training and test splits for evaluation, and then train the model (including the tuning of hyperparameters), by splitting the training set again into train and evaluation splits.

In this strategy the model needs to be fit (H + 1) + H + 1 = 2H + 2 times in total, for H different hyperparameter combinations. Selecting between 4 different candidates, for example, would result in a total of 10 model fits.
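The three stages and the resulting number of fits can be traced with a short sketch. The 60/20/20 split proportions, the toy ridge model and the helper names are assumptions made for illustration.

```python
import random


def fit(alpha, x, y, counter):
    """Toy ridge fit of y = w * x; returns the slope w and counts the fit."""
    counter["fits"] += 1
    return sum(a * b for a, b in zip(x, y)) / (sum(a * a for a in x) + alpha)


def mse(w, x, y):
    return sum((b - w * a) ** 2 for a, b in zip(x, y)) / len(x)


def take(values, indices):
    return [values[i] for i in indices]


def select(candidates, x, y, fit_idx, score_idx, counter):
    """Fit every candidate on fit_idx and return the best one on score_idx."""
    def score(alpha):
        w = fit(alpha, take(x, fit_idx), take(y, fit_idx), counter)
        return mse(w, take(x, score_idx), take(y, score_idx))
    return min(candidates, key=score)


# Made-up data: a line with slope 2 plus a small deterministic offset
x = [float(i) for i in range(1, 31)]
y = [2.0 * a + ((7 * i) % 11 - 5) * 0.2 for i, a in enumerate(x)]
candidates = [0.0, 1.0, 10.0, 100.0]  # H = 4
counter = {"fits": 0}

# Split once into 60% train, 20% eval, 20% test (assumed proportions)
idx = list(range(len(x)))
random.Random(0).shuffle(idx)
train, evaluation, test = idx[:18], idx[18:24], idx[24:]

# A/ pick a winner on eval, refit it on train + eval, score it on test
winner = select(candidates, x, y, train, evaluation, counter)
w = fit(winner, take(x, train + evaluation), take(y, train + evaluation), counter)
estimate = mse(w, take(x, test), take(y, test))

# B/ pick the final hyperparameters with a two-way split (train + eval vs test)
final_alpha = select(candidates, x, y, train + evaluation, test, counter)

# C/ fit the final model on the entire dataset
final_w = fit(final_alpha, x, y, counter)

print(f"estimated MSE {estimate:.3f}, fits: {counter['fits']}")
```

Stage A accounts for H + 1 fits, stage B for H and stage C for one, which adds up to 2H + 2.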

### Holdout cross-validation

Slightly more robust than the previous version, this is essentially the *holdout* method for model evaluation (dedicated train and test sets), but using cross-validation for tuning by splitting the training set repeatedly (in the previous method we simply split it once):

CV for model tuning and holdout for evaluation. A/ We split the dataset once into test and tuning sets. We select the best hyperparameters using grid search (e.g.) and cross-validation on the tuning set. After being refit on the whole tuning set, the winner is evaluated on the held-out test set. The result is our estimate of generalisation performance. B/ We use the same grid search with CV approach on the entire dataset to pick our final hyperparameters. Using the best hyperparameters, we fit the final model on the entire dataset.

Note that by definition we have only used a single split to evaluate our whole training procedure, which in this case includes hyperparameter tuning. This can result in a biased estimate of the model’s performance. If the dataset as a whole is large enough, and with it the holdout set, it may be sufficient. Otherwise we can address this with the nested cross-validation approach explained in the next section.

Using a *K*-fold cross-validation to select between *H* different hyperparameter candidates, in this strategy the model needs to be fit (K * H) + 1 times to estimate its performance, another (K * H) times to pick the best hyperparameters, and a final time to fit the best model using all data. This makes for a total of 2HK + 2 model fits. With K=5 and H=4, for example, this adds up to 42 fitting iterations.

### Nested cross-validation

*Nested* cross-validation addresses the issue of wanting to use cross-validation for both

- reliably picking hyperparameters (instead of relying on a single split to select the winner)
- estimating the expected performance of the final model

without leaking data used in hyperparameter selection into our estimations.

This is somewhat tricky to get right. Perhaps the easiest way to understand nested cross-validation is to treat the tuning of the model’s hyperparameters (candidate selection) as part of the regular training procedure. In essence, we treat our model as a kind of meta-model, which now consists of its normal internal parameters as well as its hyperparameters, and training the model simply means fitting both types of parameters given some data.

Seen this way, *evaluating* the generalisation performance of our meta-model does indeed simply consist of a k-fold cross-validation as explained above. E.g. we split the data into 5 equal parts and then iteratively use 4 parts to identify the best model (hyperparameters), and the fifth part to test its performance:

Model evaluation with hyperparameter tuning. From a higher-level perspective this is just the normal CV. But now, we use each training fold to find the best hyperparameters of our model.

The average of the 5 folds then is how we expect our combined hyperparameter tuning and model fitting procedure to perform on unseen data. Note that the best model, i.e. the best combination of hyperparameters, may be different in each of the k iterations. But this doesn’t matter. We are *not* selecting any of the “winners” from each iteration as our overall best model. The *only* purpose of the cross-validation is to estimate the *generalisation performance* of our overall training procedure, which now includes the hyperparameter tuning.

#### Inner loop

But, *how exactly* do we select the best model in each iteration of our cross-validation loop? As the name suggests, in *nested cross-validation* we use an *outer loop* to evaluate our overall model training procedure, and a second *inner loop* to select hyperparameters (tuning). I.e. for each *outer* loop iteration, we split the training set again into k parts, use k-1 parts to *fit* our different candidates (hyperparameter combinations), and the kth part to *evaluate* each candidate’s generalisation, like so:

Nested cross-validation. Example of a 5-3 nested CV. In the outer loop we split the dataset 5 times into two parts: 1/ samples used for testing generalisation performance, and 2/ samples used for tuning our model (i.e. for selecting the best hyperparameters). In a second inner loop, we split the tuning set again 3 times into two parts that are used for 1/ fitting our candidates with different hyperparameters (train) and 2/ selecting the winner among them using their average performance on the evaluation samples (eval). The winner of the inner loop is then fit again on the whole tune set of the outer loop before being tested on the test set.

In each *inner* loop, we then select as the winner the model with the best average performance across the evaluation folds. We train this model again using all data in the outer loop’s tuning set, evaluate it on the test fold, and report the average test score of all winners across the outer loop as our best estimate of the performance of the overall training procedure.

#### Candidate selection

We can zoom further in to get a clearer idea of how the best candidate is selected in each outer loop iteration. Here is one such iteration:

Detail of a single CV outer loop iteration. A single outer fold (tune) is used to select between 4 different candidate models (hyperparameter combinations). The candidate with the best average performance across all eval folds is selected as the winner. The winning hyperparameter combination is then trained again using all data in the tune fold, and evaluated on the test fold. The whole procedure is then repeated for all outer loop iterations, and the average across all winners on the test folds is our final estimate of the tuned model’s generalisation performance.

If we wanted to tune e.g. 2 hyperparameters of our model, and for each parameter try 2 different values, this would lead to 4 candidate models (4 different combinations of hyperparameters). We may e.g. train a support vector machine with regularisation strengths C in `{1,10}` and values of γ in `{scale, auto}`. As shown in the figures above, the winner of each outer loop is the candidate with the best average performance across the inner loop evaluation folds.
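The way a grid of hyperparameter values expands into candidate models can be sketched in a few lines of Python. The dictionary layout is an assumption chosen for readability, with the values taken from the example above.

```python
from itertools import product

# Values to try for each hyperparameter (taken from the example above)
param_grid = {"C": [1, 10], "gamma": ["scale", "auto"]}

# Every combination of one value per hyperparameter is one candidate model
names = list(param_grid)
candidates = [dict(zip(names, values)) for values in product(*param_grid.values())]

for candidate in candidates:
    print(candidate)
```

The number of candidates is the product of the number of values per hyperparameter, so grids grow quickly as more hyperparameters are tuned.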

#### Final model

As mentioned above, we don’t select any of the winners from our nested cross-validation as our final model. As in the simple case, once we have an estimate of future generalisation performance, we want to make sure we use as much data as is available to fit our final model.

Now, keeping in mind again that training our “meta-model” consists of tuning the model’s hyperparameters using cross-validation (the inner loop basically), our final model is trained exactly like that:

- Use a single k-fold cross-validation over the *whole dataset* to pick the best hyperparameters
- Use these hyperparameters to train a single model on all available data
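Putting the outer loop, the inner loop and the final model together, a compact sketch of nested cross-validation could look like this. The toy ridge model, the interleaved folds, the fold counts and the data are all illustrative assumptions. With scikit-learn, the same pattern is commonly expressed by passing a `GridSearchCV` estimator to `cross_val_score`.

```python
def fit(alpha, x, y, counter):
    """Toy ridge fit of y = w * x; returns the slope w and counts the fit."""
    counter["fits"] += 1
    return sum(a * b for a, b in zip(x, y)) / (sum(a * a for a in x) + alpha)


def mse(w, x, y):
    return sum((b - w * a) ** 2 for a, b in zip(x, y)) / len(x)


def take(values, indices):
    return [values[i] for i in indices]


def folds(indices, k):
    """Yield (train, test) lists; position p in `indices` goes to test fold p % k."""
    for fold in range(k):
        train = [i for p, i in enumerate(indices) if p % k != fold]
        test = [i for p, i in enumerate(indices) if p % k == fold]
        yield train, test


def tune(candidates, x, y, indices, k, counter):
    """Inner loop: return the candidate with the lowest average MSE over k folds."""
    def cv_score(alpha):
        scores = []
        for train, evaluation in folds(indices, k):
            w = fit(alpha, take(x, train), take(y, train), counter)
            scores.append(mse(w, take(x, evaluation), take(y, evaluation)))
        return sum(scores) / k
    return min(candidates, key=cv_score)


# Made-up data: a line with slope 2 plus a small deterministic offset
x = [float(i) for i in range(1, 31)]
y = [2.0 * a + ((7 * i) % 11 - 5) * 0.2 for i, a in enumerate(x)]
candidates = [0.0, 1.0, 10.0, 100.0]  # H = 4 penalty strengths
N, K = 5, 3                           # outer and inner folds
counter = {"fits": 0}
everything = list(range(len(x)))

# Outer loop: estimate how well "tune, then fit" generalises
test_scores = []
for tune_idx, test_idx in folds(everything, N):
    winner = tune(candidates, x, y, tune_idx, K, counter)
    w = fit(winner, take(x, tune_idx), take(y, tune_idx), counter)  # refit winner
    test_scores.append(mse(w, take(x, test_idx), take(y, test_idx)))
estimate = sum(test_scores) / N

# Final model: same tuning procedure on the whole dataset, then one last fit
best_alpha = tune(candidates, x, y, everything, K, counter)
final_w = fit(best_alpha, x, y, counter)

print(f"estimated MSE {estimate:.3f}, best alpha {best_alpha}, fits {counter['fits']}")
```

The counter in this sketch also includes the refit of each outer iteration's winner on its tuning set, so it reports N * (K * H + 1) fits for the estimate plus K * H + 1 for the final model.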

#### Summary

Schematically, then, the whole procedure of using nested cross-validation for model tuning and evaluation looks like this:

Overview of model tuning and evaluation using nested cross-validation. A/ Using nested cross-validation to evaluate the generalisation performance of tuning and fitting the model. B/ Using the same inner cross-validation on the entire dataset to pick hyperparameters for the final model. C/ Using the best hyperparameters to train the final model on the entire dataset.

Whenever you don’t have a huge amount of data, and execution time is not a great concern, we recommend going for the “full monty” and using nested cross-validation for model selection and evaluation. It can be somewhat slow though. If $H$ is the number of different hyperparameter combinations to try, and $N$ and $K$ are the number of folds in the outer and inner cross-validation loops respectively, this requires fitting the model:

- $N * K * H$ times to estimate performance
- $K * H$ times to select the best hyperparameters
- 1 time to train the final model

This makes for a total of $NKH + KH + 1$ fitting iterations. For example, a 3x5 nested cross-validation ($N = 3$ outer and $K = 5$ inner folds) with 4 different candidate models would result in $60 + 20 + 1 = 81$ model fits.
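To compare the cost of the strategies discussed so far, the fit-count formulas from this article can be collected in one small helper. The function name and dictionary keys are made up for illustration; the formulas are the ones stated in the respective sections.

```python
def model_fits(h=1, k=1, n=1):
    """Number of model fits per strategy, following the formulas in this article.

    h: hyperparameter candidates, k: (inner) CV folds, n: outer CV folds.
    """
    return {
        "holdout": 2,
        "cross-validation": k + 1,
        "holdout tuning, no evaluation": h + 1,
        "CV tuning, no evaluation": k * h + 1,
        "three-way holdout": 2 * h + 2,
        "holdout cross-validation": 2 * h * k + 2,
        "nested cross-validation": n * k * h + k * h + 1,
    }


# 4 candidates, 5 inner folds, 5 outer folds
for strategy, fits in model_fits(h=4, k=5, n=5).items():
    print(f"{strategy:32s} {fits:4d}")
```

Since the nested strategy scales with the product of all three numbers, reducing the size of the hyperparameter grid is often the most effective way to keep execution time in check.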

## Evaluation only

If we already have a trained model we may simply want to evaluate it again on new data; perhaps to test whether it still performs as intended, or whether data drift may have led to a degradation in performance.

This is not possible at the moment in Graphext, but will be in the future using a dedicated step (e.g. something like `test_classification` instead of `train_classification`).

## Summary

We have seen that how to train and evaluate an ML model depends on at least two decisions:

- whether to include the selection of hyperparameters in the training (tuning)
- selecting a simple holdout strategy (single split of the dataset into training and test sets), or a more robust cross-validation (multiple k-fold or shuffle-split iterations).

Both decisions constitute a tradeoff between execution time, resulting model performance and robustness of the *estimated* model performance.

A model with tuned hyperparameters should in principle perform at least as well as an untuned one, and in most cases hopefully better. Tuning may take significantly more time, though, since reliably picking the best hyperparameters depends on fitting the same model to different splits of the dataset many times.

Picking a simple holdout method is faster than cross-validation, since it uses fewer iterations over dataset splits. But for it to be reliable, it is best to use it when the dataset is on the large side. If robustness of the estimated performance is important, and the dataset not large, cross-validation may be more advisable.

There are no hard rules unfortunately, but as a simple heuristic, if execution time is not a great concern, prefer cross-validation over the simpler holdout. Taking inspiration from Sebastian Raschka’s summary in “Model Evaluation, Model Selection, and Algorithm Selection in Machine Learning” (Raschka, 2018), we can summarize the options available in Graphext in the following figure:

Summary of recommended model tuning and evaluation strategies. 'Repeated holdout' is synonymous with the 'shuffle-split' method, and nested cross-validation may be of the k-fold or shuffle-split kind.

As for how to configure Graphext model training, this can be summarised succinctly now:

- To pick a cross-validation strategy (to estimate performance or pick hyperparameters), use

`"validate": {"n_splits": n}`

- To pick a shuffle-split strategy:

`"validate": {"n_splits": n, "test_size": x}`

- The simple holdout method is a special case of shuffle-split using a single split only:

`"validate": {"n_splits": 1, "test_size": x}`

To configure both the splitting strategy used for evaluation and the one used for tuning, use the same parameters inside and outside the `"tune"` section, like so:

```
train_classification(ds, {
  "target": "churn",
  "tune": {
    "strategy": "grid",
    "params": {
      "C": [1, 10],
      "gamma": ["scale", "auto"]
    },
    "validate": {
      "n_splits": 5
    }
  },
  "validate": {
    "n_splits": 1,
    "test_size": 0.2
  }
}) -> (ds.pred, "my-model")
```

This will use 5-fold cross-validation to pick the hyperparameters, and a single holdout split with 20% of the samples for evaluation.

The intention of this overview has been to introduce the principal ways in which Graphext trains and evaluates ML models. We haven’t touched on many finer details, such as the tradeoffs in bias and variance of the strategies mentioned. Check the references below for a deeper scientific understanding of these and related methods.

## References

**Conceptual Overview**

- Model Evaluation, Model Selection, and Algorithm Selection in Machine Learning (Raschka, 2018). Also see complementary notebook.

**Academic**

- On Over-fitting in Model Selection and Subsequent Selection Bias in Performance Evaluation (Cawley & Talbot, 2009).
- Estimating classification error rate: Repeated cross-validation, repeated hold-out and bootstrap (Kim, 2019).

**Scikit-learn**

- Cross-validation overview
- Nested versus non-nested cross-validation
- Nested cross-validation chapter in MOOC
