# Model training and evaluation

Amongst existing “machine learning as a service” tools, Graphext aims to be the easiest and most intuitive tool you can use to quickly train a prediction model on tabular data. While simply fitting an ML model to data is relatively straightforward, whether in Graphext or elsewhere, a *good* strategy to create the *best* model, making the most use of your data, should consider two important aspects:

- *Model evaluation*: how to estimate the model’s future performance on unseen data
- *Hyperparameter tuning*: how to best select certain parameters of the model that are not directly learned from data

We here document Graphext’s strategy for model training, tuning and evaluation, which we hope is flexible enough for most use cases, while not overly complex to understand.

For a more in-depth treatment of some of the topics mentioned here, see the references in the final section, particularly “Model Evaluation, Model Selection, and Algorithm Selection in Machine Learning” (Raschka, 2018) for a more conceptual overview, or scikit-learn’s overview from a coding perspective.

The following sections are organised from most simple to most complete model training strategies. We strive to explain these in a way that is useful to any ML practitioner, while also showing how to configure them in Graphext in particular.

## Introduction

To begin with, let’s clarify the scope of this article. Firstly, we will mostly talk about supervised ML models here, i.e. models which, given some samples (each described by a set of features) and corresponding labels, learn to predict unknown labels for samples they haven’t seen before. This could be a model learning from past bank customers’ financial behaviour to predict a person’s credit risk (numerical prediction / regression), or a model predicting whether or not an image contains hot dogs (classification).

Secondly, what we mean by *model evaluation* is estimating how good our model will be at predicting future, unseen data. To do this we need two things:

- A metric, assigning a numerical score to our model indicating how good its predictions are. Metrics are usually calculated by comparing some samples’ true labels with those predicted by the model. This could be something like *accuracy* (proportion of correctly predicted labels), or the mean squared error. The appropriate metric may depend on the use case (e.g. minimising the rate of false negatives may be more important than false positives in a medical diagnostic test, while the opposite may be true in other scenarios).
- Some data the model hasn’t seen during its training. If we evaluated the model using data it already “knows”, we may overestimate how well it will perform on truly new data. Nothing would prevent it from simply memorising the data it has been presented, instead of learning to generalise, i.e. to learn the patterns and relationships between features and labels.
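As a minimal illustration of the first ingredient, the two metrics mentioned above can be written in a few lines of plain Python. The function names and toy labels are made up for this sketch; in practice you would more likely rely on a library implementation.

```python
def accuracy(y_true, y_pred):
    """Proportion of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def mean_squared_error(y_true, y_pred):
    """Average squared difference between true and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Classification: 2 of the 4 predictions are correct
labels_true = ["hot dog", "not hot dog", "hot dog", "hot dog"]
labels_pred = ["hot dog", "hot dog", "hot dog", "not hot dog"]
acc = accuracy(labels_true, labels_pred)

# Regression: squared errors are 1 and 4, so the mean is 2.5
mse = mean_squared_error([3.0, 5.0], [2.0, 7.0])
```

Both functions only compare true with predicted labels; whether the result says anything about future performance depends entirely on which samples those labels come from.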

How to make best use of our data to both train and evaluate a model, while making sure our estimate of its performance isn’t overly optimistic, is what we refer to as a *training strategy*. Let’s start with the simplest possible strategies and build up towards the more complicated cases step-by-step.

## Simple model training and evaluation

Let’s first consider training a model’s internal parameters only, while leaving its hyperparameters fixed, e.g. using its default values, or selecting them manually based on experience (we’ll talk more about hyperparameters in later sections).

### No evaluation

In principle, we could simply use all our data to train a model, without evaluating its performance. I.e. the simplest possible (but not advisable) training strategy is simply:

Here by “entire dataset” we mean a tabular dataset that contains N samples as rows, M features as columns, and a corresponding set of labels (one per sample). In code, these are often referred to as `X` (N×M samples) and `y` (N labels).

So in the simplest case, we simply pass our model all available samples and labels to learn from. We will have no idea if its predictions are any good. Or not yet. The only imaginable use case for this would be if additional data for evaluation were to become available later, separately, so that at this point in time all we can do is fit our model blindly.

If we measure the "complexity" of our training strategy by the number of times a model is fit to data, this simplest strategy has a complexity of 1, since we fit the model exactly once using all data.

## Configuring this in Graphext

In Graphext, we can train a model like this by simply leaving its configuration empty, or almost. For example, training an unspecified model without evaluation is simply:

```
train_classification(ds, {"target": "churn"}) -> (ds.pred, "my-model")
```

For detailed documentation, see for example https://docs.graphext.com/steps/prepare/model/train_classification/.

In Graphext’s training steps we always pass features and labels together as a single dataset (since you’d usually have them together in the same CSV file or database table), and indicate the column containing labels using the `target` parameter.

If no particular model (CatBoost, linear regression etc.) is configured, Graphext will automatically select a default (best) model for the task (classification/regression). The task itself will be determined by the data type of the target column (the labels): classification if the target is categorical (or boolean), and regression if it is numerical.

The above example, e.g., will train a CatBoost classifier to predict the `churn` variable in the dataset `ds`. It will output predictions for the very same samples used to train it (as a new column `pred` in the dataset `ds`), and save the model under the name `"my-model"` for future use.

Since we haven’t asked for model evaluation, and since we have used all data to train our model, the only thing we can measure is how well the model can predict labels for the same samples used to train it. By default Graphext will pick some appropriate metrics for the task and report these as the “train metrics” in the Models section of your project. Note that these are useless as estimates of the model’s real performance. Their only purpose is to give some insight into whether the model was able to learn anything at all from the data. I.e., if its accuracy is bad even on the training set, then either the data doesn’t contain any learnable patterns, or the model is not powerful enough to find them (or to memorise them).

### Holdout method

If all we need is some unseen data to evaluate our model, the simplest possible strategy is to split the dataset into two parts. We use one part to train our model and the other to evaluate it:

Note that when we refer to *splits* of the data, each split is meant to contain both the samples and their features, as well as the corresponding labels. Here, in a first step, the model is fit using samples and labels in the **train** split. The fitted model is then used to make predictions for the samples in the **test** split. These predictions are compared with the test split’s true labels to calculate the estimated generalisation performance of the model.

Now that we have an “unbiased” performance measure for our model (i.e. one calculated using unseen data), we are free to use all available data to create the final model. So in a second step we fit the same model again using the *entire* dataset.
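The two steps above can be sketched in plain Python. This is only an illustration under simplifying assumptions: `holdout_split` and `fit_majority` are hypothetical helpers written for this example (the "model" just predicts the most frequent training label), and the toy churn labels are invented.

```python
import random


def holdout_split(n_samples, test_size=0.25, seed=0):
    """Shuffle sample indices and split them once into train and test."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_size)
    return indices[n_test:], indices[:n_test]


def fit_majority(labels):
    """Toy 'model': always predict the most frequent training label."""
    return max(set(labels), key=labels.count)


y = ["stay"] * 14 + ["churn"] * 6
train_idx, test_idx = holdout_split(len(y), test_size=0.25)

# First step: fit on the train split, score on the unseen test split
model = fit_majority([y[i] for i in train_idx])
test_accuracy = sum(y[i] == model for i in test_idx) / len(test_idx)

# Second step: refit on all data to obtain the final model
final_model = fit_majority(y)
```

Splitting by index keeps each sample and its label together, which is what "preserving the correspondence of samples and labels" amounts to in code.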

## Pessimistic bias

Note that since the final model has been trained with *more* data than was used for its evaluation, it should, if anything, be slightly *better* than what we estimated in the first step. But this is a much better situation than potentially having a *worse* model, which could happen if we evaluated our model without first reserving some test data.

In terms of complexity, in this strategy the model needs to be fit twice (once for evaluation and once more for the final model), so we could say its complexity is 2.

## Shuffling

In the above diagram we have somewhat arbitrarily selected consecutive samples at the beginning and end of the dataset for our train and test splits respectively. By convention, we assume here that the dataset has either been shuffled already (while preserving the correspondence of samples and labels), or that it has no intrinsic order. We could also have illustrated the shuffled and split data like this:

But it will be more convenient to assume that data is shuffled already and use consecutive blocks of data as splits in diagrams from here on.

## Configuring this in Graphext

To configure the simple holdout strategy in Graphext, we can pass the following parameters to one of our model training steps:

```
train_classification(ds, {
    "target": "churn",
    "validate": {
        "n_splits": 1,
        "test_size": 0.25
    }
}) -> (ds.pred, "my-model")
```

This configuration states that we want to `validate` the model during training, and that we want to do so by splitting the data once into two parts (`"n_splits": 1`). It also asks that the test split contain 25% of the samples (`"test_size": 0.25`), meaning the remaining 75% of samples will be used for training.

In all cases, independent of any configuration, the final model in Graphext will always be trained on *all* data, so step B in the above diagram is always implicit.

### Cross-validation

When we have a lot of data, we may simply reserve a proportion of the data for testing and use the remaining data to fit the model’s parameters, as mentioned above. However, when a dataset is already small, this means fitting the model on an even smaller part of it, which may not be enough given the model’s complexity (usually, the more parameters a model has, the more data is needed to optimise it). In addition, evaluating the model on a single random (and small) proportion of the original dataset may result in an unreliable estimate (high variance), as it is not guaranteed that the distribution of data in the test part is similar to that in the training part (or to the greater “population” the samples come from).

To remedy this, a common method to *evaluate* a model’s performance on limited data, and the one used by default in Graphext, is to use *cross-validation (CV)*.

**K-fold cross-validation**

Perhaps the most common form of cross-validation is the K-fold CV. The idea here is to split the dataset into K *folds*, and then use K-1 folds for fitting the model and the remaining fold to evaluate the generalisation performance of the model on data it hasn’t seen before.

For example, in a 5-fold cross-validation we divide the dataset into 5 equal-sized, non-overlapping parts, each containing 20% of the samples. We then run 5 iterations and in each:

- select 4 parts of the dataset (80%) to fit the model
- select 1 part of the dataset (20%) to evaluate its performance

We then report the average of the model’s performance on the 5 test folds as the expected performance of the model on unseen data. The advantage of this method is that the model is guaranteed to get evaluated on *all available* samples.
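A minimal sketch of how such folds could be generated, assuming the data has already been shuffled. `k_fold_indices` is a hypothetical helper written for illustration; scikit-learn's `KFold` provides the same idea as a ready-made class.

```python
def k_fold_indices(n_samples, n_splits):
    """Yield (train, test) index lists for consecutive, non-overlapping folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1
    fold_sizes = [
        n_samples // n_splits + (1 if i < n_samples % n_splits else 0)
        for i in range(n_splits)
    ]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


# 5-fold CV on 10 samples: each iteration trains on 8 and tests on 2
folds = list(k_fold_indices(n_samples=10, n_splits=5))

# Every sample appears in exactly one test fold
tested = sorted(i for _, test in folds for i in test)
```

Averaging a metric over the five `(train, test)` pairs gives the cross-validated estimate described above.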

A complete strategy for using k-fold cross-validation to train and evaluate our model then looks like this:

We use k-fold cross-validation to estimate the model’s performance, and then fit it again using the whole dataset. The "complexity" of this strategy is thus K + 1, where K is the number of folds in the cross-validation.

## Pessimistic Bias

To stress a point already made above, to make best use of all the data available, the *final model will always be fitted again on the entire dataset*. This means the estimated performance may be slightly *pessimistic*, as the model may not have reached its maximum capacity when fitted with only ⅘ of the dataset, for example (perhaps with more data the model would do better).

## Configuring this in Graphext

In Graphext, we can configure cross-validation simply by omitting the `test_size` parameter we used in the holdout method:

```
train_classification(ds, {
    "target": "churn",
    "validate": {
        "n_splits": 3
    },
    "params": {
        "C": 1,
        "gamma": "scale"
    }
}) -> (ds.pred, "my-model")
```

We don’t need the `test_size` parameter in the `validate` section, because k-fold cross-validation always splits the dataset into `n_splits` *equal-sized* parts. Conversely, not providing the `test_size` parameter is how we indicate in Graphext that we want *k-fold* cross-validation, rather than the *shuffle-split* method.

Note we also introduced configuration for selecting some of the model’s hyperparameters by hand. If you don’t want to *tune* them (we will learn how in the sections below), you can either leave them at their defaults, or provide constants using the `params` field.

**Repeated holdout cross-validation**

An alternative to *k-fold* cross-validation is to independently split the dataset k times into two random train and test sets. E.g. 5 such *shuffle-split* iterations, maintaining a proportion of 80% training data and 20% test data, may look like this:

Note that in this case the train/test proportion is independent from the number of splits, i.e. we have the flexibility to e.g. evaluate the model 50 times on random 75%–25% splits of the data. However, it is not guaranteed that the model sees all the data in the process, nor that the splits are different from each other. This method is also sometimes called *shuffle-split* or *Monte Carlo cross-validation*.
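A short sketch of repeated random splitting, to make the difference to k-fold concrete. `shuffle_split_indices` is a hypothetical helper for this illustration; scikit-learn offers a comparable `ShuffleSplit` class.

```python
import random


def shuffle_split_indices(n_samples, n_splits, test_size, seed=0):
    """Yield n_splits independent random (train, test) index splits."""
    rng = random.Random(seed)
    n_test = round(n_samples * test_size)
    for _ in range(n_splits):
        indices = list(range(n_samples))
        rng.shuffle(indices)  # each iteration reshuffles from scratch
        yield indices[n_test:], indices[:n_test]


# 3 iterations with 25% test data: the test proportion does not depend on n_splits
splits = list(shuffle_split_indices(n_samples=20, n_splits=3, test_size=0.25))

# At most 3 * 5 = 15 of the 20 samples can ever land in a test set here,
# so some samples are never used for evaluation
seen_in_test = set()
for _, test in splits:
    seen_in_test.update(test)
```

Because the splits are drawn independently, test sets may overlap, which is why this method gives no guarantee that every sample is evaluated.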

Schematically, the complete strategy for using repeated holdout validation to train and evaluate our model then would be:

The complexity of this strategy is also K + 1, as all that's changed is *how* we split the data, not the number of times we do it.

## Holdout as special case of single-split cross-validation

The *holdout* strategy mentioned above can be seen as a special case of the *shuffle-split* with a single iteration only.

Note that this is different from the special case of a *2-fold cross-validation*, which would also split the dataset only once, but into two equal parts containing 50% of the data each. It would then use two iterations to fit the model on one half while evaluating it on the other:

## Configuring this in Graphext

The repeated holdout (shuffle-split) is configured very similarly to k-fold CV in Graphext. We simply specify the desired `test_size` of each iteration.

```
train_classification(ds, {
    "target": "churn",
    "validate": {
        "n_splits": 3,
        "test_size": 0.2
    },
    "params": {
        "C": 1,
        "gamma": "scale"
    }
}) -> (ds.pred, "my-model")
```

## Model tuning and evaluation

Many ML models have so-called *hyperparameters* that determine *exactly how* the model learns from data. Basic regression models, for example, fit their coefficients to data so as to minimise a certain loss metric. A *regularised* regression additionally implements a penalty on the coefficients (e.g. to keep the coefficients small, or to use fewer coefficients if possible). The strength of this penalty is one such hyperparameter. The maximum allowed depth of a decision tree, or the number of decision trees in a random forest, are other examples.

If you’re lucky, the model in question works well out of the box, with all hyperparameters at their default values. Or if you have a lot of experience training a specific kind of model, you may have some intuition about values that work best in specific scenarios. If neither is the case, or you feel your model could or should perform better than what you’re seeing with the default hyperparameters, you may want to *tune* them. *Tuning* here simply means finding their values automatically, and such that the performance of the model is optimised. In practice, this means using data to select the best from a number of *candidate* *models* having different hyperparameter values.

We may be tempted to simply use the same methodology as explained above to find the best hyperparameters and evaluate our model’s performance. We could e.g. fit 3 different model *candidates* using k-fold cross-validation and select the one that on average had the best performance. The question then arises what its estimated performance would be on *unseen data*. If we simply reported the average from our cross-validation, we would be cheating. The estimate would be biased, because we have used the same data to identify the best model (i.e. to select from our candidates and train it), and to estimate its performance. I.e. we haven’t reserved any data to stand in for future, *unseen* data.

The correct way to both *tune* a model’s hyperparameters *and estimate* its generalisation performance, is to use *nested* cross-validation. The general idea is to evaluate the complete training procedure (hyperparameter selection and model fitting) as we would do in a normal cross-validation, but in each iteration of the evaluation, we split the training set again using an inner cross-validation loop to pick the hyperparameters in a robust way.

But instead of directly jumping in to this rather complex strategy, let's build towards it step-by-step starting from simpler strategies.

### No evaluation (don’t do this)

As we mentioned above, we don’t recommend fitting or tuning a model without evaluating its generalisation performance. Do this only if you plan to collect more data and evaluate the model later on. Having said that, we *can* tune a model, using the holdout method or cross-validation, and simply report the same performance we used to pick our hyperparameters as a (**bad**) estimate of future performance.

#### Holdout method for tuning without evaluation

The simplest method for tuning our hyperparameters would be to use a single holdout set to pick the best from a set of candidate hyperparameter settings:

This is analogous to the simple holdout method mentioned above, but instead of evaluating a *single* model and reporting its performance on the test set, we evaluate *multiple* model candidates (hyperparameter combinations), and use the test set to pick the best among them.

We might perhaps be tempted then to report either the *average* performance on the test set, or the *best* model's performance as our expected generalization ability. But this wouldn't be a good idea. The estimated performance will likely be optimistic, as we used the same data to select between models and to evaluate their generalisation. I.e., we haven’t tested our model training strategy on *unseen* data.
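The optimism can be illustrated with a small simulation. In this sketch (all names and numbers are invented for illustration) every candidate guesses labels at random, so each one's true accuracy is 0.5. Reporting the winner's score on the same test set that was used to pick it will nevertheless usually look better than that.

```python
import random

rng = random.Random(0)
n_test, n_candidates = 50, 20
y_test = [rng.randint(0, 1) for _ in range(n_test)]

# Every "candidate" predicts at random, so none is truly better than 0.5
scores = []
for _ in range(n_candidates):
    y_pred = [rng.randint(0, 1) for _ in range(n_test)]
    scores.append(sum(t == p for t, p in zip(y_test, y_pred)) / n_test)

mean_score = sum(scores) / n_candidates
best_score = max(scores)  # what we would report after "picking the winner"
```

By construction `best_score` can never be lower than `mean_score`; the gap between the two is pure selection effect, not real predictive ability.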

In this strategy our model needs to be fit to data H + 1 times, where H is the number of different hyperparameter combinations to try (H candidates on the training split, and the final model on all data).

## Configuring this in Graphext

To tune hyperparameters in Graphext, simply add a `tune` section to the step’s configuration containing the names and ranges of parameters to explore, like so:

```
train_classification(ds, {
    "target": "churn",
    "tune": {
        "strategy": "grid",
        "params": {
            "C": [1, 10],
            "gamma": ["scale", "auto"]
        },
        "validate": {
            "n_splits": 1,
            "test_size": 0.2
        }
    }
}) -> (ds.pred, "my-model")
```

Note that the `validate` section in this code snippet is located *inside* the `tune` section. This is to indicate that this is the strategy we want to use to select between different hyperparameter settings, not to evaluate generalisation performance. It has the same name, `validate`, because it accepts exactly the same parameters (`n_splits`, `test_size` etc.).

The above configuration will split the dataset once into 80% of samples to be used to train our hyperparameter candidates, and 20% to evaluate and pick the winner. Any performance metrics reported back in the Models section will be biased and optimistic, since we haven’t reserved any data for testing.

#### Cross-validation for tuning without evaluation

We can also use cross-validation instead of the holdout method to select the model's hyperparameters without (properly) evaluating its performance:

This works the same as the holdout for model tuning, but instead of comparing different hyperparameter combinations on a single train/test split of the data, we compare them using the average over multiple folds of the data. Note that since we still don't test our procedure on unseen data, the same caveats regarding bias and overly optimistic metrics apply here as well.

In this strategy the model needs to be fit (K * H) + 1 times in total, for H different candidates and K folds in our cross-validation. A 5-fold cross-validation for exploring 4 different hyperparameter combinations, for example, would result in a total of 21 model fits.

## Configuring this in Graphext

As before, in Graphext we simply omit the `test_size` parameter and select the number of splits (`n_splits`) to be used for cross-validation:

```
train_classification(ds, {
    "target": "churn",
    "tune": {
        "strategy": "grid",
        "params": {
            "C": [1, 10],
            "gamma": ["scale", "auto"]
        },
        "validate": {
            "n_splits": 3
        }
    }
}) -> (ds.pred, "my-model")
```

### Three-way holdout

The simplest strategy to tune *and* evaluate a model on unseen data is the *three-way holdout*. This is a simple extension of the holdout method mentioned in the beginning. It splits the dataset once into dedicated *training*, *validation* and *test* sets. This setup is often used in deep learning contexts, where fitting a single model is very expensive but datasets are huge:

We evaluate the performance of our model training strategy by:

- fitting our candidate models on the **training** set
- picking the best candidate by evaluating them using the **evaluation** set
- calculating the final score of the winning candidate on the **test** set after having re-fit it on the combined **training** and **evaluation** sets

Having an estimate of our model’s performance, we can then apply the same strategy of picking the best hyperparameters using a simple two-way holdout split (combining the train and eval sets), and finally train the model using the best hyperparameters on the whole dataset.

Another way to describe the same strategy, one more closely matching the implementation and configuration in Graphext, would be to say that we split the dataset once into training and test splits for evaluation, and then train the model (including the tuning of hyperparameters), by splitting the training set again into train and evaluation splits.

In this strategy the model needs to be fit (H + 1) + H + 1 = 2H + 2 times in total, for H different hyperparameter combinations. Selecting between 4 different candidates, for example, would result in a total of 10 model fits.

## Configuring this in Graphext

In Graphext a single random holdout split is configured by setting `"n_splits": 1` and selecting the proportion allocated for testing (e.g. `"test_size": 0.2`). Since we want to use a single split to pick our hyperparameters, and a single split again for evaluating, we can combine these in the inner and outer `validate` sections of the configuration:

```
train_classification(ds, {
    "target": "churn",
    "tune": {
        "strategy": "grid",
        "params": {
            "C": [1, 10],
            "gamma": ["scale", "auto"]
        },
        "validate": {
            "n_splits": 1,
            "test_size": 0.2
        }
    },
    "validate": {
        "n_splits": 1,
        "test_size": 0.25
    }
}) -> (ds.pred, "my-model")
```

This would reserve 25% of data for testing. The remaining 75% would be split again into 80% for training and 20% for validation (model selection).
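To make the resulting proportions concrete, here is a plain-Python sketch of the same two-stage split. `three_way_split` is a hypothetical helper, not part of Graphext; it only mirrors the arithmetic of the outer and inner `test_size` values.

```python
import random


def three_way_split(n_samples, test_size=0.25, val_size=0.2, seed=0):
    """Split indices into train, validation and test sets in two stages."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Outer split: reserve the test set first
    n_test = round(n_samples * test_size)
    test, rest = indices[:n_test], indices[n_test:]
    # Inner split: carve the validation set out of what remains
    n_val = round(len(rest) * val_size)
    val, train = rest[:n_val], rest[n_val:]
    return train, val, test


train, val, test = three_way_split(100)
sizes = (len(train), len(val), len(test))  # 60 train, 15 validation, 25 test
```

Note that the inner 20% refers to the remaining 75% of the data, so the validation set holds 15% of the full dataset, not 20%.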

### Holdout cross-validation

Slightly more robust than the previous version, this is essentially the *holdout* method for model evaluation (dedicated train and test sets), but using cross-validation for tuning by splitting the training set repeatedly (in the previous method we simply split it once):

Note that by definition we have only used a single split to evaluate our whole training procedure, which in this case includes hyperparameter tuning. This can result in an unreliable estimate of the model's performance. If the dataset as a whole is large enough, and with it the holdout set, it may be sufficient. Otherwise we can address this with the nested cross-validation approach explained in the next section.

Using a *K*-fold cross-validation to select between *H* different hyperparameter candidates, in this strategy the model needs to be fit (K * H) + 1 times to estimate its performance, another (K * H) times to pick the best hyperparameters, and a final time to fit the best model using all data. This makes for a total of 2HK + 2 model fits. With K=5 and H=4, for example, this adds up to 42 fitting iterations.
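The fit counts quoted so far follow one pattern, which can be checked with a small helper. `count_fits` is a hypothetical function written for this sketch; it treats a single holdout split as one split, and `outer_splits=0` as "no evaluation on held-out data".

```python
def count_fits(n_candidates, inner_splits, outer_splits=0):
    """Number of model fits for H candidates, K inner and N outer splits."""
    tune_once = inner_splits * n_candidates    # K * H fits to pick a winner
    evaluate = outer_splits * (tune_once + 1)  # tune, then refit the winner, per outer split
    return evaluate + tune_once + 1            # plus final tuning and the final model


# Cross-validation for tuning without evaluation: (K * H) + 1
fits_cv_tuning_only = count_fits(n_candidates=4, inner_splits=5)  # 21

# Three-way holdout: 2H + 2
fits_three_way = count_fits(n_candidates=4, inner_splits=1, outer_splits=1)  # 10

# Holdout cross-validation: 2HK + 2
fits_holdout_cv = count_fits(n_candidates=4, inner_splits=5, outer_splits=1)  # 42
```

Seen this way, each strategy differs only in how many splits are used for selecting hyperparameters and how many for evaluating the procedure.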

## Configuring this in Graphext

In Graphext, to use a single shuffle-split partitioning of the dataset for evaluation, select `"n_splits": 1` and a `test_size` parameter in the outer `validate` section. To use cross-validation in the inner loop for hyperparameter selection, provide only the number of desired folds (`"n_splits": 5` here):

```
train_classification(ds, {
    "target": "churn",
    "tune": {
        "strategy": "grid",
        "params": {
            "C": [1, 10],
            "gamma": ["scale", "auto"]
        },
        "validate": {
            "n_splits": 5
        }
    },
    "validate": {
        "n_splits": 1,
        "test_size": 0.2
    }
}) -> (ds.pred, "my-model")
```

The above would reserve a random 20% of holdout data for testing, and use the remaining 80% for training, where training consists of a 5-fold cross-validation to select the best hyperparameters.

### Nested cross-validation

*Nested* cross-validation addresses the issue of wanting to use cross-validation for both

- reliably picking hyperparameters (instead of relying on a single split to select the winner)
- estimating the expected performance of the final model

without leaking data used in hyperparameter selection into our estimations.

This is somewhat tricky to get right. Perhaps the easiest way to understand nested cross-validation is to treat the tuning of the model’s hyperparameters (candidate selection) as part of the regular training procedure. In essence, we treat our model as a kind of meta-model, which now consists of its normal internal parameters as well as its hyperparameters, and training the model simply means fitting both types of parameters given some data.

Seen this way, *evaluating* the generalisation performance of our meta-model does indeed simply consist of a k-fold cross-validation as explained above. E.g. we split the data into 5 equal parts and then iteratively use 4 parts to identify the best model (hyperparameters), and the fifth part to test its performance:

The average of the 5 folds then is how we expect our combined hyperparameter tuning and model fitting procedure to perform on unseen data. Note that the best model, i.e. the best combination of hyperparameters, may be different in each of the k iterations. But this doesn’t matter. We are *not* selecting any of the “winners” from each iteration as our overall best model. The *only* purpose of the cross-validation is to estimate the *generalisation performance* of our overall training procedure, which now includes the hyperparameter tuning.

#### Inner loop

But, *how exactly* do we select the best model in each iteration of our cross-validation loop? As the name suggests, in *nested cross-validation* we use an *outer loop* to evaluate our overall model training procedure, and a second *inner loop* to select hyperparameters (tuning). I.e. for each *outer* loop iteration, we split the training set again into k parts, use k-1 parts to *fit* our different candidates (hyperparameter combinations), and the kth part to *evaluate* each candidate’s generalisation, like so:

In each *inner* loop, we then select as the winner the model with the best average performance across the evaluation folds. We train this model again using all data in the outer loop’s tuning set, evaluate it on the test fold, and report the average of all winners across the outer loop as our best estimate of the overall training procedure.

#### Candidate selection

We can zoom further in to get a clearer idea of how the best candidate is selected in each outer loop iteration. Here is one such iteration:

If we wanted to tune e.g. 2 hyperparameters of our model, and for each parameter try 2 different values, this would lead to 4 candidate models (4 different combinations of hyperparameters). We may e.g. train a support vector machine with regularisation strengths C in {1,10} and values of γ in {scale, auto}. As shown in the figures above, the winner of each outer loop is the candidate with the best average performance across the inner loop evaluation folds.
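The expansion of a parameter grid into candidates can be sketched with the standard library. The grid below reuses the values from the example above; the variable names are chosen for illustration only.

```python
from itertools import product

param_grid = {"C": [1, 10], "gamma": ["scale", "auto"]}

# One candidate per combination of values: 2 x 2 = 4 candidates
names = list(param_grid)
candidates = [
    dict(zip(names, values)) for values in product(*param_grid.values())
]
```

The number of candidates is the product of the number of values per hyperparameter, so grids grow quickly as more parameters or values are added.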

#### Final model

As mentioned above, we don’t select any of the winners from our nested cross-validation as our final model. As in the simple case, once we have an estimate of future generalisation performance, we want to make sure we use as much data as is available to fit our final model.

Now, keeping in mind again that training our “meta-model” consists of tuning the model’s hyperparameters using cross-validation (the inner loop basically), our final model is trained exactly like that:

- Use a single k-fold cross-validation over the *whole dataset* to pick the best hyperparameters
- Use these hyperparameters to train a single model on all available data
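The outer loop, inner loop and final model can be put together in a compact plain-Python sketch. Everything here is an assumption made for illustration: the toy 1-D dataset, the helper names, and the choice of a k-nearest-neighbours classifier (picked because "fitting" it just means storing the training data, with `k` as its only hyperparameter).

```python
import random


def knn_predict(train, x, k):
    """Majority label among the k training points nearest to x."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if 2 * votes > k else 0


def accuracy(train, test, k):
    return sum(knn_predict(train, x, k) == y for x, y in test) / len(test)


def k_folds(data, n_splits):
    """Yield (train, test) lists; fold i takes every n_splits-th sample."""
    for i in range(n_splits):
        test = data[i::n_splits]
        train = [p for j, p in enumerate(data) if j % n_splits != i]
        yield train, test


def tune(data, candidates, n_splits):
    """Inner loop: pick the k with the best mean accuracy across folds."""
    def mean_cv_score(k):
        scores = [accuracy(tr, te, k) for tr, te in k_folds(data, n_splits)]
        return sum(scores) / len(scores)
    return max(candidates, key=mean_cv_score)


rng = random.Random(0)
# Toy data: label is 1 for x > 0.5, with roughly 10% of labels flipped as noise
data = []
for _ in range(60):
    x = rng.random()
    label = int(x > 0.5)
    if rng.random() < 0.1:
        label = 1 - label
    data.append((x, label))

candidates = [1, 3, 5]

# Outer loop: estimate performance of the whole procedure (tuning + fitting)
outer_scores = []
for tuning_set, test_fold in k_folds(data, n_splits=5):
    best_k = tune(tuning_set, candidates, n_splits=3)  # inner loop
    # "Refit" the winner on the full tuning set and score it on the test fold
    outer_scores.append(accuracy(tuning_set, test_fold, best_k))
estimated_accuracy = sum(outer_scores) / len(outer_scores)

# Final model: tune once more on the whole dataset, then use all data
final_k = tune(data, candidates, n_splits=3)
```

The winning `best_k` may differ between outer iterations, and none of them is reused; `final_k` comes from a separate tuning run over all the data, while `estimated_accuracy` describes the procedure as a whole.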

#### Summary

Schematically, then, the whole procedure of using nested cross-validation for model tuning and evaluation looks like this:

Whenever you don’t have a huge amount of data, and execution time is not a great concern, we recommend going for the “full monty” and using nested cross-validation for model selection and evaluation. It can be somewhat slow though. If H is the number of different hyperparameter combinations to try, and N and K are the number of folds in the outer and inner cross-validation loops respectively, this requires fitting the model:

- N * (K * H + 1) times to estimate performance (in each outer iteration, K * H fits to select the winner, plus one refit of the winner on that iteration’s tuning set)
- K * H times to select the best hyperparameters
- 1 time to train the final model

This makes for a total of N(KH + 1) + KH + 1 = (N + 1)(KH + 1) fitting iterations, which reduces to the 2HK + 2 fits of the holdout cross-validation strategy when N = 1. For example, a nested cross-validation with N = 3 outer folds, K = 5 inner folds and 4 different candidate models would result in 4 × 21 = 84 model fits.

## Nested cross-validation for coders

From a coding perspective, and taking scikit-learn as an example, our “meta-model” corresponds to simply wrapping our original model in a `GridSearchCV` object (which internally uses cross-validation to find the best hyperparameters among a set of candidates). We then use a simple cross-validation to evaluate the meta-model’s generalisation performance, and refit it to the whole dataset to create the final model (also see complete example in scikit-learn):

```
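from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.svm import SVC

# X (N x M features) and y (N labels) are assumed to be loaded already
inner_cv = KFold(n_splits=3, shuffle=True, random_state=0)
outer_cv = KFold(n_splits=5, shuffle=True, random_state=0)
hyper_params = {"C": [1, 10, 100], "gamma": [0.01, 0.1]}
model = SVC(kernel="rbf")
# The "meta-model": grid search with an inner CV; refit=True retrains the winner
metamodel = GridSearchCV(estimator=model, param_grid=hyper_params, cv=inner_cv, refit=True)
# Outer CV: returns one score per outer fold; average them for the estimate
generalization_score = cross_val_score(metamodel, X, y, cv=outer_cv)
# Final model: tune on the whole dataset, then refit the best candidate on all data
final_model = metamodel.fit(X, y)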
```

Here, `cross_val_score` estimates generalisation of our entire model training and selection procedure (the grid-search CV). The final `metamodel.fit()` (here GridSearchCV’s `fit()`) then picks the best hyperparameters using the *whole* dataset, and refits this best model again on the whole dataset.

## Configuring this in Graphext

Since we want to use k-fold cross-validation in both the outer loop (evaluation) and the inner loop (hyperparameter selection), we simply select the desired number of splits for both:

```
train_classification(ds, {
    "target": "churn",
    "tune": {
        "strategy": "grid",
        "params": {
            "C": [1, 10],
            "gamma": ["scale", "auto"]
        },
        "validate": {
            "n_splits": 3
        }
    },
    "validate": {
        "n_splits": 5
    }
}) -> (ds.pred, "my-model")
```

This would split the data 5 times (5-fold cross-validation) in the outer loop to evaluate generalisation performance, and in the inner loop (`tune`) use grid search with 3-fold cross-validation to select hyperparameters.

## Evaluation only

If we already have a trained model we may simply want to evaluate it again on new data, perhaps to test whether it still performs as intended, or whether data drift may have led to a degradation in performance.

This is not possible at the moment in Graphext, but will be in the future using a dedicated step (e.g. something like `test_classification` instead of `train_classification`).
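Outside of Graphext, evaluation-only amounts to predicting with the already fitted model on newly labelled data and scoring the result, without refitting. The following is a minimal scikit-learn sketch; the synthetic data, the old/new split and the choice of `LogisticRegression` are assumptions made purely for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Synthetic stand-in: the first 300 rows play the role of the original
# training data, the remaining 100 rows the newly collected, labelled data.
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_old, y_old = X[:300], y[:300]
X_new, y_new = X[300:], y[300:]

# The previously trained model.
trained_model = LogisticRegression(max_iter=1000).fit(X_old, y_old)

# Evaluation only: predict on the new data, no refitting.
new_predictions = trained_model.predict(X_new)
new_accuracy = accuracy_score(y_new, new_predictions)
print(f"accuracy on new data: {new_accuracy:.3f}")
```

Comparing this score with the performance estimated at training time gives a first indication of whether the model has degraded.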

## Summary

We have seen that how to train and evaluate a ML model depends on at least two decisions:

- whether to include the selection of hyperparameters in the training (tuning)
- whether to use a simple holdout strategy (a single split of the dataset into training and test sets) or a more robust cross-validation (multiple k-fold or shuffle-split iterations).

Both decisions constitute a tradeoff between execution time, resulting model performance and robustness of the *estimated* model performance.

A tuned model should in principle be at least as good as an untuned one, and in most cases hopefully better. Tuning may take significantly more time, though, since reliably picking the best hyperparameters requires fitting the same model many times: once for every candidate configuration on every split of the dataset.

Picking a simple holdout method is faster than cross-validation, since it uses fewer iterations over dataset splits. But for it to be reliable, it is best to use it when the dataset is on the large side. If robustness of the estimated performance is important, and the dataset not large, cross-validation may be more advisable.
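To make the tradeoff concrete, the following sketch repeats a single 80/20 holdout with different random seeds and compares the resulting scores with a 5-fold cross-validation on the same data. The small synthetic dataset and the `LogisticRegression` model are assumptions chosen only to keep the example fast:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score, train_test_split

# A deliberately small synthetic dataset (assumption for illustration).
X, y = make_classification(n_samples=150, n_features=8, random_state=0)
model = LogisticRegression(max_iter=1000)

# Holdout: a single 80/20 split, repeated with different seeds to show the spread.
holdout_scores = []
for seed in range(5):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=seed)
    holdout_scores.append(model.fit(X_tr, y_tr).score(X_te, y_te))

# 5-fold cross-validation: five fits, every sample is used for testing exactly once.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
cv_scores = cross_val_score(model, X, y, cv=cv)

print("holdout scores by seed:", [round(s, 3) for s in holdout_scores])
print("cross-validation mean:", round(cv_scores.mean(), 3))
```

Each individual holdout needs one fit, while the cross-validation needs five. On a small dataset, however, the holdout score can change noticeably depending on which samples happen to land in the test set.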

There are no hard rules unfortunately, but as a simple heuristic, if execution time is not a great concern, prefer cross-validation over the simpler holdout. Taking inspiration from Sebastian Raschka’s summary in “Model Evaluation, Model Selection, and Algorithm Selection in Machine Learning" (Raschka, 2018), we can summarize the options available in Graphext in the following figure:

As for how to configure Graphext model training, this can now be summarised succinctly:

- To pick a cross-validation strategy (to estimate performance or pick hyperparameters), use `"validate": {"n_splits": n}`
- To pick a shuffle-split strategy, use `"validate": {"n_splits": n, "test_size": x}`
- The simple holdout method is a special case of shuffle-split using a single split only: `"validate": {"n_splits": 1, "test_size": x}`
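For readers more familiar with scikit-learn, the three configurations above have rough analogues in its `KFold` and `ShuffleSplit` splitters. The sketch below only illustrates the splitting behaviour; whether Graphext uses these classes internally is an assumption, not something stated here:

```python
import numpy as np
from sklearn.model_selection import KFold, ShuffleSplit

X = np.arange(100).reshape(-1, 1)  # 100 dummy samples

# Hypothetical mapping from "validate" settings to scikit-learn splitters.
splitters = {
    "k-fold, n_splits=5": KFold(n_splits=5, shuffle=True, random_state=0),
    "shuffle-split, n_splits=5, test_size=0.2": ShuffleSplit(n_splits=5, test_size=0.2, random_state=0),
    "holdout, n_splits=1, test_size=0.2": ShuffleSplit(n_splits=1, test_size=0.2, random_state=0),
}

# Record the size of the test set produced in every iteration of each strategy.
sizes = {}
for name, splitter in splitters.items():
    sizes[name] = [len(test_idx) for _, test_idx in splitter.split(X)]
    print(f"{name}: test set sizes {sizes[name]}")
```

One difference worth keeping in mind: the test sets of k-fold never overlap, whereas shuffle-split draws a fresh random test set in every iteration, so the same sample may be tested more than once.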

To configure both the splitting strategy used for evaluation and the one used for tuning, use the same parameters inside and outside the `"tune"` section, like so:

```
train_classification(ds, {
"target": "churn",
"tune": {
"strategy": "grid",
"params": {
"C": [1, 10],
"gamma": ["scale", "auto"]
},
"validate": {
"n_splits": 5
}
},
"validate": {
"n_splits": 1,
"test_size": 0.2
}
}) -> (ds.pred, "my-model")
```

This will use 5-fold cross-validation to pick the hyperparameters, and a single holdout split with 20% of the samples for evaluation.

## Stratification

Many classification problems are defined by target variables (labels) with considerable *imbalance*, i.e. the different classes of the target variable are represented in significantly different proportions in the dataset. In this case, it is usually a good idea when splitting the dataset to maintain the same proportions in each split, a method called *stratified sampling*. Graphext applies stratified k-fold cross-validation or shuffle-split by default when the target variable is categorical (classification).
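The effect of stratification can be seen by counting how many minority-class samples end up in each test fold. The sketch below uses scikit-learn's `KFold` and `StratifiedKFold` on toy labels with a 90/10 imbalance; the data is made up for illustration, and the helper `minority_counts` is a hypothetical name:

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Imbalanced toy labels: 90 samples of class 0, 10 of class 1.
y = np.array([0] * 90 + [1] * 10)
X = np.zeros((len(y), 1))  # features are irrelevant for the split itself


def minority_counts(splitter):
    """Number of class-1 samples in each test fold."""
    return [int(y[test_idx].sum()) for _, test_idx in splitter.split(X, y)]


plain = minority_counts(KFold(n_splits=5, shuffle=True, random_state=0))
stratified = minority_counts(StratifiedKFold(n_splits=5, shuffle=True, random_state=0))

print("plain k-fold:     ", plain)
print("stratified k-fold:", stratified)
```

With stratification every test fold receives the same share of the minority class, while plain k-fold leaves this to chance.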

The intention of this overview has been to introduce the principal ways in which Graphext trains and evaluates ML models. We haven’t touched on many finer details, such as the tradeoffs in bias and variance of the strategies mentioned. See the references below for a deeper scientific understanding of these and related methods.

## References

**Conceptual Overview**

- Model Evaluation, Model Selection, and Algorithm Selection in Machine Learning (Raschka, 2018). Also see complementary notebook.

**Academic**

- On Over-fitting in Model Selection and Subsequent Selection Bias in Performance Evaluation (Cawley & Talbot, 2009).
- Estimating classification error rate: Repeated cross-validation, repeated hold-out and bootstrap (Kim, 2019).

**Scikit-learn**

- Cross-validation overview
- Nested versus non-nested cross-validation
- Nested cross-validation chapter in MOOC

**Stack Exchange**