Model training and evaluation

Amongst existing “machine learning as a service” tools, Graphext aims to be the easiest and most intuitive tool you can use to quickly train a prediction model on tabular data. While simply fitting a ML model to data is relatively straightforward, whether in Graphext or elsewhere, a good strategy for creating the best model, while making the most of your data, should consider two important aspects:

  • Model evaluation: how to estimate the model’s future performance on unseen data
  • Hyperparameter tuning: how to best select certain parameters of the model that are not directly learned from data

We here document Graphext’s strategy for model training, tuning and evaluation, which we hope is flexible enough for most use cases, while not overly complex to understand.

For a more in-depth treatment of some of the topics mentioned here see the references in the final section, particularly “Model Evaluation, Model Selection, and Algorithm Selection in Machine Learning” (Raschka, 2018) for a more conceptual overview, or scikit-learn’s overview from a coding perspective.

The following sections are organised from most simple to most complete model training strategies. We strive to explain these in a way that is useful to any ML practitioner, while also showing how to configure them in Graphext in particular.

Introduction

To begin with, let’s clarify the scope of this article. Firstly, we will mostly talk about supervised ML models here; i.e. models which, given some samples, each described by a set of features, and corresponding labels, will predict unknown labels for samples they haven’t seen before. This could be a model learning from past bank customers’ financial behaviour to predict a person’s credit risk (numerical prediction / regression); or a model predicting whether or not an image contains hot dogs (classification).

Secondly, what we mean by model evaluation is estimating how good our model will be at predicting future, unseen data. To do this we need two things:

  1. A metric, assigning a numerical score to our model indicating how good its predictions are. Metrics are usually calculated by comparing some samples’ true labels with those predicted by the model. This could be something like accuracy (proportion of correctly predicted labels), or the mean squared error. The appropriate metric may depend on the use case (e.g. minimising the rate of false negatives may be more important than false positives in a medical diagnostic test; while the opposite may be true in other scenarios).
  2. Some data the model hasn’t seen during its training. If we evaluated the model using data it already “knows”, we may overestimate how well it will perform on truly new data. Nothing would prevent it from simply memorising the data it has been presented with, instead of learning to generalise, i.e. to learn the patterns and relationships between features and labels.

How to make best use of our data to both train and evaluate a model, while making sure our estimate of its performance isn’t overly optimistic, is what we refer to as a training strategy. Let’s start with the simplest possible strategies and build up towards the more complicated cases step-by-step.

Simple model training and evaluation

Let’s first consider training a model’s internal parameters only, while leaving its hyperparameters fixed, e.g. using its default values, or selecting them manually based on experience (we’ll talk more about hyperparameters in later sections).

No evaluation

In principle, we could simply use all our data to train a model, without evaluating its performance. I.e. the simplest possible (but not advisable) training strategy is simply:

Train without evaluation

Training a model without evaluating it. Don't try this at home!

Here by “entire dataset” we mean a tabular dataset that contains N samples as rows, M features as columns, and a corresponding set of labels (one per sample). In code, these are often referred to as X (NxM samples) and y (N labels).

So in the simplest case, we simply pass our model all available samples and labels to learn from. We will have no idea whether its predictions are any good, at least not yet. The only imaginable use case for this would be if additional data for evaluation were to become available later, separately, so that at this point in time all we can do is fit our model blindly.

If we measure the "complexity" of our training strategy by the number of times a model is fit to data, this simplest strategy has a complexity of 1, since we fit the model exactly once using all data.
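Outside of Graphext, the equivalent in plain scikit-learn would be nothing more than a single fit on all available data. The following is a minimal sketch only, using an artificial dataset and an arbitrary model as stand-ins (not what Graphext uses internally):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Stand-in dataset: X is an (N x M) feature matrix, y the N corresponding labels
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Complexity 1: the model is fit exactly once, using the entire dataset
model = RandomForestClassifier(random_state=0).fit(X, y)

# We can only score it on the data it was trained on ("train metrics"),
# which says nothing reliable about its performance on unseen data
print("Accuracy on training data:", model.score(X, y))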

Configuring this in Graphext

In Graphext, we can train a model like this by leaving its configuration (almost) empty. For example, training an unspecified model without evaluation is simply:

train_classification(ds, {"target": "churn"}) -> (ds.pred, "my-model")

For detailed documentation, see for example https://docs.graphext.com/steps/prepare/model/train_classification/.

In Graphext’s training steps we always pass features and labels together as a single dataset (since you’d usually have them together in the same CSV file or database table), and indicate the column containing labels using the target parameter.

If no particular model (CatBoost, linear regression etc.) is configured, Graphext will automatically select a default (best) model for the task (classification/regression). The task itself will be determined by the data type of the target column (the labels): classification if the target is categorical (or boolean), and regression if it is numerical.

The above example, e.g., will train a CatBoost classifier to predict the churn variable in the dataset ds. It will output predictions for the very same samples used to train it (as a new column pred in the dataset ds), and save the model under the name “my-model” for future use.

Since we haven’t asked for model evaluation, and since we have used all data to train our model, the only thing we can measure is how well the model can predict labels for the same samples used to train it. By default Graphext will pick some appropriate metrics for the task and report these as the “train metrics” in the Models section of your project. Note that these are useless as estimates of the model’s real performance. Their only purpose is to gain some insight into whether the model was able to learn anything at all from the data. I.e., if its accuracy is bad even on the training set, then either the data doesn’t contain any learnable patterns, or the model is not powerful enough to find them (or to memorise them).

Holdout method

If all we need is some unseen data to evaluate our model, the simplest possible strategy is to split the dataset into two parts. We use one part to train our model and the other to evaluate it:

Holdout method

Holdout method for model evaluation. This strategy has two steps. A/ Split the dataset into two parts, the train and the test split. Train the model using samples in the train split. Then evaluate the model using a metric of choice on samples and corresponding labels in the test split. B/ Make use of the entire dataset to train the final model.

Note that when we refer to splits of the data, each split is meant to contain both the samples and their features, as well as the corresponding labels. Here, in a first step, the model is fit using samples and labels in the train split. The fitted model is then used to make predictions for the samples in the test split. These predictions are compared with the test split’s true labels to calculate the estimated generalisation performance of the model.

Now that we have an “unbiased” performance measure for our model (i.e. one calculated using unseen data), we are free to use all available data to create the final model. So in a second step we fit the same model again using the entire dataset.

Pessimistic bias

Note that since the final model has been trained with more data than was used for its evaluation, if anything it should be slightly better than what we estimated in the first step. But this is a much better situation than potentially having a worse model, which could happen if we evaluated our model without first reserving some test data.

In terms of complexity, in this strategy the model needs to be fit twice (once for evaluation and once more for the final model), so we could say its complexity is 2.
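For reference, the same two-step strategy can be sketched in plain scikit-learn roughly as follows (dataset and model are again just illustrative stand-ins):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# A/ Split once into train (75%) and test (25%) and estimate generalisation performance
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
print("Estimated generalisation accuracy:", model.score(X_test, y_test))

# B/ Refit the same model on the entire dataset to obtain the final model
final_model = RandomForestClassifier(random_state=0).fit(X, y)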

Shuffling

In the above diagram we have somewhat arbitrarily selected consecutive samples at the beginning and end of the dataset for our train and test splits respectively. By convention, we assume here that the dataset has either been shuffled already (while preserving the correspondence of samples and labels), or that it has no intrinsic order. We could also have illustrated the shuffled and split data like this:

Single shuffled split
A single shuffled split of a dataset.

But it will be more convenient to assume that data is shuffled already and use consecutive blocks of data as splits in diagrams from here on.

Configuring this in Graphext

To configure the simple holdout strategy in Graphext, we can pass the following parameters to one of our model training steps:

train_classification(ds, {
    "target": "churn",
    "validate": {
        "n_splits": 1,
        "test_size": 0.25
    }
}) -> (ds.pred, "my-model")

This configuration tells us that we want to validate the model during training, and that we want to do so by splitting the data once into two parts ("n_splits": 1). It also asks that the test split contain 25% of the samples ("test_size": 0.25), meaning the remaining 75% of samples will be used for training.

In all cases, independent of any configuration, the final model in Graphext will always be trained on all data, so step B in the above diagram is always implicit.

Cross-validation

When we have a lot of data, we may simply reserve a proportion of the data for testing and use the remaining data to fit its parameters, as mentioned above. However, when a dataset is already small, this means fitting the model on an even smaller part of it, which may not be enough given its complexity (usually, the more parameters a model has the more data is needed to optimise it). In addition, evaluating the model on a single random (and small) proportion of the original dataset may result in an unreliable estimate (high variance), as it is not guaranteed that the distribution of data in the test part is similar to that in the training part (or to the greater “population” the samples come from).

To remedy this, a common method to evaluate a model’s performance on limited data, and the one used by default in Graphext, is to use cross-validation (CV).

K-fold cross-validation

Perhaps the most common form of cross-validation is the K-fold CV. The idea here is to split the dataset into K folds, and then use K-1 folds for fitting the model and the remaining fold to evaluate the generalisation performance of the model on data it hasn’t seen before.

For example, in a 5-fold cross-validation we divide the dataset into 5 equal-sized, non-overlapping parts, each containing 20% of the samples. We then run 5 iterations and in each:

  • select 4 parts of the dataset (80%) to fit the model
  • select 1 part of the dataset (20%) to evaluate its performance

Cross-validation intro

Cross-validation for model evaluation. The dataset is divided into 5 equal parts (folds). In each iteration we take 4 folds to train the model, and 1 fold to evaluate its performance using some error metric. The estimated generalisation performance then is the average of the metric over the test folds.

We then report the average of the model’s performance on the 5 test folds as the expected performance of the model on unseen data. The advantage of this method is that the model is guaranteed to get evaluated on all available samples.

A complete strategy for using k-fold cross-validation to train and evaluate our model then looks like this:

3-fold CV strategy

3-fold cross-validation strategy. The data is split 3 times into 3 equal parts. In each iteration 2 folds are used for training and 1 for evaluation. The final model is trained on the whole dataset.

We use k-fold cross-validation to estimate the model’s performance, and then fit it again using the whole dataset. The "complexity" of this strategy is thus K + 1, where K is the number of folds in the cross-validation.
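In scikit-learn terms, this strategy corresponds roughly to the following sketch (again with a stand-in dataset and model):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
model = RandomForestClassifier(random_state=0)

# K-fold cross-validation: K fits, each evaluated on its held-out fold
scores = cross_val_score(model, X, y, cv=3)
print("Estimated generalisation accuracy:", scores.mean())

# One more fit on the entire dataset for the final model (K + 1 fits in total)
final_model = model.fit(X, y)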

Pessimistic Bias

To stress a point already made above, to make best use of all the data available, the final model will always be fitted again on the entire dataset. This means the estimated performance may be slightly pessimistic, as the model may not have reached its maximum capacity when fitted with only ⅘ of the dataset, for example (perhaps with more data the model would do better).

Configuring this in Graphext

In Graphext, we can configure cross-validation simply by omitting the test_size parameter we used in the holdout method:

train_classification(ds, {
    "target": "churn",
    "validate": {
        "n_splits": 3
    },
    "params": {
        "C": 1,
        "gamma": "scale"
    }
}) -> (ds.pred, "my-model")

We don’t need the test_size parameter in the validate section, because k-fold cross-validation always splits the dataset into n_splits equal-sized parts. Conversely, not providing the test_size parameter is how we indicate in Graphext that we want k-fold cross-validation, rather than the shuffle-split method.

Note that we also introduced configuration for selecting some of the model’s hyperparameters by hand. If you don’t want to tune them (we will learn how in the sections below), you can either leave them at their defaults, or provide constants using the params field.

Repeated holdout cross-validation

An alternative to k-fold cross-validation is to independently split the dataset k times into two random train and test sets. E.g. 5 such shuffle-split iterations, maintaining a proportion of 80% training data and 20% test data, may look like this:

Repeated Holdout

Repeated holdout (shuffle-split) for model evaluation. Instead of dividing the dataset into k equal parts, we simply split it k times into the desired training and testing proportions randomly.

Note that in this case the train/test proportion is independent from the number of splits, i.e. we have the flexibility to e.g. evaluate the model 50 times on random 75%–25% splits of the data. However, it is not guaranteed that the model sees all the data in the process, nor that the splits are different from each other. This method is also sometimes called shuffle-split or Monte Carlo cross-validation.

Schematically, the complete strategy for using repeated holdout validation to train and evaluate our model then would be:

Repeated Holdout Full

5-fold repeated holdout (shuffle-split) strategy. This strategy splits the dataset 5 times into desired training and testing proportions randomly. The final model is trained on the whole dataset.

The complexity of this strategy is also K + 1, as all that's changed is how we split the data, not the number of times we do it.
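The scikit-learn counterpart simply swaps the k-fold splitter for a shuffle-split, which makes the train/test proportions independent of the number of iterations (stand-in data and model again):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import ShuffleSplit, cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
model = RandomForestClassifier(random_state=0)

# 5 random 80%/20% splits; the proportions don't depend on the number of splits
cv = ShuffleSplit(n_splits=5, test_size=0.2, random_state=0)
scores = cross_val_score(model, X, y, cv=cv)
print("Estimated generalisation accuracy:", scores.mean())

# Final model: one more fit on all data
final_model = model.fit(X, y)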

Holdout as special case of single-split cross-validation

The holdout strategy mentioned above can be seen as a special case of the shuffle-split with a single iteration only.

Single shuffled split
Holdout method as single shuffle-split. We split the dataset only once, e.g. using 80% of samples to train the model, and 20% to evaluate its performance on unseen data.

Note that this is different from the special case of a 2-fold cross-validation, which would also split the dataset only once, but into two equal parts containing 50% of the data each. It would then use two iterations to fit the model on one half while evaluating it on the other:

2-fold cross-validation
2-fold CV. Although this would effectively split the dataset only once (into two folds), the model will still be evaluated twice and will get to see all data.

Configuring this in Graphext

The repeated holdout (shuffle-split) is configured very similarly to the k-fold CV in Graphext. We simply specify the desired test_size for each iteration:

train_classification(ds, {
    "target": "churn",
    "validate": {
        "n_splits": 3,
        "test_size": 0.2
    },
    "params": {
        "C": 1,
        "gamma": "scale"
    }
}) -> (ds.pred, "my-model")

Model tuning and evaluation

Many ML models have so-called hyperparameters that determine exactly how the model learns from data. Basic regression models e.g. fit their coefficients to data such as to minimise a certain loss metric. A regularised regression additionally implements a penalty on the coefficients (e.g. to keep the coefficients small, or to use fewer coefficients if possible). The strength of this penalty is one such hyperparameter. The maximum allowed depth of a decision tree, or the number of decision trees in a random forest are other examples.

If you’re lucky, the model in question works well out of the box, with all hyperparameters at their default values. Or if you have a lot of experience training a specific kind of model, you may have some intuition about values that work best in specific scenarios. If neither is the case, or you feel your model could or should perform better than what you’re seeing with the default hyperparameters, you may want to tune them. Tuning here simply means finding their values automatically, and such that the performance of the model is optimised. In practice, this means using data to select the best from a number of candidate models having different hyperparameter values.

We may be tempted to simply use the same methodology as explained above to find the best hyperparameters and evaluate our model’s performance. We could e.g. fit 3 different model candidates using k-fold cross-validation and select the one that on average had the best performance. The question then arises what its estimated performance would be on unseen data. If we simply reported the average from our cross-validation, we would be cheating. The estimate would be biased, because we have used the same data to identify the best model (i.e. to select from our candidates and train it), and to estimate its performance. I.e. we haven’t reserved any data to stand in for future, unseen data.

The correct way to both tune a model’s hyperparameters and estimate its generalisation performance, is to use nested cross-validation. The general idea is to evaluate the complete training procedure (hyperparameter selection and model fitting) as we would do in a normal cross-validation, but in each iteration of the evaluation, we split the training set again using an inner cross-validation loop to pick the hyperparameters in a robust way.

But instead of directly jumping in to this rather complex strategy, let's build towards it step-by-step starting from simpler strategies.

No evaluation (don’t do this)

As we mentioned above, we don’t recommend fitting or tuning a model without evaluating its generalisation performance. Do this only if you plan to collect more data and evaluate the model later on. Having said that, we can tune a model, using the holdout method or cross-validation, and simply report the same performance we used to pick our hyperparameters as a (bad) estimate of future performance.

Holdout method for tuning without evaluation

The simplest method for tuning our hyperparameters would be to use a single holdout set to pick the best from a set of candidate hyperparameter settings:

Holdout tuning without evaluation

Holdout tuning without model evaluation. We select a winning hyperparameter setting by training all candidates on a single train split and comparing their performance on a single test split. We report the winner’s performance (or the average across candidates) as our generalisation metric, and then use its hyperparameters to train the final model on all available data.

This is analogous to the simple holdout method mentioned above, but instead of evaluating a single model and reporting its performance on the test set, we evaluate multiple model candidates (hyperparameter combinations), and use the test set to pick the best among them.

We might perhaps be tempted then to report either the average performance on the test set, or the best model's performance as our expected generalization ability. But this wouldn't be a good idea. The estimated performance will likely be optimistic, as we used the same data to select between models and to evaluate their generalisation. I.e., we haven’t tested our model training strategy on unseen data.

In this strategy our model needs to be fit to data H + 1 times, where H is the number of different hyperparameter combinations to try (H candidates on the training split, and the final model on all data).
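A rough scikit-learn equivalent would be a grid search whose internal “validation” is a single shuffle-split, with no outer test set reserved at all. This is only a sketch; the SVC model and its C and gamma parameters are illustrative choices, mirroring the Graphext examples below:

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, ShuffleSplit
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# H = 4 candidates, each fit once on the single train split (H fits),
# plus one refit of the winner on all data (H + 1 fits in total)
params = {"C": [1, 10], "gamma": ["scale", "auto"]}
single_split = ShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
search = GridSearchCV(SVC(kernel="rbf"), params, cv=single_split, refit=True)
search.fit(X, y)

print("Best hyperparameters:", search.best_params_)
# The winner's score on the single test split is optimistically biased,
# since the same data was used to select it
print("Optimistically biased score:", search.best_score_)

Swapping the single shuffle-split for a k-fold splitter (e.g. cv=3) turns this into the cross-validation variant described in the next section.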

Configuring this in Graphext

To tune hyperparameters in Graphext, simply add a tune section to the step’s configuration containing the names and ranges of parameters to explore, like so:

train_classification(ds, {
    "target": "churn",
    "tune": {
        "strategy": "grid",
        "params": {
            "C": [1, 10],
            "gamma": ["scale", "auto"]
        },
        "validate": {
            "n_splits": 1,
            "test_size": 0.2
        }
    }
}) -> (ds.pred, "my-model")

Note that the validate section in this code snippet is located inside the tune section. This is to indicate that this is the strategy we want to use to select between different hyperparameter settings, not to evaluate generalisation performance. It has the same name validate, because it accepts exactly the same parameters (n_splits, test_size etc.).

The above configuration will split the dataset once into 80% of samples to be used to train our hyperparameter candidates, and 20% to evaluate and pick the winner. Any performance metrics reported back in the Models section will be biased and optimistic, since we haven’t reserved any data for testing.

Cross-validation for tuning without evaluation

We can also use cross-validation instead of the holdout method to select the model's hyperparameters without (properly) evaluating its performance:

Cross-validation tuning without evaluation

Cross-validation for tuning without model evaluation. We pick hyperparameters by comparing candidate models using cross-validation on the whole dataset, then train the final model with these hyperparameters using all data. The “estimated” performance is the average performance of the winning model from the tuning stage.

This works the same as the holdout for model tuning, but instead of comparing different hyperparameter combinations on a single train/test split of the data, we compare them using the average over multiple folds of the data. Note, that since we still don't test our procedure on unseen data, the same caveats regarding bias and overly optimistic metrics apply here as well.

In this strategy the model needs to be fit (K * H) + 1 times in total, for H different candidates and K folds in our cross-validation. A 5-fold cross-validation for exploring 4 different hyperparameter combinations, for example, would result in a total of 21 model fits.

Configuring this in Graphext

As before, in Graphext we simply omit the test_size parameter and select the number of splits (n_splits) to be used for cross-validation:

train_classification(ds, {
    "target": "churn",
    "tune": {
        "strategy": "grid",
        "params": {
            "C": [1, 10],
            "gamma": ["scale", "auto"]
        },
        "validate": {
            "n_splits": 3
        }
    }
}) -> (ds.pred, "my-model")

Three-way holdout

The simplest strategy to tune and evaluate a model on unseen data is the three-way holdout. This is a simple extension of the holdout method mentioned in the beginning. It splits the dataset once into dedicated training, validation and test sets. This setup is often used in deep learning contexts, where fitting a single model is very expensive but datasets are huge:

Three-way holdout

Three-way holdout for model tuning and evaluation. A/ Train candidate models (hyperparameter combinations) on train set. Pick winner by evaluating on eval set. Measure performance of winner on test set. This is the estimated generalisation performance. B/ To train final model, first pick a winner again by fitting models to combined train and eval splits and select based on performance on test set. C/ Train the winner on entire dataset as final model.

We evaluate the performance of our model training strategy by:

  1. fitting our candidate models on the training set
  2. picking the best candidate by evaluating them using the evaluation set
  3. calculating the final score of the winning candidate on the test set after having re-fit it on the combined training and evaluation sets

Having an estimate of our model’s performance, we can then apply the same strategy of picking the best hyperparameters using a simple two-way holdout split (combining the train and eval sets), and finally train the model using the best hyperparameters on the whole dataset.

Another way to describe the same strategy, one more closely matching the implementation and configuration in Graphext, would be to say that we split the dataset once into training and test splits for evaluation, and then train the model (including the tuning of hyperparameters), by splitting the training set again into train and evaluation splits.

In this strategy the model needs to be fit (H + 1) + H + 1 = 2H + 2 times in total, for H different hyperparameter combinations. Selecting between 4 different candidates, for example, would result in a total of 10 model fits.
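Step A of this strategy can be sketched in scikit-learn with two consecutive splits and a small manual search over candidates (purely illustrative; steps B and C for the final model are only hinted at in the last line):

from sklearn.datasets import make_classification
from sklearn.model_selection import ParameterGrid, train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Split once into (train + eval) and test, then split (train + eval) again into train and eval
X_tmp, X_test, y_tmp, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
X_train, X_eval, y_train, y_eval = train_test_split(X_tmp, y_tmp, test_size=0.2, random_state=0)

# Fit H candidates on the train split and pick the winner on the eval split
candidates = ParameterGrid({"C": [1, 10], "gamma": ["scale", "auto"]})
best = max(candidates, key=lambda p: SVC(**p).fit(X_train, y_train).score(X_eval, y_eval))

# Re-fit the winner on train + eval and score it on the test split:
# this is the estimated generalisation performance
estimate = SVC(**best).fit(X_tmp, y_tmp).score(X_test, y_test)
print("Winning hyperparameters:", best, "estimated accuracy:", estimate)

# B/ and C/ (simplified here): repeat the candidate selection on a split of the
# whole dataset, then fit the winning hyperparameters on the entire dataset
final_model = SVC(**best).fit(X, y)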

Configuring this in Graphext

In Graphext a single random holdout split is configured by setting "n_splits": 1 and selecting the proportion allocated for testing ("test_size": 0.2, e.g.). Since we want to use a single split to pick our hyperparameters, and a single split again for evaluating, we can combine these in the inner and outer validate sections of the configuration:

train_classification(ds, {
    "target": "churn",
    "tune": {
        "strategy": "grid",
        "params": {
            "C": [1, 10],
            "gamma": ["scale", "auto"]
        },
        "validate": {
            "n_splits": 1,
            "test_size": 0.2
        }
    },
    "validate": {
        "n_splits": 1,
        "test_size": 0.25
    }
}) -> (ds.pred, "my-model")

This would reserve 25% of data for testing. The remaining 75% would be split again into 80% for training and 20% for validation (model selection).

Holdout cross-validation

Slightly more robust than the previous version, this is essentially the holdout method for model evaluation (dedicated train and test sets), but using cross-validation for tuning by splitting the training set repeatedly (in the previous method we simply split it once):

Holdout cross-validation

CV for model tuning and holdout for evaluation. A/ We split the dataset once into test and tuning sets. We select the best hyperparameters using grid search (e.g.) and cross-validation on the tuning set. After being refit again on the whole tuning set, the winner is evaluated on the held out test set. The result is our estimate of generalisation performance. B/ We use the same grid-search with CV approach on the entire dataset to pick our final hyperparameters. C/ Using the best hyperparameters, we fit the final model on the entire dataset.

Note that by definition we have only used a single split to evaluate our whole training procedure, which in this case includes hyperparameter tuning. This can result in a biased estimate of the model's performance. If the dataset as a whole is large enough, and with it the holdout set, this may be sufficient. Otherwise we can address it with the nested cross-validation approach explained in the next section.

Using a K-fold cross-validation to select between H different hyperparameter candidates, in this strategy the model needs to be fit (K * H) + 1 times to estimate its performance, another (K * H) times to pick the best hyperparameters, and a final time to fit the best model using all data. This makes for a total of 2HK + 2 model fits. With K=5 and H=4, for example, this adds up to 42 fitting iterations.
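In scikit-learn terms this corresponds roughly to evaluating a grid search (with internal cross-validation) on a single held-out test set. As before, the data and the SVC model with its parameter grid are only stand-ins:

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# A/ Hold out 20% for testing; tune on the remaining 80% with 5-fold CV
X_tune, X_test, y_tune, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
params = {"C": [1, 10], "gamma": ["scale", "auto"]}
search = GridSearchCV(SVC(kernel="rbf"), params, cv=5, refit=True)
search.fit(X_tune, y_tune)
print("Estimated generalisation accuracy:", search.score(X_test, y_test))

# B/ + C/ Repeat the grid search on the entire dataset; with refit=True the winner
# is then trained on all data, which is our final model
final_model = GridSearchCV(SVC(kernel="rbf"), params, cv=5, refit=True).fit(X, y)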

Configuring this in Graphext

In Graphext, to use a single shuffle-split partitioning of the dataset for evaluation, select "n_splits": 1 and a test_size parameter in the outer validate section. To use cross-validation in the inner loop for hyperparameter selection, provide only the number of desired folds ("n_splits": 5 here):

train_classification(ds, {
    "target": "churn",
    "tune": {
        "strategy": "grid",
        "params": {
            "C": [1, 10],
            "gamma": ["scale", "auto"]
        },
        "validate": {
            "n_splits": 5
        }
    },
    "validate": {
        "n_splits": 1,
        "test_size": 0.2
    }
}) -> (ds.pred, "my-model")

The above would reserve a random 20% of holdout data for testing, and use the remaining 80% for training, where training consists of a 5-fold cross-validation to select the best hyperparameters.

Nested cross-validation

Nested cross-validation addresses the issue of wanting to use cross-validation for both

  • reliably picking hyperparameters (instead of relying on a single split to select the winner)
  • estimating the expected performance of the final model

without leaking data used in hyperparameter selection into our estimations.

This is somewhat tricky to get right. Perhaps the easiest way to understand nested cross-validation is to treat the tuning of the model’s hyperparameters (candidate selection) as part of the regular training procedure. In essence, we treat our model as a kind of meta-model, which now consists of its normal internal parameters as well as its hyperparameters, and training the model simply means fitting both types of parameters given some data.

Seen this way, evaluating the generalisation performance of our meta-model does indeed simply consist of a k-fold cross-validation as explained above. E.g. we split the data into 5 equal parts and then iteratively use 4 parts to identify the best model (hyperparameters), and the fifth part to test its performance:

Model evaluation with hyperparameter tuning

Model evaluation with hyperparameter tuning. From a higher-level perspective this is just the normal CV. But now, we use each training fold to find the best hyperparameters of our model.

The average of the 5 folds then is how we expect our combined hyperparameter tuning and model fitting procedure to perform on unseen data. Note that the best model, i.e. the best combination of hyperparameters, may be different in each of the k iterations. But this doesn’t matter. We are not selecting any of the “winners” from each iteration as our overall best model. The only purpose of the cross-validation is to estimate the generalisation performance of our overall training procedure, which now includes the hyperparameter tuning.

Inner loop

But, how exactly do we select the best model in each iteration of our cross-validation loop? As the name suggests, in nested cross-validation we use an outer loop to evaluate our overall model training procedure, and a second inner loop to select hyperparameters (tuning). I.e. for each outer loop iteration, we split the training set again into k parts, use k-1 parts to fit our different candidates (hyperparameter combinations), and the kth part to evaluate each candidate’s generalisation, like so:

Nested cross-validation

Nested cross-validation. Example of a 5-3 nested CV. In the outer loop we split the dataset 5 times into two parts: 1/ samples used for testing generalisation performance, and 2/ samples used for tuning our model (i.e. for selecting the best hyperparameters). In a second, inner loop, we split the tuning set again 3 times into two parts used for 1/ fitting our candidates with different hyperparameters (train samples) and 2/ selecting the winner among them using their average performance on the evaluation samples. The winner of the inner loop is then fit again on the whole tune set of the outer loop before being tested on the test set.

In each inner loop, we then select as the winner the model with the best average performance across the evaluation folds. We train this model again using all data in the outer loop’s tuning set, evaluate it on the test fold, and report the average of all winners across the outer loop as our best estimate of the overall training procedure.

Candidate selection

We can zoom further in to get a clearer idea of how the best candidate is selected in each outer loop iteration. Here is one such iteration:

Single outer loop iteration

Detail of a single CV outer loop iteration. A single outer fold (tune) is used to select between 4 different candidate models (hyperparameter combinations). The candidate with the best average performance across all eval folds is selected as the winner. The winning hyperparameter combination is then trained again using all data in the tune fold, and evaluated on the test fold. The whole procedure is then repeated for all outer loop iterations, and the average across all winners on the test folds is our final estimate of the tuned model’s generalisation performance.

If we wanted to tune e.g. 2 hyperparameters of our model, and for each parameter try 2 different values, this would lead to 4 candidate models (4 different combinations of hyperparameters). We may e.g. train a support vector machine with regularisation strengths C in {1,10} and values of γ in {scale, auto}. As shown in the figures above, the winner of each outer loop is the candidate with the best average performance across the inner loop evaluation folds.

Final model

As mentioned above, we don’t select any of the winners from our nested cross-validation as our final model. As in the simple case, once we have an estimate of future generalisation performance, we want to make sure we use as much data as is available to fit our final model.

Now, keeping in mind again that training our “meta-model” consists of tuning the model’s hyperparameters using cross-validation (the inner loop basically), our final model is trained exactly like that:

  • Use a single k-fold cross-validation over the whole dataset to pick the best hyperparameters
  • Use these hyperparameters to train a single model on all available data

Summary

Schematically, then, the whole procedure of using nested cross-validation for model tuning and evaluation looks like this:

Nested cross-validation overview

Overview of model tuning and evaluation using nested cross-validation. A/ Using nested cross-validation to evaluate the generalisation performance of tuning and fitting the model. B/ Using the same inner cross-validation on the entire dataset to pick hyperparameters for the final model. C/ Using the best hyperparameters to train the final model on the entire dataset.

Whenever you don’t have a huge amount of data, and execution time is not a great concern, we recommend going for the “full monty” and using nested cross-validation for model selection and evaluation. It can be somewhat slow though. If H is the number of different hyperparameter combinations to try, and N, K the number of folds in the outer and inner cross-validation loops, this requires fitting the model:

  • N * K * H times to estimate performance
  • K * H times to select the best hyperparameters
  • 1 time to train the final model

This makes for a total of NKH + KH + 1 fitting iterations. For example, a 5x3 nested cross-validation (5 outer and 3 inner folds) with 4 different candidate models would result in 5 * 3 * 4 + 3 * 4 + 1 = 73 model fits.

Nested cross-validation for coders

From a coding perspective, and taking scikit-learn as an example, our “meta-model” corresponds to simply wrapping our original model in a GridSearchCV object (which internally uses cross-validation to find the best hyperparameters among a set of candidates). We then use a simple cross-validation to evaluate the meta-model’s generalisation performance, and refit it to the whole dataset to create the final model (also see complete example in scikit-learn):

from sklearn.model_selection import KFold, GridSearchCV, cross_val_score
from sklearn.svm import SVC

inner_cv = KFold(n_splits=3, shuffle=True, random_state=0)
outer_cv = KFold(n_splits=5, shuffle=True, random_state=0)

hyper_params = {"C": [1, 10, 100], "gamma": [0.01, 0.1]}

model = SVC(kernel="rbf")
metamodel = GridSearchCV(estimator=model, param_grid=hyper_params, cv=inner_cv, refit=True)

# Outer loop: estimate how well the whole tuning + fitting procedure generalises
# (X being the N x M samples and y the N labels, as introduced above)
generalization_score = cross_val_score(metamodel, X, y, cv=outer_cv)

# Final model: pick the best hyperparameters on the whole dataset and,
# since refit=True, retrain the winner on all data
final_model = metamodel.fit(X, y)

Here, cross_val_score estimates generalisation of our entire model training and selection procedure (the grid-search CV). The final metamodel.fit() (here GridSearchCV’s fit()) then picks the best hyperparameters using the whole dataset, and refits this best model again on the whole dataset.

Configuring this in Graphext

Since we want to use k-fold cross-validation in both the outer loop (evaluation) and the inner loop (hyperparameter selection), we simply select the desired number of splits for both:

train_classification(ds, {
    "target": "churn",
    "tune": {
        "strategy": "grid",
        "params": {
            "C": [1, 10],
            "gamma": ["scale", "auto"]
        },
        "validate": {
            "n_splits": 3
        } 
    },
    "validate": {
        "n_splits": 5
    }
}) -> (ds.pred, "my-model")

This would split the data 5 times (5-fold cross-validation) in the outer loop to evaluate generalisation performance, and in the inner loop (”tune”) use grid search with 3-fold cross-validation to select hyperparameters.

Evaluation only

If we already have a trained model we may simply want to evaluate it again on new data; perhaps to test whether it still performs as intended, or whether data drift may have led to a degradation in performance.

This is not possible at the moment in Graphext, but will be in the future using a dedicated step (something like test_classification instead of train_classification e.g.).

Summary

We have seen that how to train and evaluate a ML model depends on at least two decisions:

  1. whether to include the selection of hyperparameters in the training (tuning)
  2. selecting a simple holdout strategy (single split of the dataset into training and test sets), or a more robust cross-validation (multiple k-fold or shuffle-split iterations).

Both decisions constitute a tradeoff between execution time, resulting model performance and robustness of the estimated model performance.

Tuning hyperparameters should in principle result in a model at least as good as, and in most cases hopefully better than, not tuning. It may take significantly more time, though, since reliably picking the best hyperparameters depends on fitting the same model to different splits of the dataset many times.

Picking a simple holdout method is faster than cross-validation, since it uses fewer iterations over dataset splits. But for it to be reliable, it is best to use it when the dataset is on the large side. If robustness of the estimated performance is important, and the dataset not large, cross-validation may be more advisable.

There are no hard rules unfortunately, but as a simple heuristic, if execution time is not a great concern, prefer cross-validation over the simpler holdout. Taking inspiration from Sebastian Raschka’s summary in “Model Evaluation, Model Selection, and Algorithm Selection in Machine Learning" (Raschka, 2018), we can summarize the options available in Graphext in the following figure:

Strategy summary

Summary of recommended model tuning and evaluation strategies. "Repeated holdout" is synonymous with the "shuffle-split" method, and nested cross-validation may be of the k-fold or shuffle-split kind.

As for how to configure Graphext model training, this can be summarised succinctly now:

  • To pick a cross-validation strategy (to estimate performance or pick hyperparameters), use
    "validate": {"n_splits": n}
  • To pick a shuffle-split strategy:
    "validate": {"n_splits": n, "test_size": x}
  • The simple holdout method is a special case of shuffle-split using a single split only:
    "validate": {"n_splits": 1, "test_size": x}

To configure both the splitting strategy used for evaluation and the one used for tuning, use the same parameters inside and outside the "tune" section, like so:

train_classification(ds, {
    "target": "churn",
    "tune": {
        "strategy": "grid",
        "params": {
            "C": [1, 10],
            "gamma": ["scale", "auto"]
        },
        "validate": {
            "n_splits": 5
        } 
    },
    "validate": {
        "n_splits": 1,
    "test_size": 0.2
    }
}) -> (ds.pred, "my-model")

This will use 5-fold cross-validation to pick the hyperparameters, and a single holdout split with 20% of the samples for evaluation.

Stratification

Many classification problems are defined by target variables (labels) with considerable class imbalance, i.e. the different classes of the target variable are represented in significantly different proportions in the dataset. In this case, when splitting the dataset it is usually a good idea to try and maintain the same proportions in each split, a method called stratified sampling. Graphext applies stratified k-fold cross-validation or shuffle-split by default when the target variable is categorical (classification).
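To illustrate the idea (this is not Graphext’s internal implementation), scikit-learn provides stratified splitters that preserve class proportions in each fold:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, StratifiedShuffleSplit

# Stand-in imbalanced dataset: roughly 90% of one class, 10% of the other
X, y = make_classification(n_samples=1000, n_features=20, weights=[0.9, 0.1], random_state=0)

# Each test fold keeps approximately the same 90/10 class proportions
for train_idx, test_idx in StratifiedKFold(n_splits=5).split(X, y):
    print("Test fold class proportions:", np.bincount(y[test_idx]) / len(test_idx))

# The shuffle-split analogue, usable as the cv argument of cross_val_score or GridSearchCV
splitter = StratifiedShuffleSplit(n_splits=5, test_size=0.2, random_state=0)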

The intention of this overview has been to introduce the principal ways in which Graphext trains and evaluates ML models. We haven’t touched on many finer details regarding the tradeoffs in bias and variance of the mentioned strategies etc. Check the references below for a deeper scientific understanding of these and related methods.

References

Conceptual Overview

  • Model Evaluation, Model Selection, and Algorithm Selection in Machine Learning (Raschka, 2018).

Academic

  • On Over-fitting in Model Selection and Subsequent Selection Bias in Performance Evaluation (Cawley & Talbot, 2010).
  • Estimating classification error rate: Repeated cross-validation, repeated hold-out and bootstrap (Kim, 2009).

Scikit-learn

  • Cross-validation: evaluating estimator performance (scikit-learn user guide): https://scikit-learn.org/stable/modules/cross_validation.html
