Training a model without evaluating it. Don't try this at home! We simply fit the model to all available data: X (N×M samples) and y (N labels).
So in the simplest case, we simply pass our model all available samples and labels to learn from. We will have no idea whether its predictions are any good, at least not yet. The only imaginable use case for this would be if additional data for evaluation became available later, separately, so that at this point in time all we can do is fit our model blindly.
If we measure the “complexity” of our training strategy by the number of times a model is fit to data, this simplest strategy has a complexity of 1, since we fit the model exactly once using all data.
Configuring this in Graphext
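As a rough sketch only (the exact step invocation syntax is glossed over here), the parameters of a training step such as train_classification can be as minimal as naming the column to predict; the dataset, prediction column and saved model name used in this example are indicated as comments:

```
// Step: train_classification
// Input: dataset ds
// Output: predictions in a new column pred, model saved as "my-model"
{
  "target": "churn"
}
```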
Training a model in Graphext essentially only requires specifying the target parameter. If no particular model (CatBoost, linear regression etc.) is configured, Graphext will automatically select a default (best) model for the task (classification or regression). The task itself will be determined by the data type of the target column (the labels): classification if the target is categorical (or boolean), and regression if it is numerical.

The above example, e.g., will train a CatBoost classifier to predict the churn variable in the dataset ds. It will output predictions for the very same samples used to train it (as a new column pred in the dataset ds), and it will save the model under the name "my-model" for future use.

Since we haven't asked for model evaluation, and since we have used all data to train our model, the only thing we can measure is how well the model predicts labels for the same samples it was trained on. By default, Graphext will pick some appropriate metrics for the task and report these as the "train metrics" in the Models section of your project. Note that these are useless as estimates of the model's real performance. Their only purpose is to provide some insight into whether the model was able to learn anything at all from the data. I.e., if its accuracy is bad even on the training set, then either the data doesn't contain any learnable patterns, or the model is not powerful enough to find them (or to memorise them).

Holdout method for model evaluation. This strategy has two steps. A/ Split the dataset into two parts, the train and the test split. Train the model using samples in the train split. Then evaluate the model using a metric of choice on samples and corresponding labels in the test split. B/ Make use of the entire dataset to train the final model.
Shuffling
A single shuffled split of a dataset.
Configuring this in Graphext
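To use the holdout method, we ask for a single split with the desired test proportion in a validate section. A sketch (only the target and validate parameters are taken from this guide; everything else about the step is as before):

```
{
  "target": "churn",
  "validate": {
    // A single train/test split (the holdout method)
    "n_splits": 1,
    // Use 25% of the samples for testing, the remaining 75% for training
    "test_size": 0.25
  }
}
```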
This configuration asks Graphext to validate the model during training, and to do so by splitting the data once into two parts ("n_splits": 1). It also asks that the test split contain 25% of the samples ("test_size": 0.25), meaning the remaining 75% of samples will be used for training.

In all cases, independent of any configuration, the final model in Graphext will always be trained on all data, so step B in the above diagram is always implicit.

Cross-validation for model evaluation. The dataset is divided into 5 equal parts (folds). In each iteration we take 4 folds to train the model, and 1 fold to evaluate its performance using some error metric. The estimated generalisation performance then is the average of the metric over the test folds.
3-fold cross-validation strategy. The data is split into 3 equal parts (folds). In each of the 3 iterations, 2 folds are used for training and 1 for evaluation. The final model is trained on the whole dataset.
Configuring this in Graphext
To use k-fold cross-validation instead, the configuration is almost identical; we simply drop the test_size parameter we used in the holdout method, as in the sketch below.
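Here, the constant hyperparameter under params is a hypothetical example; everything else uses parameter names from this guide:

```
{
  "target": "churn",
  "validate": {
    // No test_size here: 5-fold cross-validation with equal-sized folds
    "n_splits": 5
  },
  // Hypothetical example of fixing a hyperparameter by hand instead of tuning it
  "params": {
    "learning_rate": 0.1
  }
}
```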
Note that we don't provide a test_size parameter in the validate section, because k-fold cross-validation always splits the dataset into n_splits equal-sized parts. Conversely, not providing the test_size parameter is how we indicate in Graphext that we want k-fold cross-validation, rather than the shuffle-split method.

Note we also introduced configuration for selecting some of the model's hyperparameters by hand. If you don't want to tune them (we will learn how in the sections below), you can either leave them at their defaults, or provide constants using the params field.

Repeated holdout (shuffle-split) for model evaluation. Instead of dividing the dataset into k equal parts, we simply split it k times into the desired training and testing proportions randomly.
5-fold repeated holdout (shuffle-split) strategy. This strategy splits the dataset 5 times into the desired training and testing proportions randomly. The final model is trained on the whole dataset.
Holdout as special case of single-split cross-validation
Holdout method as single shuffle-split. We split the dataset only once, e.g. using 80% of samples to train the model, and 20% to evaluate its performance on unseen data.
2-fold CV. Although this would effectively split the dataset only once (into two folds), the model will still be evaluated twice and will get to see all data.
Configuring this in Graphext
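The two variants differ only in their validate sections. A sketch contrasting them (the test proportion is illustrative):

```
// Holdout: a single shuffle-split with an explicit test proportion
"validate": {
  "n_splits": 1,
  "test_size": 0.2
}

// 2-fold CV: two equal folds, each used once for evaluation
"validate": {
  "n_splits": 2
}
```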
The holdout method is thus configured like a shuffle-split with a single iteration, specifying both the number of splits and the test_size of each iteration.

Holdout tuning without model evaluation. We select a winning hyperparameter setting by training all candidates on a single train split and comparing their performance on a single test split. We report the winner's performance (or the average across candidates) as our generalisation metric, and then use its hyperparameters to train the final model on all available data.
Configuring this in Graphext
To tune hyperparameters, we add a tune section to the step's configuration containing the names and ranges of parameters to explore, as in the sketch below.
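In this sketch the hyperparameter names and ranges are hypothetical, and it is assumed that they are listed under a params field inside tune; the validate block uses the parameters described in this guide:

```
{
  "target": "churn",
  "tune": {
    // Hypothetical names and ranges of hyperparameters to explore
    "params": {
      "learning_rate": [0.01, 0.1],
      "depth": [4, 6, 8]
    },
    // Strategy used to pick the winner among the candidates:
    // a single split, 80% for training the candidates, 20% for comparing them
    "validate": {
      "n_splits": 1,
      "test_size": 0.2
    }
  }
}
```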
Note that the validate section in this code snippet is located inside the tune section. This indicates that it is the strategy we want to use to select between different hyperparameter settings, not to evaluate generalisation performance. It has the same name, validate, because it accepts exactly the same parameters (n_splits, test_size etc.).

The above configuration will split the dataset once, using 80% of samples to train our hyperparameter candidates, and 20% to evaluate them and pick the winner. Any performance metrics reported back in the Models section will be biased and optimistic, since we haven't reserved any data for testing.

Cross-validation for tuning without model evaluation. We pick hyperparameters by comparing candidate models using cross-validation on the whole dataset, then train the final model with these hyperparameters using all data. The "estimated" performance is the average performance of the winning model from the tuning stage.
Configuring this in Graphext
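A sketch of the corresponding configuration; as before, the hyperparameter names and ranges under params are hypothetical:

```
{
  "target": "churn",
  "tune": {
    "params": {
      "learning_rate": [0.01, 0.1],
      "depth": [4, 6, 8]
    },
    // 5-fold cross-validation (no test_size) to pick the winning candidate
    "validate": {
      "n_splits": 5
    }
  }
}
```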
Compared to the previous configuration, we simply omit the test_size parameter and select the number of splits (n_splits) to be used for cross-validation.

Three-way holdout for model tuning and evaluation. A/ Train candidate models (hyperparameter combinations) on the train set. Pick the winner by evaluating on the eval set. Measure the performance of the winner on the test set. This is the estimated generalisation performance. B/ To train the final model, first pick a winner again by fitting candidates to the combined train and eval splits, selecting based on performance on the test set. C/ Train the winner on the entire dataset as the final model.
Configuring this in Graphext
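A sketch combining a single split for hyperparameter selection (the inner validate, inside tune) with a single split for evaluation (the outer validate); hyperparameter names and ranges are again hypothetical:

```
{
  "target": "churn",
  "tune": {
    "params": {
      "learning_rate": [0.01, 0.1],
      "depth": [4, 6, 8]
    },
    // Inner split: pick the winning hyperparameters
    "validate": {
      "n_splits": 1,
      "test_size": 0.2
    }
  },
  // Outer split: evaluate the winner on held-out data
  "validate": {
    "n_splits": 1,
    "test_size": 0.2
  }
}
```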
"n_splits": 1
and selecting the proportion allocated for testing (”test_size”: 0.2
, e.g.). Since we want to use a single split to pick our hyperparameters, and a single split again for evaluating, we can combine these in the inner and outer validate
sections of the configuration:CV for model tuning and holdout for evaluation. A/ We split the dataset once into test and tuning sets. We select the best hyperparameters using grid search (e.g.) and cross-validation on the tuning set. After being refit again on the whole tuning set, the winner is evaluated on the held out test set. The result is our estimate of generalisation performance. B/ We use the same grid-search with CV approach on the entire dataset to pick our final hyperparameters. sing the best hyperparameters, we fit the final model on the entire dataset.
Configuring this in Graphext
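A sketch using cross-validation in the inner loop (hyperparameter selection) and a single holdout split in the outer loop (evaluation); hyperparameter names and ranges are hypothetical:

```
{
  "target": "churn",
  "tune": {
    "params": {
      "learning_rate": [0.01, 0.1],
      "depth": [4, 6, 8]
    },
    // Inner loop: 5-fold CV to select the best hyperparameters
    "validate": {
      "n_splits": 5
    }
  },
  // Outer loop: a single holdout split to evaluate the tuned model
  "validate": {
    "n_splits": 1,
    "test_size": 0.2
  }
}
```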
"n_splits": 1
and a test_size
parameter in the outer validate section. To use cross-validation in the inner loop for hyperparameter selection, provide only the number of desired folds ("n_splits": 5
here):Model evaluation with hyperparameter tuning. From a higher-level perspective this is just the normal CV. But now, we use each training fold to find the best hyperparameters of our model.
Nested cross-validation. Example of a 5-3 nested CV. In the outer loop we split the dataset 5 times into two parts: 1/ samples used for testing generalisation performance (test), and 2/ samples used for tuning our model, i.e. for selecting the best hyperparameters (tune). In a second, inner loop, we split the tuning set again 3 times into two parts that are used for 1/ fitting our candidates with different hyperparameters (train) and 2/ selecting the winner among them using their average performance on the evaluation samples (eval). The winner of the inner loop is then fit again on the whole tune set of the outer loop before being tested on the test set.
Detail of a single CV outer loop iteration. A single outer fold (tune) is used to select between 4 different candidate models (hyperparameter combinations). The candidate with the best average performance across all eval folds is selected as the winner. The winning hyperparameter combination is then trained again using all data in the tune fold, and evaluated on the test fold. The whole procedure is then repeated for all outer loop iterations, and the average performance of the winners on the test folds is our final estimate of the tuned model's generalisation performance.
The 4 candidates here correspond to combinations of hyperparameter values, e.g. for an SVM classifier with values of C in {1, 10} and values of γ in {scale, auto}. As shown in the figures above, the winner of each outer loop is the candidate with the best average performance across the inner loop evaluation folds.
Overview of model tuning and evaluation using nested cross-validation. A/ Using nested cross-validation to evaluate the generalisation performance of tuning and fitting the model. B/ Using the same inner cross-validation on the entire dataset to pick hyperparameters for the final model. C/ Using the best hyperparameters to train the final model on the entire dataset.
Nested cross-validation for coders
In scikit-learn, e.g., nested cross-validation can be implemented by wrapping the model in a GridSearchCV object (which internally uses cross-validation to find the best hyperparameters among a set of candidates). We then use a simple cross-validation to evaluate this meta-model's generalisation performance, and refit it to the whole dataset to create the final model (also see the complete example in scikit-learn).
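A minimal sketch of this pattern; the SVM classifier, its hyperparameter grid and the fold counts are illustrative choices:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Meta-model: an inner 3-fold CV selects the best hyperparameters among the candidates
param_grid = {"C": [1, 10], "gamma": ["scale", "auto"]}
metamodel = GridSearchCV(SVC(), param_grid=param_grid, cv=3)

# Outer 5-fold CV: estimates the generalisation performance of the whole
# tune-and-fit procedure (nested cross-validation)
scores = cross_val_score(metamodel, X, y, cv=5)
print(f"Estimated generalisation accuracy: {scores.mean():.3f}")

# Final model: pick the best hyperparameters using the whole dataset and refit
metamodel.fit(X, y)
print(metamodel.best_params_)
```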
Here, cross_val_score estimates the generalisation performance of our entire model training and selection procedure (the grid-search CV). The final metamodel.fit() (here GridSearchCV's fit()) then picks the best hyperparameters using the whole dataset, and refits this best model again on the whole dataset.

Configuring this in Graphext
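Following the same configuration pattern as in the previous sections, a 5-3 nested cross-validation could be sketched as follows (hyperparameter names and ranges are hypothetical; k-fold CV is used in both the inner and outer validate sections):

```
{
  "target": "churn",
  "tune": {
    "params": {
      "learning_rate": [0.01, 0.1],
      "depth": [4, 6, 8]
    },
    // Inner loop: 3-fold CV to select the best hyperparameters
    "validate": {
      "n_splits": 3
    }
  },
  // Outer loop: 5-fold CV to estimate generalisation performance
  "validate": {
    "n_splits": 5
  }
}
```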
A previously trained model can also be evaluated on separate data using the corresponding test step (test_classification instead of train_classification, e.g.).
Summary of recommended model tuning and evaluation strategies. 'Repeated holdout' is synonymous with the 'shuffle-split' method, and nested cross-validation may be of the k-fold or shuffle-split kind.
"validate": {"n_splits": n}
"validate": {"n_splits": n, "test_size": x}
"validate": {"n_splits": 1, "test_size": x}
Stratification