The output will consist of a new column with the trained model’s predictions on the training data, as well as a saved and named model file that can be used in other projects for prediction of new data.

Optionally, if a second output column name is provided, the model’s predicted probabilities will also be returned.

A detailed guide on how to configure this step for model tuning and performance evaluation can be found here.

model
string
default: "CatboostClassifier"

Train a CatBoost classifier, i.e. gradient-boosted decision trees with support for categorical variables and missing values.

target
string
required

Target variable (labels). Name of the column that contains your target values (labels).

positive_class
[string, null]

Name of the positive class. In binary classification, usually the class you’re most interested in, for example the label/class corresponding to successful lead conversion in a lead score model, the class corresponding to a customer who has churned in a churn prediction model, etc.

If provided, the model’s performance (accuracy, precision, recall) on this class will be measured automatically, in addition to averages across all classes. If not provided, only summary metrics will be reported.

max_classes
integer
default: "10"

Maximum number of classes in the target variable. If there are more classes than this, the least frequent classes will be grouped together into a single class called “others”. Reducing the number of classes in the target variable can help improve model performance, especially when the number of classes is very large, some classes are very rare, or the dataset doesn’t have sufficient samples for all classes. Raising this significantly might lead to much longer training times.

Values must be in the following range:

2 ≤ max_classes ≤ 100
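
The grouping of rare classes described above can be sketched in a few lines of Python. This is a minimal illustration, not Graphext's implementation: the function name `group_rare_classes` is hypothetical, and the sketch assumes that the merged "others" class counts towards the max_classes limit, which the description above does not state.

```python
from collections import Counter

def group_rare_classes(labels, max_classes=10, other_label="others"):
    """Keep the most frequent classes and merge the rest into one label."""
    counts = Counter(labels)
    if len(counts) <= max_classes:
        return list(labels)  # nothing to merge
    # Assumption: one slot is reserved for the merged class,
    # so the total number of classes stays at max_classes.
    keep = {label for label, _ in counts.most_common(max_classes - 1)}
    return [label if label in keep else other_label for label in labels]

labels = ["a"] * 5 + ["b"] * 4 + ["c"] * 2 + ["d"]
grouped = group_rare_classes(labels, max_classes=3)
print(Counter(grouped))
```

Here the two rarest classes ("c" and "d") end up in a single "others" class, leaving three classes in total.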

feature_importance
[string, boolean, null]
default: "native"

Importance of each feature in the model. Whether and how to measure each feature’s contribution to the model’s predictions. The higher the value, the more important the feature was in the model. Only relative values are meaningful, i.e. the importance of a feature relative to other features in the model.

Also note that feature importance is usually meaningful only for models that fit the data well.

The default (null, true or "native") uses the classifier’s native feature importance measure, e.g. prediction-value-change in the case of CatBoost, Gini importance in the case of scikit-learn’s DecisionTreeClassifier, and the mean of absolute coefficients in the case of logistic regression.

When set to "permutation", uses permutation importance, i.e. measures the decrease in model score when a single feature’s values are randomly shuffled. This is considerably slower than native feature importance, since the model needs to be evaluated an additional k*n times, where k is the number of features and n the number of repetitions to average over. On the positive side, it is model-agnostic and doesn’t suffer from the bias towards high-cardinality features that affects some tree-based feature importances. On the negative side, it can be sensitive to strongly correlated features, because the unshuffled correlated variable is still available to the model when the original variable is shuffled.

When set to false, no feature importance will be calculated.

Values must be one of the following:

  • True
  • False
  • native
  • permutation
  • null
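
The k*n cost of permutation importance mentioned above is easiest to see in code. The following is a simplified sketch, assuming accuracy as the score and a hypothetical `predict` function standing in for a fitted model; it is not the implementation used by this step.

```python
import random

def accuracy(predict, rows, targets):
    """Fraction of rows for which the prediction matches the target."""
    return sum(predict(r) == t for r, t in zip(rows, targets)) / len(rows)

def permutation_importance(predict, rows, targets, n_repeats=5, seed=0):
    """Mean drop in accuracy after shuffling each feature column in turn."""
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, targets)
    importances = []
    for j in range(len(rows[0])):          # k features ...
        drops = []
        for _ in range(n_repeats):         # ... times n repetitions
            column = [r[j] for r in rows]
            rng.shuffle(column)
            shuffled = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
            drops.append(baseline - accuracy(predict, shuffled, targets))
        importances.append(sum(drops) / n_repeats)
    return importances

# Toy "model" (hypothetical): it only ever looks at feature 0.
def predict(row):
    return int(row[0] > 0.5)

data_rng = random.Random(42)
rows = [[data_rng.random(), data_rng.random()] for _ in range(200)]
targets = [predict(r) for r in rows]
importances = permutation_importance(predict, rows, targets)
print(importances)
```

Because the toy model ignores feature 1, shuffling that column cannot change any prediction, so its importance is zero, while shuffling feature 0 lowers the accuracy noticeably.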

encode_features
boolean
default: "true"

Toggle encoding of feature columns. When enabled, Graphext will automatically convert columns of any type to a numeric type before fitting the model. How this conversion is done can be configured using the feature_encoder option below.

If disabled, any model trained in this step will assume that input data is already in an appropriate format (e.g. numerical and not containing any missing values).

feature_encoder
[null, object]

Configures encoding of feature columns. By default (null), Graphext chooses automatically how to convert any column types the model may not understand natively to a numeric type.

A configuration object can be passed instead to override the default values of specific parameters.

include_text_features
boolean

Whether to include or ignore text columns during the processing of input data. Enabling this will convert texts to their TF-IDF representation: each text is converted to an N-dimensional vector in which each component measures the “over-representation” of a specific word (or n-gram) in that text relative to its overall frequency in the whole dataset. This is disabled by default because it will often be better to convert texts explicitly using a previous step, such as embed_text or embed_text_with_model.
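
To make the TF-IDF representation more concrete, here is a rough sketch using single words only. The weighting variant (relative term frequency times a smoothed log inverse document frequency) is one of several in common use and is an assumption here; the exact variant, tokenization and n-gram range Graphext applies are not specified above. The function name `tfidf` is hypothetical.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return the vocabulary and one TF-IDF vector per (non-empty) document."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(tokenized)
    # Document frequency: in how many documents each word occurs.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # Assumed variant: words that are rare across the dataset get a higher weight.
    idf = {tok: math.log(n_docs / df[tok]) + 1.0 for tok in vocab}
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append([tf[tok] / len(toks) * idf[tok] for tok in vocab])
    return vocab, vectors

docs = ["great product great price", "bad product", "great service"]
vocab, vectors = tfidf(docs)
for doc, vec in zip(docs, vectors):
    print(doc, [round(v, 2) for v in vec])
```

Each document becomes a vector with one component per vocabulary word. In the first document, "price" receives a higher weight than "product" although both occur once, because "price" is rarer across the dataset.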

params
object

CatBoost configuration parameters. For more details about CatBoost’s parameters, see the official documentation here.

validate
[object, null]

Configure model validation. Allows evaluation of model performance via cross-validation using custom metrics. If not specified, 5-fold cross-validation with automatically selected metrics is performed by default.
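
As a reminder of what 5-fold cross-validation involves, the sketch below generates the train/test index splits. It is a simplified illustration with contiguous, unshuffled folds; whether this step shuffles or stratifies the folds is not stated above, and the function name `k_fold_indices` is hypothetical.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_idx, test_idx) pairs for k contiguous, near-equal folds."""
    # The first (n_samples % k) folds get one extra sample.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size

folds = list(k_fold_indices(12, k=5))
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```

Every sample is used for testing exactly once and for training in the remaining four folds; the reported metric would typically be the average of the five test scores.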

tune
object

Configure hyperparameter tuning. Configures the optimization of model hyperparameters via cross-validated grid or randomized search.
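
The difference between grid and randomized search lies in how candidate parameter combinations are generated. The sketch below illustrates only that step, under the assumption of a small search space over CatBoost's `depth` and `learning_rate` parameters; the helper names are hypothetical, and in practice each candidate would then be scored via cross-validation.

```python
import itertools
import random

def grid_candidates(space):
    """Grid search: every combination of the listed values."""
    keys = list(space)
    return [dict(zip(keys, values))
            for values in itertools.product(*(space[k] for k in keys))]

def random_candidates(space, n_iter, seed=0):
    """Randomized search: n_iter combinations sampled independently."""
    rng = random.Random(seed)
    return [{k: rng.choice(v) for k, v in space.items()} for _ in range(n_iter)]

space = {"depth": [4, 6, 8], "learning_rate": [0.03, 0.1]}
grid = grid_candidates(space)
sampled = random_candidates(space, n_iter=4)
print(len(grid), "grid candidates;", len(sampled), "random candidates")
```

Grid search cost grows multiplicatively with every added parameter, whereas randomized search keeps the number of evaluated candidates fixed at `n_iter`.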

seed
integer

Seed for the random number generator, ensuring reproducibility.

Values must be in the following range:

0 ≤ seed < inf

sort
object

Sort the data before training. If the data is not already sorted by time, you can sort it here. This is useful when you want to split the data by time, for example to train on older data and test on newer data (see the time_split parameter in validation configurations). If the data is already sorted by time, you can ignore this parameter.
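
The idea of training on older data and testing on newer data can be sketched as follows. This is an illustration only: the column name `signup`, the function `time_ordered_split` and the 25% test fraction are assumptions, and the sketch does not reproduce how the time_split validation option is actually configured.

```python
from datetime import date

def time_ordered_split(rows, time_key, test_fraction=0.25):
    """Sort rows by time, then hold out the most recent rows for testing."""
    ordered = sorted(rows, key=lambda row: row[time_key])
    cut = int(len(ordered) * (1 - test_fraction))
    return ordered[:cut], ordered[cut:]

rows = [
    {"signup": date(2023, 3, 1), "churned": 0},
    {"signup": date(2023, 1, 1), "churned": 1},
    {"signup": date(2023, 4, 1), "churned": 0},
    {"signup": date(2023, 2, 1), "churned": 1},
]
train, test = time_ordered_split(rows, "signup")
print([r["signup"].isoformat() for r in train])
print([r["signup"].isoformat() for r in test])
```

Without the sort, the held-out rows would simply be whichever happen to come last in the dataset, so the model could be evaluated on data older than some of its training data.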