Note that the configuration parameters depend on the specific model trained. In the parameters section below, each allowed model has its own section.

The output will always be a new column with the trained model’s predictions on the training data, as well as a saved and named model file that can be used in other projects for prediction of new data.

A detailed guide on how to configure this step for model tuning and performance evaluation can be found here.

model
string
default: "CatboostRegressor"

Train a CatBoost regressor, i.e. gradient-boosted decision trees with support for categorical variables and missing values.

target
string
required

Target variable (labels). Name of the column that contains your target values (labels).

feature_importance
[string, boolean, null]
default: "native"

Importance of each feature in the model. Whether and how to measure each feature’s contribution to the model’s predictions. The higher the value, the more important the feature was in the model. Only relative values are meaningful, i.e. the importance of a feature relative to other features in the model.

Also note that feature importance is usually meaningful only for models that fit the data well.

The default (null, true or "native") uses the model’s native feature importance measure, e.g. prediction-value-change in the case of CatBoost, Gini importance in the case of scikit-learn’s DecisionTreeClassifier, and the mean of absolute coefficients in the case of logistic regression.

When set to "permutation", uses permutation importance, i.e. measures the decrease in model score when a single feature’s values are randomly shuffled. This is considerably slower than native feature importance (the model needs to be evaluated an additional k*n times, where k is the number of features and n the number of repetitions to average over). On the positive side it is model-agnostic and doesn’t suffer from bias towards high cardinality features (like some tree-based feature importances). On the negative side, it can be sensitive to strongly correlated features, as the unshuffled correlated variable is still available to the model when shuffling the original variable.
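
The shuffling procedure described above can be sketched in plain Python. This is only an illustration under stated assumptions, not Graphext's implementation: `toy_predict` is a hypothetical stand-in for a trained model, R² is chosen arbitrarily as the score, and `n_repeats` plays the role of n.

```python
import random

def r2_score(y_true, y_pred):
    """Coefficient of determination (higher is better)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def permutation_importance(predict, X, y, n_repeats=5, seed=0):
    """Mean drop in R^2 when each feature column is shuffled in turn.

    Needs k * n_repeats extra model evaluations (k = number of features).
    """
    rng = random.Random(seed)
    baseline = r2_score(y, [predict(row) for row in X])
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            # Shuffle only column j; all other columns stay untouched.
            column = [row[j] for row in X]
            rng.shuffle(column)
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            score = r2_score(y, [predict(row) for row in X_perm])
            drops.append(baseline - score)
        importances.append(sum(drops) / n_repeats)
    return importances

# Hypothetical stand-in for a trained model: it only uses feature 0.
def toy_predict(row):
    return 3.0 * row[0]

data_rng = random.Random(42)
X = [[data_rng.random(), data_rng.random()] for _ in range(200)]
y = [3.0 * row[0] for row in X]

importances = permutation_importance(toy_predict, X, y)
print(importances)  # feature 0 gets a large value, feature 1 gets 0.0
```

Because the toy model ignores feature 1, shuffling it leaves the score unchanged and its importance is zero, while shuffling feature 0 destroys the predictions.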

When set to false, no feature importance will be calculated.

Values must be one of the following:

  • True
  • False
  • native
  • permutation
  • null

encode_features
boolean
default: true

Toggle encoding of feature columns. When enabled, Graphext will automatically convert columns of any type to a numeric type before fitting the model. How this conversion is done can be configured using the feature_encoder option below.

If disabled, any model trained in this step will assume that input data is already in an appropriate format (e.g. numerical and not containing any missing values).

feature_encoder
[null, object]

Configures encoding of feature columns. By default (null), Graphext chooses automatically how to convert any column types the model may not understand natively to a numeric type.

A configuration object can be passed instead to overwrite specific parameter values with respect to their default values.

include_text_features
boolean

Whether to include or ignore text columns during the processing of input data. Enabling this will convert texts to their TF-IDF representation: each text is converted to an N-dimensional vector in which each component measures the “over-representation” of a specific word (or n-gram) in that text relative to its overall frequency in the whole dataset. This is disabled by default because it will often be better to convert texts explicitly using a previous step, such as embed_text or embed_text_with_model.
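
To make the idea of "over-representation" concrete, here is a minimal sketch of one common TF-IDF variant (term frequency times log(N / document frequency)). The exact weighting, tokenization and n-gram handling Graphext uses are not documented here, so treat the function `tfidf_vectors` and its formula as assumptions for illustration only.

```python
import math
from collections import Counter

def tfidf_vectors(texts):
    """Return (vocabulary, vectors) using tf * log(N / df) weighting."""
    tokenized = [text.lower().split() for text in texts]
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    n_docs = len(tokenized)
    # Document frequency: in how many texts does each word occur?
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero for empty texts
        vectors.append([
            (counts[word] / total) * math.log(n_docs / doc_freq[word])
            for word in vocabulary
        ])
    return vocabulary, vectors

texts = ["red car", "blue car", "red bike fast bike"]
vocabulary, vectors = tfidf_vectors(texts)
for text, vector in zip(texts, vectors):
    print(text, [round(v, 3) for v in vector])
```

Words that are frequent in one text but rare across the dataset (here "bike" in the third text) receive the largest weights, while words absent from a text get a weight of zero.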

params
object

CatBoost configuration parameters. See the official CatBoost documentation for details on each parameter.

validate
[object, null]

Configure model validation. Allows evaluation of model performance via cross-validation using custom metrics. If not specified, 5-fold cross-validation is performed with automatically selected metrics.
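
For readers unfamiliar with k-fold cross-validation, the following sketch shows the general technique with k = 5. It is not Graphext's implementation: `fit_mean` is a hypothetical stand-in model that always predicts the training mean, and mean absolute error is an arbitrary choice of metric.

```python
import random

def k_fold_indices(n_rows, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs; each row lands in exactly one test fold."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i, test_idx in enumerate(folds):
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, test_idx

def cross_validate(fit, score, X, y, k=5, seed=0):
    """Fit on k-1 folds, score on the held-out fold, and return the k scores."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(X), k, seed):
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        scores.append(score(model, [X[i] for i in test_idx], [y[i] for i in test_idx]))
    return scores

# Hypothetical stand-in model: always predicts the mean of the training targets.
def fit_mean(X_train, y_train):
    return sum(y_train) / len(y_train)

def mean_absolute_error(model, X_test, y_test):
    return sum(abs(t - model) for t in y_test) / len(y_test)

X = [[float(i)] for i in range(20)]
y = [2.0 * i for i in range(20)]
scores = cross_validate(fit_mean, mean_absolute_error, X, y, k=5)
print(scores)
```

Each row is used for testing exactly once, so the five scores together estimate how the model performs on data it was not trained on.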

tune
object

Configure hypertuning. Configures the optimization of model hyper-parameters via cross-validated grid- or randomized search.
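
The difference between grid search and randomized search can be sketched as follows. This is a simplified illustration, not the tuning code of this step: `toy_score` is a hypothetical stand-in for a cross-validated model score, and the search space values are arbitrary (`depth` and `learning_rate` are CatBoost parameter names).

```python
import itertools
import random

def grid_candidates(space):
    """Every combination of the listed values (exhaustive grid search)."""
    names = sorted(space)
    for values in itertools.product(*(space[name] for name in names)):
        yield dict(zip(names, values))

def random_candidates(space, n_iter, seed=0):
    """n_iter combinations, each sampled independently (randomized search)."""
    rng = random.Random(seed)
    for _ in range(n_iter):
        yield {name: rng.choice(values) for name, values in sorted(space.items())}

def search(candidates, evaluate):
    """Return the candidate with the highest score."""
    return max(candidates, key=evaluate)

# Illustrative search space; the values are arbitrary.
space = {"depth": [4, 6, 8], "learning_rate": [0.03, 0.1, 0.3]}

# Hypothetical scoring function standing in for a cross-validated score.
def toy_score(params):
    return -abs(params["depth"] - 6) - abs(params["learning_rate"] - 0.1)

best_grid = search(grid_candidates(space), toy_score)
best_random = search(random_candidates(space, n_iter=4), toy_score)
print(best_grid, best_random)
```

Grid search evaluates all 9 combinations here and is guaranteed to find the best one in the grid; randomized search evaluates only `n_iter` of them, trading that guarantee for lower cost on large search spaces.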

seed
integer

Seed for random number generator ensuring reproducibility.

Values must be in the following range:

0 ≤ seed < inf
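
As a rough summary of the constraints documented above, the sketch below checks an options object before it is passed to the step. The function `check_options`, the Python dict representation and the column name "price" are hypothetical and only meant for illustration; the checked rules (required target, allowed feature_importance values, non-negative integer seed) are the ones listed in this section.

```python
def is_allowed_importance(value):
    """Allowed values: true, false, null, "native" or "permutation"."""
    if value is None or isinstance(value, bool):
        return True
    return isinstance(value, str) and value in ("native", "permutation")

def check_options(options):
    """Raise ValueError if a documented constraint is violated; otherwise return options."""
    if not isinstance(options.get("target"), str):
        raise ValueError("target is required and must be a column name")
    if not is_allowed_importance(options.get("feature_importance", "native")):
        raise ValueError("feature_importance must be true, false, null, 'native' or 'permutation'")
    seed = options.get("seed")
    if seed is not None and (isinstance(seed, bool) or not isinstance(seed, int) or seed < 0):
        raise ValueError("seed must be an integer >= 0")
    return options

options = check_options({
    "model": "CatboostRegressor",
    "target": "price",  # hypothetical column name
    "feature_importance": "permutation",
    "seed": 42,
})
print(options)
```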