Non-missing values in a categorical target column are used to train a prediction model (a CatBoost classifier), which then predicts (imputes) the missing values. The step produces two output columns: one containing the predicted class for every row, and a second containing the probability of each predicted class.

target
string
required

Name of the categorical column to impute. The step will predict the missing (and non-missing) class labels for this column, using rows in the dataset where the values are not missing to train the prediction model.
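
The split described above can be sketched in plain Python. This is only an illustration: the rows, the column names and the `MajorityClassModel` stand-in are hypothetical, and the stand-in deliberately replaces the CatBoost classifier the step actually uses with the simplest possible "model".

```python
from collections import Counter


def split_by_missing(rows, target):
    """Separate rows with a known target (training) from rows to impute."""
    train = [r for r in rows if r.get(target) is not None]
    to_impute = [r for r in rows if r.get(target) is None]
    return train, to_impute


class MajorityClassModel:
    """Hypothetical stand-in for the real classifier: predicts the most common class."""

    def fit(self, labels):
        counts = Counter(labels)
        self.label, n = counts.most_common(1)[0]
        # Share of training rows carrying the majority label, used as a crude probability.
        self.proba = n / len(labels)
        return self

    def predict(self):
        return self.label, self.proba


# Hypothetical dataset: "colour" is the target, None marks a missing value.
rows = [
    {"size": 1.0, "colour": "red"},
    {"size": 2.0, "colour": "red"},
    {"size": 3.0, "colour": "blue"},
    {"size": 4.0, "colour": None},
]

train, to_impute = split_by_missing(rows, "colour")
model = MajorityClassModel().fit([r["colour"] for r in train])
imputed_label, imputed_proba = model.predict()
```

Only the three rows with a known `colour` reach `fit`; the fourth row is the one that receives the imputed label and its probability.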

infer_all
boolean

Whether to predict non-missing values as well. When set to true, all values are predicted. Set this parameter to false to keep the original values where they are not missing.

threshold
number

Confidence threshold. Every prediction with probability strictly below this threshold will be set to NaN (missing).

Values must be in the following range:

0 ≤ threshold < 1
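
How infer_all and threshold might interact can be sketched as follows. The function name `finalize`, its default values and the order of operations are assumptions made for illustration; the documentation does not state whether the threshold is also applied to original values that are kept.

```python
import math


def finalize(original, predicted, probability, infer_all=False, threshold=0.0):
    """Combine original values and model predictions into one output column.

    Assumed order of operations (not confirmed by the documentation):
    1. a prediction with probability strictly below `threshold` becomes NaN;
    2. with infer_all=False, non-missing original values are kept unchanged.
    """
    out = []
    for orig, pred, proba in zip(original, predicted, probability):
        value = pred if proba >= threshold else math.nan
        if not infer_all and orig is not None:
            value = orig
        out.append(value)
    return out


# Hypothetical inputs: None marks a missing value in the target column.
original = ["a", None, "b", None]
predicted = ["a", "b", "a", "a"]
probability = [0.9, 0.8, 0.6, 0.4]

kept = finalize(original, predicted, probability, infer_all=False, threshold=0.5)
overwritten = finalize(original, predicted, probability, infer_all=True, threshold=0.7)
```

In `kept`, the original "b" in the third row survives even though the model predicted "a", and the last row stays missing because its probability (0.4) is below 0.5. In `overwritten`, every row is predicted, so the third and fourth rows both become NaN under the stricter 0.7 threshold.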

params
object

CatBoost configuration parameters. See the official CatBoost documentation for details about the available parameters.
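
As a rough illustration, a params object could look like the dictionary below. The keys are standard CatBoostClassifier arguments, but the values are arbitrary, and whether this step forwards every one of them unchanged is an assumption.

```python
# Hypothetical values; the keys are regular CatBoostClassifier arguments.
params = {
    "iterations": 500,       # number of boosting rounds (trees)
    "learning_rate": 0.05,   # step size applied to each new tree
    "depth": 6,              # depth of each tree
    "l2_leaf_reg": 3.0,      # L2 regularization on leaf values
}
```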

encode_features
boolean
default: "true"

Toggle encoding of feature columns. When enabled, Graphext will auto-convert any column types to the numeric type before fitting the model. How this conversion is done can be configured using the feature_encoder option below.

If disabled, any model trained in this step will assume that input data is already in an appropriate format (e.g. numerical and not containing any missing values).

feature_encoder
[null, object]

Configures encoding of feature columns. By default (null), Graphext chooses automatically how to convert any column types the model may not understand natively to a numeric type.

A configuration object can be passed instead to overwrite specific parameter values with respect to their default values.

include_text_features
boolean

Whether to include or ignore text columns when processing the input data. Enabling this converts texts to their TF-IDF representation: each text becomes an N-dimensional vector in which each component measures the "over-representation" of a specific word (or n-gram) relative to its overall frequency in the whole dataset. This is disabled by default because it is often better to convert texts explicitly in a previous step, such as embed_text or embed_text_with_model.
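
The idea behind the TF-IDF representation can be sketched in a few lines. This is a textbook variant (term frequency times log inverse document frequency, whitespace tokens, no n-grams); the exact weighting and tokenization used by the step are not documented here, so treat every detail as an assumption.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return (vocabulary, vectors) with one TF-IDF vector per document."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(tokenized)
    # Document frequency: in how many documents each token appears.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {tok: math.log(n_docs / df[tok]) for tok in vocab}

    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        total = len(toks) or 1  # avoid division by zero for empty texts
        vectors.append([counts[tok] / total * idf[tok] for tok in vocab])
    return vocab, vectors


# Hypothetical text column with three short documents.
docs = ["red apple", "green apple", "red car"]
vocab, vectors = tfidf(docs)
```

Each document becomes a vector with one component per vocabulary entry. In the third document, "car" gets a larger weight than "red" because "car" occurs in only one document while "red" occurs in two.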

validate
[object, null]

Configure model validation. Allows evaluation of model performance via cross-validation with custom metrics. If not specified, will by default perform 5-fold cross-validation with automatically selected metrics.
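
For readers unfamiliar with k-fold cross-validation, the default 5-fold scheme can be sketched as below. The function `k_fold_indices` is hypothetical and uses a plain shuffled split; the step may well use a stratified variant, which this sketch does not attempt to reproduce.

```python
import random


def k_fold_indices(n_rows, k=5, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of (almost) equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# Hypothetical dataset of 10 rows with a known target.
splits = list(k_fold_indices(10, k=5, seed=42))
```

Every row appears in exactly one test fold, so each row is scored once by a model that never saw it during training; the validation metrics are then aggregated over the five folds.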

tune
object

Configure hyperparameter tuning. Configures the optimization of model hyperparameters via cross-validated grid search or randomized search.
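
The difference between the two search strategies can be sketched as follows. The helper names and the search space are hypothetical; in practice each candidate would be scored with cross-validation and the best one kept, which is omitted here.

```python
import itertools
import random


def grid_candidates(space):
    """Grid search: every combination of the listed values."""
    keys = list(space)
    combos = itertools.product(*(space[k] for k in keys))
    return [dict(zip(keys, combo)) for combo in combos]


def random_candidates(space, n_iter, seed=0):
    """Randomized search: n_iter combinations drawn at random (with replacement)."""
    rng = random.Random(seed)
    return [{k: rng.choice(v) for k, v in space.items()} for _ in range(n_iter)]


# Hypothetical search space over two CatBoost parameters.
space = {"depth": [4, 6, 8], "learning_rate": [0.03, 0.1]}

grid = grid_candidates(space)                      # 3 * 2 = 6 candidates
sampled = random_candidates(space, n_iter=3, seed=1)  # 3 candidates
```

Grid search cost grows multiplicatively with every added parameter, whereas randomized search keeps the number of trained models fixed at `n_iter`, at the price of not covering the whole space.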

seed
integer

Seed for random number generator ensuring reproducibility.

Values must be in the following range:

0 ≤ seed < inf