Train and use a machine learning model to predict (impute) the missing values in a column.
Examples

Inputs and outputs may be columns (e.g. ds.first_name), datasets (ds or ds[["first_name", "last_name"]]) or models (referenced by name, e.g. "churn-clf").

Inputs
Outputs

step(..., {"param": "value", ...}) -> (output)
Parameters

Properties
- Forbidden
- Min
- Max
- Ordered
- Plain
feature_encoder

If null (or not provided), Graphext chooses automatically how to convert any column types the model may not understand natively to a numeric type. A configuration object can be passed instead to overwrite specific parameter values with respect to their default values.

Properties
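As a sketch of the idea only (the keys shown here are illustrative, based on the option names documented below, not an authoritative schema), a configuration object overriding a few defaults might look like:

```json
{
  "number": {"imputer": "Median", "scaler": "Standard"},
  "category": {"imputer": "MostFrequent", "encoder": "OneHot"}
}
```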
Options (imputer)
- Mean
- Median
- MostFrequent
- Const
- None

Options (scaler)
- Standard
- Robust
- KNN
- None

Parameters to be passed to the scaler function. Details depend on the particular scaler used.
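The imputer and scaler names above mirror common scikit-learn components. Purely as an illustration of the impute-then-scale idea (not Graphext's internal implementation), a Mean imputer followed by a Standard scaler behaves like:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Numeric column with a missing value: the imputer fills it with the
# column mean, the scaler then standardizes to zero mean / unit variance.
X = np.array([[1.0], [2.0], [np.nan], [4.0]])
pipe = make_pipeline(SimpleImputer(strategy="mean"), StandardScaler())
Xt = pipe.fit_transform(X)  # no NaNs remain; column mean is 0
```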
Options (imputer)
- MostFrequent
- Const
- None

Options (encoder)
- OneHot
- Label
- Ordinal
- Binary
- Frequency
- None

Options (scaler)
- Standard
- Robust
- KNN
- None
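Again as a hedged scikit-learn analogy (not necessarily what runs internally), a MostFrequent imputer followed by a OneHot encoder corresponds to:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Categorical column with a missing entry.
X = np.array([["red"], ["blue"], [np.nan], ["red"]], dtype=object)

# Fill missing values with the most frequent category ("red") ...
X_imp = SimpleImputer(strategy="most_frequent").fit_transform(X)

# ... then one-hot encode the remaining two categories.
X_enc = OneHotEncoder(handle_unknown="ignore").fit_transform(X_imp).toarray()
# X_enc has shape (4, 2): one indicator column per distinct category
```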
Configures the encoding of multilabel columns (lists of categories, or list[category] for short). May contain either a single configuration for all multilabel variables, or two different configurations for low- and high-cardinality variables. For further details pick one of the two options below.

Options (encoder)
- Binarizer
- TfIdf
- None

Options (scaler)
- Euclidean
- KNN
- Norm
- None
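The Binarizer option above resembles scikit-learn's MultiLabelBinarizer (the TfIdf variant would additionally weight labels by their frequency). As an illustrative sketch:

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Each row is a list of category labels (a list[category] column).
rows = [["news", "sports"], ["sports"], [], ["news", "tech"]]
mlb = MultiLabelBinarizer()
M = mlb.fit_transform(rows)  # one 0/1 indicator column per distinct label
```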
Properties

Array items
- day
- dayofweek
- dayofyear
- hour
- minute
- month
- quarter
- season
- second
- week
- weekday
- weekofyear
- year

Array items
- day
- dayofweek
- dayofyear
- hour
- month
Options (imputer)
- Mean
- Median
- MostFrequent
- Const
- None

Options (scaler)
- Standard
- Robust
- KNN
- None

Options
- Euclidean
- KNN
- Norm
- None
Configures the encoding of embedding columns (lists of numbers, or list[number] for short). Text columns are not encoded by default; see the include_text_features parameter below to activate it.

Properties

Options
- Euclidean
- KNN
- Norm
- None

Text columns are embedded using embed_text or embed_text_with_model.

Properties
Repeatedly split the dataset into train and test sets n_splits times, train on the former and evaluate on the latter using specified or automatically selected metrics.

If test_size is null or not provided, will use cross-validation to split the dataset. E.g. if n_splits is 5, the dataset will be split into 5 equal-sized parts. For five iterations four parts will then be used for training and the remaining part for testing. If test_size is a number between 0 and 1, in contrast, validation is done using a shuffle-split approach. Here, instead of splitting the data into n_splits equal parts up front, in each iteration we randomize the data and sample a proportion equal to test_size to use for evaluation and the remaining rows for training.

Values must be in the following range:

Options
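The two validation modes described above correspond to k-fold cross-validation and shuffle-split validation. Assuming scikit-learn semantics as an illustration:

```python
import numpy as np
from sklearn.model_selection import KFold, ShuffleSplit

X = np.arange(20).reshape(-1, 1)  # 20 rows

# test_size null: k-fold cross-validation with n_splits equal parts.
folds = list(KFold(n_splits=5).split(X))
# 5 iterations, each holding out 20 / 5 = 4 rows for testing

# test_size between 0 and 1: shuffle-split, sampling that proportion each time.
splits = list(ShuffleSplit(n_splits=5, test_size=0.25, random_state=0).split(X))
# 5 iterations, each holding out 25% of 20 = 5 randomly sampled rows
```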
Properties
- Forbidden
- Min
- Max
- Ordered
- Plain
Exhaustive grid search tries every combination of the parameter values listed in params. Randomized search, on the other hand, randomly samples iterations parameter combinations from the distributions specified in params.

Values must be one of the following:
- grid
- random

Properties
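In scikit-learn terms (an analogy for illustration, not necessarily the underlying implementation), grid corresponds to GridSearchCV and random to RandomizedSearchCV with a fixed number of sampled iterations:

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

X, y = make_classification(n_samples=100, random_state=0)
model = LogisticRegression(max_iter=1000)

# Grid search: tries every listed parameter value (3 combinations here).
grid = GridSearchCV(model, {"C": [0.1, 1.0, 10.0]}, cv=3).fit(X, y)

# Randomized search: samples 5 combinations from the given distribution.
rand = RandomizedSearchCV(
    model, {"C": loguniform(1e-3, 1e2)}, n_iter=5, cv=3, random_state=0
).fit(X, y)
```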
Repeatedly split the dataset into train and test sets n_splits times, train on the former and evaluate on the latter using specified or automatically selected metrics.

If test_size is null or not provided, will use cross-validation to split the dataset. E.g. if n_splits is 5, the dataset will be split into 5 equal-sized parts. For five iterations four parts will then be used for training and the remaining part for testing. If test_size is a number between 0 and 1, in contrast, validation is done using a shuffle-split approach. Here, instead of splitting the data into n_splits equal parts up front, in each iteration we randomize the data and sample a proportion equal to test_size to use for evaluation and the remaining rows for training.

Values must be in the following range:

Options
- accuracy
- balanced_accuracy
- explained_variance
- f1_micro
- f1_macro
- f1_samples
- f1_weighted
- neg_mean_squared_error
- neg_median_absolute_error
- neg_root_mean_squared_error
- precision_micro
- precision_macro
- precision_samples
- precision_weighted
- recall_micro
- recall_macro
- recall_samples
- recall_weighted
- r2
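These metric names match scikit-learn's scoring identifiers. As a hedged illustration of how such a name resolves to a metric:

```python
from sklearn.dummy import DummyClassifier
from sklearn.metrics import get_scorer

X = [[0], [1], [2], [3]]
y = [0, 0, 0, 1]

# A trivial classifier that always predicts the most frequent class (0).
clf = DummyClassifier(strategy="most_frequent").fit(X, y)

# Look up a scorer by name, as the metric strings above would be resolved.
acc = get_scorer("accuracy")(clf, X, y)  # 3 of 4 predictions correct -> 0.75
```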