Train and store a classification model to be loaded at a later point for prediction.
Examples
model parameter (see below for details):
columns (ds.first_name), datasets (ds or ds[["first_name", "last_name"]]) or models (referenced by name, e.g. "churn-clf").
Inputs
Outputs
step(..., {"param": "value", ...}) -> (output)
Parameters
When set to null, true or "native", uses the classifier’s native feature importance measure, e.g. prediction-value-change in the case of CatBoost, Gini importance in the case of scikit-learn’s DecisionTreeClassifier, and the mean of absolute coefficients in the case of logistic regression.
When set to "permutation", uses permutation importance, i.e. measures the decrease in model score when a single feature’s values are randomly shuffled. This is considerably slower than native feature importance (the model needs to be evaluated an additional k*n times, where k is the number of features and n the number of repetitions to average over). On the positive side, it is model-agnostic and doesn’t suffer from bias towards high-cardinality features (as some tree-based feature importances do). On the negative side, it can be sensitive to strongly correlated features, as the unshuffled correlated variable is still available to the model when shuffling the original variable.
When set to false, no feature importance will be calculated.
Values must be one of the following:
True
False
native
permutation
null
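The permutation approach described above can be sketched in a few lines of plain Python. This is an illustrative sketch only, not Graphext's implementation: the function names, the toy model and the use of accuracy as the score are all assumptions made for the example. It also counts model evaluations to show where the additional k*n evaluations come from.

```python
import random

def accuracy(predict, rows, labels):
    """Fraction of rows for which the model predicts the correct label."""
    hits = sum(1 for row, label in zip(rows, labels) if predict(row) == label)
    return hits / len(labels)

def permutation_importance(predict, rows, labels, n_repeats=10, seed=0):
    """Mean drop in accuracy when one feature's values are shuffled (hypothetical helper)."""
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, labels)
    importances = []
    evaluations = 0  # additional model evaluations beyond the baseline
    for j in range(len(rows[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in rows]
            rng.shuffle(column)
            # Replace feature j with its shuffled values, keep all other features intact.
            shuffled = [row[:j] + [value] + row[j + 1:] for row, value in zip(rows, column)]
            drops.append(baseline - accuracy(predict, shuffled, labels))
            evaluations += 1
        importances.append(sum(drops) / n_repeats)
    return importances, evaluations

# Toy data: the label depends only on the first feature; the second is noise.
rows = [[0, 5], [0, 3], [0, 8], [0, 1], [1, 7], [1, 2], [1, 9], [1, 4]]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

def toy_predict(row):
    # Stand-in for a trained classifier (assumption for this example).
    return int(row[0] >= 0.5)

importances, evaluations = permutation_importance(toy_predict, rows, labels, n_repeats=10)
print(importances, evaluations)
```

With k = 2 features and n = 10 repetitions the model is evaluated 20 additional times, and the noise feature receives an importance of zero because shuffling it never changes a prediction.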
feature_encoder option below.
When set to null, Graphext chooses automatically how to convert any column types the model may not understand natively to a numeric type. A configuration object can be passed instead to overwrite specific parameter values with respect to their default values.
Properties
Properties
Mean
Median
MostFrequent
Const
None
Standard
Robust
KNN
None
scaler function.
Details depend on the particular scaler used.
Options
MostFrequent
Const
None
OneHot
Label
Ordinal
Binary
Frequency
None
Standard
Robust
KNN
None
list[category] for short). May contain either a single configuration for all multilabel variables, or two different configurations for low- and high-cardinality variables. For further details, pick one of the two options below.
Options
Binarizer
TfIdf
None
Euclidean
KNN
Norm
None
Properties
Array items
day
dayofweek
dayofyear
hour
minute
month
quarter
season
second
week
weekday
weekofyear
year
Array items
day
dayofweek
dayofyear
hour
month
Mean
Median
MostFrequent
Const
None
Standard
Robust
KNN
None
Euclidean
KNN
Norm
None
list[number] for short).
include_text_features below to activate it.
Properties
Euclidean
KNN
Norm
None
embed_text or embed_text_with_model.
Properties
Forbidden
Min
Max
Ordered
Plain
null is equivalent to 1.0 (all features). You can set this to values < 1.0 when the dataset has many features (e.g. > 20) to speed up training.
Values must be in the following range:
Balanced
SqrtBalanced
None
None
Properties
n_splits times, train on the former and evaluate on the latter using specified or automatically selected metrics.
Values must be in the following range:
null or not provided, will use k-fold cross-validation to split the dataset. E.g. if n_splits is 5, the dataset will be split into 5 equal-sized parts. For five iterations, four parts will then be used for training and the remaining part for testing. If test_size is a number between 0 and 1, in contrast, validation is done using a shuffle-split approach. Here, instead of splitting the data into n_splits equal parts up front, in each iteration we randomize the data and sample a proportion equal to test_size to use for evaluation and the remaining rows for training.
Values must be in the following range:
Array items
accuracy
balanced_accuracy
f1_micro
f1_macro
f1_samples
f1_weighted
precision_micro
precision_macro
precision_samples
precision_weighted
recall_micro
recall_macro
recall_samples
recall_weighted
roc_auc
roc_auc_ovr
roc_auc_ovo
roc_auc_ovr_weighted
roc_auc_ovo_weighted
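The difference between the two splitting strategies described above (k-fold when test_size is null, shuffle-split when it is a number between 0 and 1) can be illustrated with a small sketch. The helper names are hypothetical, and details such as whether rows are shuffled before k-fold splitting are assumptions; this is not Graphext's implementation.

```python
import random

def k_fold_splits(n_rows, n_splits):
    """Split row indices into n_splits near-equal parts; each part is the test set once."""
    indices = list(range(n_rows))
    # Distribute any remainder over the first folds so sizes differ by at most one.
    sizes = [n_rows // n_splits + (1 if i < n_rows % n_splits else 0) for i in range(n_splits)]
    start = 0
    for size in sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

def shuffle_splits(n_rows, n_splits, test_size, seed=0):
    """In each iteration, randomize the rows and sample a test_size proportion for evaluation."""
    rng = random.Random(seed)
    n_test = max(1, round(n_rows * test_size))
    for _ in range(n_splits):
        indices = list(range(n_rows))
        rng.shuffle(indices)
        yield indices[n_test:], indices[:n_test]

k_fold = list(k_fold_splits(10, n_splits=5))
shuffled = list(shuffle_splits(10, n_splits=5, test_size=0.2))

for train, test in k_fold:
    print("k-fold        train:", train, "test:", test)
for train, test in shuffled:
    print("shuffle-split train:", sorted(train), "test:", sorted(test))
```

In the k-fold case every row is used for testing exactly once across the five iterations. In the shuffle-split case the test sets are sampled independently in each iteration, so a row may be evaluated several times or not at all.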
Properties
params.
Randomized search, on the other hand, randomly samples iterations parameter combinations from the distributions specified in params.
Values must be one of the following:
grid
random
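As a rough illustration of how the two strategies generate candidates, the sketch below enumerates every combination for a grid search and draws a fixed number of random combinations for a randomized search. The "depth" values are taken from the example in this page; "learning_rate", the helper names and the uniform sampling from discrete lists are assumptions made for the example, not a description of how the step samples internally.

```python
import itertools
import random

# Hypothetical search space: "depth" follows the example on this page,
# "learning_rate" is added purely for illustration.
params = {"depth": [3, 5, 7], "learning_rate": [0.03, 0.1]}

def grid_candidates(params):
    """Every combination of the listed values (exhaustive)."""
    names = list(params)
    combos = itertools.product(*(params[name] for name in names))
    return [dict(zip(names, combo)) for combo in combos]

def random_candidates(params, iterations, seed=0):
    """A fixed number of combinations, each value drawn uniformly at random."""
    rng = random.Random(seed)
    return [
        {name: rng.choice(values) for name, values in params.items()}
        for _ in range(iterations)
    ]

grid = grid_candidates(params)
sampled = random_candidates(params, iterations=4)

print(len(grid), "grid candidates")
print(len(sampled), "randomly sampled candidates")
```

The grid grows multiplicatively with each added parameter (here 3 x 2 = 6 candidates), while the cost of randomized search is fixed by the number of iterations.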
Properties
n_splits times, train on the former and evaluate on the latter using specified or automatically selected metrics.
Values must be in the following range:
null or not provided, will use k-fold cross-validation to split the dataset. E.g. if n_splits is 5, the dataset will be split into 5 equal-sized parts. For five iterations, four parts will then be used for training and the remaining part for testing. If test_size is a number between 0 and 1, in contrast, validation is done using a shuffle-split approach. Here, instead of splitting the data into n_splits equal parts up front, in each iteration we randomize the data and sample a proportion equal to test_size to use for evaluation and the remaining rows for training.
Values must be in the following range:
Array items
accuracy
balanced_accuracy
f1_micro
f1_macro
f1_samples
f1_weighted
precision_micro
precision_macro
precision_samples
precision_weighted
recall_micro
recall_macro
recall_samples
recall_weighted
roc_auc
roc_auc_ovr
roc_auc_ovo
roc_auc_ovr_weighted
roc_auc_ovo_weighted
accuracy
balanced_accuracy
f1_micro
f1_macro
f1_samples
f1_weighted
precision_micro
precision_macro
precision_samples
precision_weighted
recall_micro
recall_macro
recall_samples
recall_weighted
roc_auc
roc_auc_ovr
roc_auc_ovo
roc_auc_ovr_weighted
roc_auc_ovo_weighted
"depth": [3, 5, 7]
Properties
Array items
Array items
Array items
Array items
Array items
Array items
Array items
time_split parameter in validation configurations). If the data is already sorted by time, you can ignore this parameter.
Properties
Array items
If a single true is provided, or no value is specified, all columns will be sorted in ascending order. If a single false is provided, all columns will be sorted in descending order. If an array of booleans is provided, each column will be sorted according to the corresponding boolean value.
Array items
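The three ways of specifying sort direction can be sketched as follows. The helper names and the "year"/"month" columns are hypothetical, and raising an error on a length mismatch is an assumption; the sketch only mirrors the rules stated above.

```python
def normalize_ascending(ascending, n_columns):
    """Expand the sort-direction setting to one boolean per sort column."""
    if ascending is None:            # no value specified: ascending everywhere
        return [True] * n_columns
    if isinstance(ascending, bool):  # a single boolean applies to all columns
        return [ascending] * n_columns
    flags = list(ascending)          # an array: one boolean per column
    if len(flags) != n_columns:
        raise ValueError("need exactly one boolean per sort column")
    return flags

def sort_rows(rows, columns, ascending=None):
    """Sort dict rows by several columns, each with its own direction."""
    flags = normalize_ascending(ascending, len(columns))
    result = list(rows)
    # Python's sort is stable, so sorting by the last key first and the
    # first key last yields a correct multi-column ordering.
    for column, asc in reversed(list(zip(columns, flags))):
        result.sort(key=lambda row: row[column], reverse=not asc)
    return result

rows = [
    {"year": 2021, "month": 3},
    {"year": 2020, "month": 7},
    {"year": 2021, "month": 1},
]

default_order = sort_rows(rows, ["year", "month"])
all_descending = sort_rows(rows, ["year", "month"], ascending=False)
mixed = sort_rows(rows, ["year", "month"], ascending=[True, False])
print(default_order, all_descending, mixed, sep="\n")
```

Here the mixed setting sorts years in ascending order and, within each year, months in descending order.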