Train and store a machine learning model to be loaded at a later point for prediction.
Can be used in supervised mode (providing a target column as parameter) or unsupervised (without target; the result can then be used with link_embeddings in the latter case). The output will always be a new column with the trained model’s predictions on the training data, as well as a saved and named model file that can be used in other projects for prediction of new data.
Examples
Step parameters may reference columns (e.g. ds.first_name), datasets (ds or ds[["first_name", "last_name"]]) or models (referenced by name, e.g. "churn-clf").
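For illustration, a call combining these reference styles might look as follows. This is a sketch only: the step name train_model, the parameter names target, features and model, and the output column predicted are placeholders, chosen just to show a column reference, a dataset reference and a model name.

train_model(ds, {
  "target": ds.churned,
  "features": ds[["first_name", "last_name"]],
  "model": "churn-clf"
}) -> (ds.predicted)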
Inputs and outputs
The general pattern for using the step in a recipe is:
step(..., {"param": "value", ...}) -> (output)
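As a sketch of the two modes described above, using only the generic pattern (the parameter name target and the output column names are hypothetical):

Supervised, providing a target column:
step(ds, {"target": ds.churned}) -> (ds.predicted)

Unsupervised, without a target (the result can then be used with link_embeddings):
step(ds, {}) -> (ds.cluster)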
Parameters
feature_encoder
Configures how the columns of the input dataset are encoded as features for the model. When left unconfigured (null), Graphext chooses automatically how to convert any column types the model may not understand natively to a numeric type. A configuration object can be passed instead to overwrite specific parameter values with respect to their default values; for further details pick the corresponding option below.
Properties
Number columns
Configures how numeric columns are prepared for the model.
Properties
Imputer
Values must be one of the following:
Mean
Median
MostFrequent
Const
None
Scaler
Values must be one of the following:
Standard
Robust
KNN
None
Scaler parameters
Further parameters passed to the scaler function. Details depend on the particular scaler used.
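As a sketch, the numeric options above could be combined into a feature_encoder configuration like the following. The nesting and the key names number, imputer, scaler and scaler_params are assumptions made for illustration; only the option values come from the lists above, and the quantile_range entry assumes the Robust scaler accepts scikit-learn-style RobustScaler parameters.

{
  "feature_encoder": {
    "number": {
      "imputer": "Median",
      "scaler": "Robust",
      "scaler_params": {"quantile_range": [25.0, 75.0]}
    }
  }
}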
Category columns
Configures how categorical columns are prepared for the model.
Properties
Imputer
Values must be one of the following:
MostFrequent
Const
None
Encoder
Values must be one of the following:
OneHot
Label
Ordinal
Binary
Frequency
None
Scaler
Values must be one of the following:
Standard
Robust
KNN
None
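A corresponding sketch for categorical columns, again assuming the key names category, imputer, encoder and scaler:

{
  "feature_encoder": {
    "category": {
      "imputer": "MostFrequent",
      "encoder": "OneHot",
      "scaler": "None"
    }
  }
}

Since one-hot encoding creates one binary feature per category, an encoder such as Binary or Frequency may be a better fit for high-cardinality columns.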
Multilabel columns
Configures how multilabel columns, i.e. lists of categories (list[category] for short), are prepared for the model. May contain either a single configuration for all multilabel variables, or two different configurations for low- and high-cardinality variables. For further details pick one of the two options below.
Options
Encoder
Values must be one of the following:
Binarizer
TfIdf
None
Scaler
Values must be one of the following:
Euclidean
KNN
Norm
None
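To illustrate the two forms mentioned above, a single configuration applying to all multilabel columns might look like the first sketch below, and a pair of configurations for low- and high-cardinality columns like the second. The key names multilabel, encoder, scaler, low_cardinality and high_cardinality are assumptions made for illustration only.

{
  "feature_encoder": {
    "multilabel": {"encoder": "Binarizer", "scaler": "Norm"}
  }
}

{
  "feature_encoder": {
    "multilabel": {
      "low_cardinality": {"encoder": "Binarizer", "scaler": "Norm"},
      "high_cardinality": {"encoder": "TfIdf", "scaler": "Euclidean"}
    }
  }
}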
Datetime columns
Configures how datetime columns are prepared for the model.
Properties
Components
Array items. Values must be one of the following:
day
dayofweek
dayofyear
hour
minute
month
quarter
season
second
week
weekday
weekofyear
year
Cyclical components
Array items. Values must be one of the following:
day
dayofweek
dayofyear
hour
month
Imputer
Values must be one of the following:
Mean
Median
MostFrequent
Const
None
Scaler
Values must be one of the following:
Standard
Robust
KNN
None
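A sketch of a datetime configuration combining the options above. The key names datetime, components, cyclical, imputer and scaler are assumptions made for illustration; the component names themselves come from the lists above.

{
  "feature_encoder": {
    "datetime": {
      "components": ["year", "month", "dayofweek", "hour"],
      "cyclical": ["month", "dayofweek", "hour"],
      "imputer": "MostFrequent",
      "scaler": "Standard"
    }
  }
}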
Embedding columns
Configures how embedding columns, i.e. vectors of numbers (list[number] for short), are prepared for the model.
Properties
Scaler
Values must be one of the following:
Euclidean
KNN
Norm
None
Text columns
Disabled by default; see include_text_features below to activate it. Text columns are embedded using embed_text or embed_text_with_model.
Properties
Scaler
Values must be one of the following:
Euclidean
KNN
Norm
None
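Since text features are disabled by default, a sketch enabling them alongside an embedding configuration could look as follows. include_text_features is the parameter named above; the key names embedding, text and scaler are assumptions made for illustration.

{
  "include_text_features": true,
  "feature_encoder": {
    "embedding": {"scaler": "Norm"},
    "text": {"scaler": "Norm"}
  }
}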
The distance metric to use. Values must be one of the following:
euclidean
manhattan
chebyshev
minkowski
canberra
braycurtis
haversine
mahalanobis
wminkowski
seuclidean
cosine
correlation
hamming
jaccard
dice
russellrao
kulsinski
rogerstanimoto
sokalmichener
sokalsneath
yule
The number of training epochs. If null is specified, a value will be selected based on the size of the input dataset (200 for large datasets, 500 for small).
How to initialize the low-dimensional embedding. “pca” uses the first n_components from a principal component analysis. “tswspectral” is a cheaper alternative to “spectral”. When “random”, assigns initial embedding positions at random. This uses the least amount of memory and time but may make UMAP slower to converge on the optimal embedding. Values must be one of the following:
spectral
pca
tswspectral
random
Setting this option to true is more computationally expensive, but avoids excessive memory use.
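Putting the remaining options together, a parameter object tuning the embedding behaviour might look like the following sketch. The names metric, n_epochs, init and low_memory are not spelled out on this page; they follow common UMAP naming and are assumptions here, while the values come from the lists and descriptions above.

{
  "metric": "cosine",
  "init": "tswspectral",
  "n_epochs": 500,
  "low_memory": true
}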