Reduce the dataset to 2 dimensions that can be mapped to x/y node positions.

This step is essentially a combination of vectorize_dataset and embed_dataset, the only difference being that the number of dimensions (output columns) is fixed to 2 (corresponding to x and y positions).
Examples
The following example converts the dataset ds to purely numerical form, reduces its dimensionality to just 2 using UMAP, and saves those 2 dimensions in the columns x and y. The way the x and y coordinates are calculated via dimensionality reduction should preserve the similarity between original rows, i.e. rows that are similar in the original dataset should have coordinates close to each other.

Inputs and outputs may be columns (e.g. ds.first_name), datasets (ds or ds[["first_name", "last_name"]]) or models (referenced by name, e.g. "churn-clf").
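Using the generic step(...) syntax from this reference as a placeholder for the step's actual name, the example just described could be written as:

```
step(ds) -> (ds.x, ds.y)
```

Here ds.x and ds.y receive the two calculated coordinates.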
Inputs
Outputs
step(..., {"param": "value", ...}) -> (output)
Parameters
umap
feature_encoder
option below. If left as the default (null), Graphext chooses automatically how to convert any column types the model may not understand natively to a numeric type. A configuration object can be passed instead to overwrite specific parameter values with respect to their default values.

Properties
Properties
Mean
Median
MostFrequent
Const
None
Standard
Robust
KNN
None
scaler
function.
Details depend on the particular scaler used.

Options
MostFrequent
Const
None
OneHot
Label
Ordinal
Binary
Frequency
None
Standard
Robust
KNN
None
list[category]
for short). May contain either a single configuration for
all multilabel variables, or two different configurations for low- and high-cardinality variables.
For further details pick one of the two options below.

Options
Binarizer
TfIdf
None
Euclidean
KNN
Norm
None
Properties
Array items
day
dayofweek
dayofyear
hour
minute
month
quarter
season
second
week
weekday
weekofyear
year
Array items
day
dayofweek
dayofyear
hour
month
Mean
Median
MostFrequent
Const
None
Standard
Robust
KNN
None
Euclidean
KNN
Norm
None
list[number]
for short). See include_text_features below to activate it.

Properties
Euclidean
KNN
Norm
None
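To illustrate the kind of override object that can be passed as feature_encoder (the exact property names here are assumptions, not taken from this reference), a configuration overriding a few defaults might look like:

```json
{
  "number": {"imputer": "Median", "scaler": "Robust"},
  "category": {"imputer": "MostFrequent", "encoder": "OneHot"}
}
```

The imputer, scaler and encoder values would correspond to the options listed above.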
embed_text
or embed_text_with_model.

{"column_name": weight, ...}
items. Will be scaled using the parameters weights_max and weights_exp before being applied, so only the relative weight of the columns is important here, not their absolute values.

Item properties
"column_name": numeric_weight
pair.
Each column name must refer to an existing column in the dataset.

Examples
{"date": 0.5, "age": 2}
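As an illustration only, the role of weights_max and weights_exp could be sketched in Python as follows. The formula here is hypothetical (this reference does not spell out the exact calculation); it only demonstrates that absolute weight values do not matter, while weights_max and weights_exp control the range and non-linearity of the mapping:

```python
# Hypothetical sketch (not Graphext's documented formula) of how weights_max
# and weights_exp could rescale relative column weights before application.
def scale_weights(weights, weights_max=32, weights_exp=2):
    largest = max(weights.values())
    # Normalize to the largest weight, stretch to weights_max, then raise to
    # weights_exp for a non-linear mapping; only ratios of the inputs matter.
    return {
        col: (w / largest * weights_max) ** weights_exp
        for col, w in weights.items()
    }

scaled = scale_weights({"date": 0.5, "age": 2})
```

Note that scale_weights({"date": 1, "age": 4}) would give the same result, since only the relative weights differ from the example above.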
"type": weight
items. Will be scaled using the parameters weights_max and weights_exp before being applied, so only the relative weight of the columns is important here, not their absolute values.

Properties
Number
Datetime
Category
Ordinal
Embedding (List[Number])
Multilabel (List[Category])

weights_max. This allows for a non-linear mapping from input weights to those used eventually to multiply the normalized columns.

euclidean
manhattan
chebyshev
minkowski
canberra
braycurtis
haversine
mahalanobis
wminkowski
seuclidean
cosine
correlation
hamming
jaccard
dice
russellrao
kulsinski
rogerstanimoto
sokalmichener
sokalsneath
yule
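For example, the metric could be overridden in a call such as the following (again using step(...) as a placeholder for this step's actual name):

```
step(ds, {"metric": "cosine"}) -> (ds.x, ds.y)
```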
If null is specified, a value will be selected based on the size of the input dataset (200 for large datasets, 500 for small).

When "pca", initializes with the first n_components from a principal component analysis. "tswspectral" is a cheaper alternative to "spectral". When "random",
assigns initial embedding positions at random. This uses the least amount of memory and time but may make UMAP
slower to converge on the optimal embedding. “auto” selects between “spectral” and “random” automatically
depending on the size of the dataset.

Values must be one of the following:

spectral
pca
tswspectral
random
auto
true. This approach is more computationally expensive, but avoids excessive memory use. Setting it to "auto" will enable this mode automatically depending on the size of the dataset.

Values must be one of the following:

True
False
auto
None
n_neighbors, you can have identical data points lying in different regions of your space. It also violates the definition of a metric. This option will remove duplicates before embedding, and then map the original data points back to the reduced space. Duplicate data points will be placed in the exact same location as the original data points.

If set to "auto", the step will try to determine an appropriate scale taking into account the number of nodes. If set to null, it only changes calculated coordinates to ensure they're within the allowed limits (16,000).

Options
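The deduplicate-and-map-back behaviour described for the unique option above can be sketched as follows. This is illustrative only; the placeholder reducer stands in for the actual UMAP embedding:

```python
import numpy as np

# Rows 0 and 2 are exact duplicates.
X = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0],
              [1.0, 2.0, 3.0]])

# Embed unique rows only, remembering each original row's representative.
uniq, inverse = np.unique(X, axis=0, return_inverse=True)
inverse = np.asarray(inverse).ravel()  # normalize shape across NumPy versions

def reduce_to_2d(rows):
    # Placeholder for the real reducer (UMAP); here we just take two columns.
    return rows[:, :2]

coords_unique = reduce_to_2d(uniq)

# Map every original row back to its representative's coordinates:
# duplicate rows land in exactly the same location.
coords = coords_unique[inverse]
```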