train_clustering

Density-based clustering with HDBSCAN. Generates a hierarchy of clusters, then automatically selects the best flat clustering based on the stability of clusters across a range of density thresholds. Roughly speaking, if a cluster's subclusters persist over a larger range of the density parameter than the parent cluster itself, the subclusters are selected; otherwise the parent is. The main parameter influencing cluster selection is min_cluster_size. The trained model can be used to predict the clusters of new data without changing the existing clustering.
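The selection rule described above can be sketched in a few lines of Python. This is an illustrative toy, not Graphext's or HDBSCAN's actual implementation: the Node class, the tree and the stability scores are all hypothetical, and real stabilities are computed from the data rather than given by hand.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    stability: float  # hypothetical, precomputed persistence score
    children: list[Node] = field(default_factory=list)


def select_clusters(node: Node) -> tuple[float, list[str]]:
    """Return (total stability, selected cluster names) for a subtree."""
    if not node.children:
        return node.stability, [node.name]
    child_total = 0.0
    child_names: list[str] = []
    for child in node.children:
        score, names = select_clusters(child)
        child_total += score
        child_names += names
    # Keep the subclusters only if together they persist longer than the parent.
    if child_total > node.stability:
        return child_total, child_names
    return node.stability, [node.name]


tree = Node("root", 1.0, [
    Node("A", 4.0, [Node("A1", 1.5), Node("A2", 1.0)]),
    Node("B", 2.0, [Node("B1", 2.5), Node("B2", 1.5)]),
])
total, selected = select_clusters(tree)
print(selected)  # ['A', 'B1', 'B2']
```

Here A is kept whole because its subclusters are less stable than A itself, while B is replaced by its two subclusters.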

Usage

The following example shows how the step can be used in a recipe.

Examples

Inputs & Outputs

The following are the inputs expected by the step and the outputs it produces. These are generally columns (ds.first_name), datasets (ds or ds[["first_name", "last_name"]]) or models (referenced by name e.g. "churn-clf").

Inputs

Outputs

Configuration

The following parameters can be used to configure the behaviour of the step by including them in a JSON object as the last “input” to the step, i.e. step(..., {"param": "value", ...}) -> (output).
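As a rough sketch, such a configuration object could be assembled and serialized in Python as below. The parameter names and defaults are taken from this page; the specific combination shown is only an example, and how the resulting JSON is placed into a recipe is not covered here.

```python
import json

# Example values only; every key is a parameter documented on this page.
config = {
    "encode_features": True,
    "feature_encoder": None,  # null: let Graphext choose the encoding
    "include_text_features": False,
    "params": {
        "min_cluster_size": 50,
        "min_samples": 5,
        "cluster_selection_epsilon": 0.0,
    },
}

config_json = json.dumps(config, indent=2)
print(config_json)
```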

Parameters

encode_features

boolean

default:"true"

Toggle encoding of feature columns. When enabled, Graphext will automatically convert columns of any type to a numeric type before fitting the model. How this conversion is done can be configured using the feature_encoder option below.

If disabled, any model trained in this step will assume that input data is already in an appropriate format (e.g. numerical and not containing any missing values).

feature_encoder

[null, object]

Configures encoding of feature columns. By default (null), Graphext chooses automatically how to convert any column types the model may not understand natively to a numeric type. A configuration object can be passed instead to override specific parameter values with respect to their defaults.

Properties

number

object

Numeric encoder. Configures encoding of numeric features.

Properties

bool

object

Boolean encoder. Configures encoding of boolean features.

Properties

ordinal

object

Ordinal encoder. Configures encoding of categorical features that have a natural order.

Properties

category

[object, object]

Category encoder. May contain either a single configuration for all categorical variables, or two different configurations for low- and high-cardinality variables. For further details pick one of the two options below.

Options

indicate_missing

boolean

Toggle the addition of a column using 0s and 1s to indicate where an input column contained missing values.

imputer

[null, string]

Whether and how to impute (replace/fill) missing values. Values must be one of the following:

MostFrequent
Const
None
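To make indicate_missing and the MostFrequent imputer more concrete, here is a minimal sketch of what they might do to a single categorical column. The function name and the use of None to mark missing values are assumptions for illustration; Graphext's actual implementation may differ.

```python
from collections import Counter


def impute_most_frequent(values):
    """Return (filled values, 0/1 missing indicator) for one column."""
    present = [v for v in values if v is not None]
    if not present:
        raise ValueError("column has no non-missing values")
    fill = Counter(present).most_common(1)[0][0]
    indicator = [1 if v is None else 0 for v in values]
    filled = [fill if v is None else v for v in values]
    return filled, indicator


colors = ["red", None, "blue", "red", None]
filled, was_missing = impute_most_frequent(colors)
print(filled)       # ['red', 'red', 'blue', 'red', 'red']
print(was_missing)  # [0, 1, 0, 0, 1]
```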

max_categories

[null, integer]

Maximum number of unique categories to encode. Only the N-1 most common categories will be encoded, and the rest will be grouped into a single “Others” category. Values must be in the following range:

1 ≤ max_categories < inf
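The grouping described for max_categories can be sketched as follows. The helper name cap_categories is hypothetical, and ties between equally common categories may be resolved differently in practice.

```python
from collections import Counter


def cap_categories(values, max_categories):
    """Keep the max_categories - 1 most common values; map the rest to 'Others'."""
    keep = {v for v, _ in Counter(values).most_common(max_categories - 1)}
    return [v if v in keep else "Others" for v in values]


cities = ["Madrid", "Madrid", "Madrid", "Paris", "Paris", "Rome", "Oslo"]
capped = cap_categories(cities, 3)
print(capped)
# ['Madrid', 'Madrid', 'Madrid', 'Paris', 'Paris', 'Others', 'Others']
```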

encoder

[null, string]

How to encode categories. Values must be one of the following:

OneHot
Label
Ordinal
Binary
Frequency
None

scaler

[null, string]

Whether and how to scale the final numerical values (across a single column). Values must be one of the following:

Standard
Robust
KNN
None
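This page does not define the individual scalers. Assuming that Standard and Robust carry their usual meanings (mean and standard deviation versus median and interquartile range, as in scikit-learn's scalers of the same names), the difference can be sketched like this; the function names are illustrative only.

```python
import statistics


def standard_scale(xs):
    """Centre on the mean and divide by the population standard deviation."""
    mean = statistics.fmean(xs)
    std = statistics.pstdev(xs)
    return [(x - mean) / std for x in xs]


def robust_scale(xs):
    """Centre on the median and divide by the interquartile range."""
    q1, median, q3 = statistics.quantiles(xs, n=4, method="inclusive")
    return [(x - median) / (q3 - q1) for x in xs]


values = [1.0, 2.0, 3.0, 4.0, 100.0]  # one extreme outlier
robust = robust_scale(values)
print(robust)  # [-1.0, -0.5, 0.0, 0.5, 48.5]
print(standard_scale(values))
```

With robust scaling the four typical values stay well spread out around zero, whereas with standard scaling the outlier inflates the standard deviation and squeezes them close together.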

multilabel

[object, object]

Multilabel encoder. Configures encoding of multivalued categorical features (variable length lists of categories, or the semantic type list[category] for short). May contain either a single configuration for all multilabel variables, or two different configurations for low- and high-cardinality variables. For further details pick one of the two options below.

Options

indicate_missing

boolean

Toggle the addition of a column using 0s and 1s to indicate where an input column contained missing values.

encoder

[null, string]

How to encode categories/labels in multilabel (list[category]) columns. Values must be one of the following:

Binarizer
TfIdf
None
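As an illustration of what a Binarizer-style encoding of a list[category] column typically produces (one 0/1 column per distinct label), consider the sketch below. The function name is hypothetical and the column order Graphext uses is not specified here.

```python
def binarize(rows):
    """Encode lists of labels as rows of 0/1 flags, one column per label."""
    labels = sorted({label for row in rows for label in row})
    matrix = [[1 if label in row else 0 for label in labels] for row in rows]
    return labels, matrix


tags = [["sci-fi", "drama"], ["drama"], []]
labels, matrix = binarize(tags)
print(labels)  # ['drama', 'sci-fi']
print(matrix)  # [[1, 1], [1, 0], [0, 0]]
```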

max_categories

[null, integer]

Maximum number of categories/labels to encode. If a number is provided, the result of the encoding will be reduced to this many dimensions (columns) using scikit-learn’s truncated SVD. When applied after a Tf-Idf encoding, this performs a kind of latent semantic analysis. Values must be in the following range:

2 ≤ max_categories < inf

scaler

[null, string]

How to scale the encoded (numerical) columns. Values must be one of the following:

Euclidean
KNN
Norm
None

datetime

object

Datetime encoder. Configures encoding of datetime (timestamp) features.

Properties

embedding

object

Embedding/vector encoder. Configures encoding of multivalued numerical features (variable length lists of numbers, i.e. vectors, or the semantic type list[number] for short).

Properties

text

object

Text encoder. Configures encoding of text (natural language) features. Currently only allows Tf-Idf embeddings to represent texts. If you wish to use other embeddings, e.g. semantic, Word2Vec etc., transform your text column first using another step, and then use that result instead of the original texts.

Texts are excluded by default from the overall encoding of the dataset. See the parameter include_text_features below to activate it.

Properties

indicate_missing

boolean

Toggle the addition of a column using 0s and 1s to indicate where an input column contained missing values.

encoder_params

object

Parameters to be passed to the text encoder (Tf-Idf parameters only for now). See scikit-learn’s documentation for detailed parameters and their explanation.

n_components

integer

How many output features to generate. The resulting Tf-Idf vectors will be reduced to this many dimensions (columns) using scikit-learn’s truncated SVD. This performs a kind of latent semantic analysis. By default we will reduce to 200 components. Values must be in the following range:

2 ≤ n_components ≤ 1024

scaler

[null, string]

How to scale the encoded (numerical) columns. Values must be one of the following:

Euclidean
KNN
Norm
None

include_text_features

boolean

default:"false"

Whether to include or ignore text columns during the processing of input data. Enabling this will convert texts to their TfIdf representation. Each text will be converted to an N-dimensional vector in which each component measures the relative “over-representation” of a specific word (or n-gram) relative to its overall frequency in the whole dataset. This is disabled by default because it will often be better to convert texts explicitly using a previous step, such as embed_text or embed_text_with_model.
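The idea of “over-representation” can be illustrated with a deliberately simplified Tf-Idf in plain Python. This is only a sketch: it uses whitespace tokenization and the textbook weight tf * log(N / df), whereas the encoder used by the step (configured via scikit-learn's Tf-Idf parameters) applies its own tokenization, smoothing and normalization.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return (vocabulary, one Tf-Idf vector per non-empty document)."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({t for doc in tokenized for t in doc})
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(t for doc in tokenized for t in set(doc))
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        vectors.append(
            [(counts[t] / len(doc)) * math.log(n_docs / df[t]) for t in vocab]
        )
    return vocab, vectors


docs = ["great phone great battery", "great price", "battery died"]
vocab, vectors = tfidf(docs)
first = dict(zip(vocab, vectors[0]))
print(vocab)  # ['battery', 'died', 'great', 'phone', 'price']
print(first["phone"] > first["great"])  # True
```

Although “great” occurs twice in the first text, “phone” receives the higher weight because it appears in no other text.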

params

object

Model parameters. Also see official HDBSCAN documentation for details.

Properties

min_cluster_size

integer

default:"50"

The minimum size of clusters. Intuitively, the smallest size grouping you wish to consider a cluster. When selecting a flat clustering from the cluster hierarchy, splits that contain fewer points than this will be considered points “falling out” of a cluster rather than a cluster splitting into two new clusters. Values must be in the following range:

1 ≤ min_cluster_size < inf

min_samples

integer

default:"5"

Determines how conservative the clustering is. The larger the value, the more points will be declared as noise, and clusters will be restricted to progressively more dense areas. Values must be in the following range:

1 ≤ min_samples < inf
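One way to build intuition for min_samples: HDBSCAN estimates density around a point from its core distance, the distance to its k-th nearest neighbour with k given by min_samples. The sketch below computes this for 1-D points; it is a simplification (implementations differ, for example, in whether a point counts itself as a neighbour), and the function name is illustrative.

```python
def core_distances(points, min_samples):
    """Distance from each 1-D point to its min_samples-th nearest other point."""
    result = []
    for i, p in enumerate(points):
        dists = sorted(abs(p - q) for j, q in enumerate(points) if j != i)
        result.append(dists[min_samples - 1])
    return result


points = [0, 1, 2, 3, 50]  # four close points and one isolated point
print(core_distances(points, 1))  # [1, 1, 1, 1, 47]
print(core_distances(points, 3))  # [3, 2, 2, 3, 49]
```

Raising min_samples increases every core distance, so all points look sparser and those at the edge of a group are more likely to end up as noise.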

cluster_selection_epsilon

number

default:"0.0"

Distance threshold. Clusters below this value will be merged. If default parameters result in areas with a large number of micro-clusters, this parameter can help merge these clusters together. For example, set the value to 0.5 if you don’t want to separate clusters that are less than 0.5 units apart (the distance distribution depends on your specific data). Values must be in the following range:

0.0 ≤ cluster_selection_epsilon < inf

cluster_selection_method

string

default:"eom"

Method used to select clusters from the cluster hierarchy. The default, “excess of mass” (eom), can sometimes pick one or two large clusters and then a number of small extra clusters. If you’re interested in a more fine-grained clustering with a larger number of more homogeneously sized clusters, you may prefer selecting leaf clustering (leaf). Values must be one of the following:

eom
leaf

seed

integer

Seed for random number generator ensuring reproducibility. Values must be in the following range:

0 ≤ seed < inf
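The documented ranges for the params values above can be checked before running a recipe. The validator below is a hypothetical helper, not part of Graphext; it uses the defaults listed on this page and assumes “eom” (the identifier used in the description of excess of mass) and “leaf” as the accepted selection methods.

```python
ALLOWED_METHODS = {"eom", "leaf"}  # assumed identifiers, see lead-in


def validate_params(params):
    """Return a list of problems; an empty list means all values are in range."""
    problems = []
    if params.get("min_cluster_size", 50) < 1:
        problems.append("min_cluster_size must be >= 1")
    if params.get("min_samples", 5) < 1:
        problems.append("min_samples must be >= 1")
    if params.get("cluster_selection_epsilon", 0.0) < 0.0:
        problems.append("cluster_selection_epsilon must be >= 0.0")
    if params.get("cluster_selection_method", "eom") not in ALLOWED_METHODS:
        problems.append("cluster_selection_method must be 'eom' or 'leaf'")
    return problems


print(validate_params({"min_cluster_size": 20, "cluster_selection_method": "leaf"}))
# []
print(validate_params({"min_samples": 0, "cluster_selection_epsilon": -1.0}))
# ['min_samples must be >= 1', 'cluster_selection_epsilon must be >= 0.0']
```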
