Equivalent to cluster_dataset, but expects a column of embeddings as input instead of a whole dataset. The input may e.g. be word2vec embeddings from an embed_text step, or whole-dataset embeddings from an embed_dataset step.

Optionally reduces the dimensionality of the embeddings first (by default using UMAP). This may make the data denser (counteracting the “curse of dimensionality”), and thus make it easier to identify clusters.

The clustering algorithm used by default is HDBSCAN, which produces a column of non-negative cluster IDs, or -1 if a data point is considered noise (i.e. not belonging to any cluster).

For further detail on HDBSCAN’s parameters, see its documentation here (for usage) and here (for its API).

Usage

The following example shows how the step can be used in a recipe.
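As a minimal sketch (the step name cluster_embeddings and the column names ds.embedding and ds.cluster are assumed here purely for illustration), such a recipe step could look like:

```
cluster_embeddings(ds.embedding) -> (ds.cluster)
```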

Inputs & Outputs

The following are the inputs expected by the step and the outputs it produces. These are generally columns (ds.first_name), datasets (ds or ds[["first_name", "last_name"]]), or models (referenced by name, e.g. "churn-clf").

Configuration

The following parameters can be used to configure the behaviour of the step by including them in a JSON object as the last “input” to the step, i.e. step(..., {"param": "value", ...}) -> (output).
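For example, a configured call might look like the sketch below. The parameter names n_components (for the dimensionality reduction) and min_cluster_size (for HDBSCAN) are common options in the underlying libraries and are used here as assumptions only; check the step's parameter reference for the exact names it accepts.

```
cluster_embeddings(ds.embedding, {
    "n_components": 2,
    "min_cluster_size": 10
}) -> (ds.cluster)
```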