Applies a clustering algorithm after vectorizing the input dataset (converting its columns to numeric-only values with no missing data) and, optionally, reducing its dimensionality.

This is essentially equivalent to running the separate step vectorize_dataset and then applying a clustering algorithm (HDBSCAN by default). The result is a column of cluster IDs.

For further detail on HDBSCAN’s parameters, see its documentation: the usage guide and the API reference.

Usage

The following example shows how the step can be used in a recipe.

Inputs & Outputs

The following are the inputs expected by the step and the outputs it produces. These are generally columns (ds.first_name), datasets (ds or ds[["first_name", "last_name"]]) or models (referenced by name e.g. "churn-clf").

Configuration

The following parameters can be used to configure the behaviour of the step. Include them in a JSON object passed as the last “input” to the step, i.e. step(..., {"param": "value", ...}) -> (output).