Density-based clustering with “HDBSCAN”

Generates a hierarchy of clusters, then automatically selects the best flat clustering based on the stability of clusters across a range of density thresholds. Roughly speaking, if a cluster’s subclusters persist over a larger range of the density parameter than the parent cluster itself, the subclusters are selected; otherwise the parent is. The main parameter influencing cluster selection is min_cluster_size.
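The selection rule can be illustrated with a minimal Python sketch. It is not this step's implementation: the Node class, the select function and the toy trees are hypothetical, and the sketch ignores details such as noise points and any special handling of the root. It assumes the usual HDBSCAN definition of stability, where lambda is the inverse of the distance threshold and a cluster's stability is the sum, over its points, of the lambda at which the point leaves the cluster minus the lambda at which the cluster was born.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Node:
    """One cluster in a toy cluster hierarchy (hypothetical structure)."""
    name: str
    birth: float                 # lambda at which the cluster appears
    point_lambdas: List[float]   # lambda at which each point leaves the cluster
    children: List["Node"] = field(default_factory=list)

    def stability(self) -> float:
        # Sum of how long each point persisted in this cluster.
        return sum(lam - self.birth for lam in self.point_lambdas)


def select(node: Node) -> Tuple[float, List[str]]:
    """Return (total stability, names of selected clusters) for a subtree."""
    own = node.stability()
    if not node.children:
        return own, [node.name]
    child_total = 0.0
    child_names: List[str] = []
    for child in node.children:
        score, names = select(child)
        child_total += score
        child_names += names
    # Keep the subclusters only if together they are more stable than the parent.
    if child_total > own:
        return child_total, child_names
    return own, [node.name]


# Parent splits early (lambda 0.2); its children persist much longer.
short_lived_parent = Node("P", 0.1, [0.2] * 10, [
    Node("A", 0.2, [1.0] * 5),
    Node("B", 0.2, [0.6] * 5),
])

# Parent persists until lambda 1.0; its children disappear soon after.
long_lived_parent = Node("P", 0.1, [1.0] * 10, [
    Node("A", 1.0, [1.2] * 5),
    Node("B", 1.0, [1.2] * 5),
])

print(select(short_lived_parent)[1])  # subclusters A and B are selected
print(select(long_lived_parent)[1])   # the parent P is selected
```

In the first tree the subclusters outlive the parent by a wide margin, so they win; in the second the parent accounts for most of the persistence, so it is kept as a single cluster.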

Can be used to predict the clusters of new data without changing the existing clustering.

Usage

The following example shows how the step can be used in a recipe.

Inputs & Outputs

The following are the inputs expected by the step and the outputs it produces. These are generally columns (ds.first_name), datasets (ds or ds[["first_name", "last_name"]]) or models (referenced by name e.g. "churn-clf").

Configuration

The following parameters can be used to configure the behaviour of the step by including them in a JSON object as the last “input” to the step, i.e. step(..., {"param": "value", ...}) -> (output).
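As a rough illustration of that pattern, the following Python sketch builds such a configuration object and places it as the last input. The value 25 is an arbitrary assumption for illustration only, and "step" and "output" are the same placeholders used above, not real step or column names. min_cluster_size is the only parameter named on this page; larger values generally produce fewer, larger clusters.

```python
import json

# Hypothetical value chosen for illustration; tune it for your data.
config = {"min_cluster_size": 25}

# "step", "..." and "output" are placeholders copied from the pattern above.
call = f"step(..., {json.dumps(config)}) -> (output)"
print(call)  # step(..., {"min_cluster_size": 25}) -> (output)
```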