Applies a clustering algorithm after vectorizing the input dataset (converting its columns to numeric values with no missing data), and optionally reducing its dimensionality.

Essentially applies the separate vectorize_dataset step, followed by a clustering algorithm (HDBSCAN by default). The result is a column of cluster IDs.

For further detail on HDBSCAN’s parameters, see its documentation here (for usage) and here (for its API).
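
As a rough standalone equivalent (not the step’s internal implementation), the sketch below vectorizes a DataFrame and clusters it with the open-source hdbscan and scikit-learn packages. The function name, preprocessing choices (one-hot encoding, median imputation, scaling) and defaults are illustrative assumptions.

```python
import pandas as pd
import hdbscan
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

def cluster_dataframe(df: pd.DataFrame, min_cluster_size: int = 120,
                      min_samples: int = 15, metric: str = "euclidean") -> pd.Series:
    # "Vectorize": one-hot encode categorical columns, impute missing values, scale.
    X = pd.get_dummies(df, dummy_na=True)
    X = SimpleImputer(strategy="median").fit_transform(X)
    X = StandardScaler().fit_transform(X)

    # Cluster; HDBSCAN labels noise points with -1.
    clusterer = hdbscan.HDBSCAN(min_cluster_size=min_cluster_size,
                                min_samples=min_samples, metric=metric)
    labels = clusterer.fit_predict(X)

    # Result: one cluster ID per input row.
    return pd.Series(labels, index=df.index, name="cluster")
```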

metric
string
default: "euclidean"

The metric used to calculate the distance (dissimilarity) between data points.

Values must be one of the following:

  • euclidean
  • manhattan
  • chebyshev
  • minkowski
  • canberra
  • braycurtis
  • haversine
  • mahalanobis
  • wminkowski
  • seuclidean
  • cosine
  • correlation
  • hamming
  • jaccard
  • dice
  • russellrao
  • kulsinski
  • rogerstanimoto
  • sokalmichener
  • sokalsneath
  • yule
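
For illustration, the metric is simply forwarded to the underlying clusterer. A minimal sketch using the open-source hdbscan package, with X standing in for the already vectorized dataset:

```python
import hdbscan

# Use Manhattan distance instead of the default Euclidean.
clusterer = hdbscan.HDBSCAN(metric="manhattan", min_cluster_size=120, min_samples=15)
# labels = clusterer.fit_predict(X)  # X: the vectorized dataset
```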

algorithm
string
default: "hdbscan"

The clustering algorithm to use, given as the name of a supported algorithm (currently only "hdbscan" is allowed).

Values must be one of the following:

  • hdbscan
min_cluster_size
integer
default: "120"

Minimum cluster size: the smallest number of data points a dense region must contain to be considered a proper cluster.

Values must be in the following range:

1 ≤ min_cluster_size < inf
min_samples
integer
default: "15"

The number of samples in a neighbourhood for a point to be considered a core point. The larger the value, the more conservative the clustering: more points will be declared noise, and clusters will be restricted to progressively denser areas.

Values must be in the following range:

1 ≤ min_samples < inf
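
To see the effect of these two parameters, here is a small sketch on synthetic data (a hypothetical example using numpy and the hdbscan package, not taken from the platform). A larger min_samples typically leaves more points labelled as noise (cluster ID -1):

```python
import numpy as np
import hdbscan

# Two well-separated Gaussian blobs as toy data.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (500, 2)), rng.normal(6, 1, (500, 2))])

# Same min_cluster_size, permissive vs. conservative min_samples.
loose = hdbscan.HDBSCAN(min_cluster_size=50, min_samples=1).fit_predict(X)
strict = hdbscan.HDBSCAN(min_cluster_size=50, min_samples=25).fit_predict(X)

print("noise points (loose): ", int(np.sum(loose == -1)))
print("noise points (strict):", int(np.sum(strict == -1)))
```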
reduce
[object, null]

UMAP configuration: parameters for the optional dimensionality reduction applied before clustering. See more here.
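
A minimal sketch of combining the optional reduction with clustering, assuming the open-source umap-learn and hdbscan packages; the parameter values shown are illustrative, not the step’s defaults:

```python
import hdbscan
import umap

def reduce_then_cluster(X, n_components=10, min_cluster_size=120, min_samples=15):
    # Project the vectorized data into a lower-dimensional space with UMAP.
    embedding = umap.UMAP(n_components=n_components, n_neighbors=15,
                          min_dist=0.0).fit_transform(X)
    # Cluster the reduced embedding; returns one cluster ID per row (-1 = noise).
    return hdbscan.HDBSCAN(min_cluster_size=min_cluster_size,
                           min_samples=min_samples).fit_predict(embedding)
```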