Equivalent to cluster_dataset, but expects a column of embeddings as input instead of a dataset. The input may be, for example, word2vec embeddings from an embed_text step or whole-dataset embeddings from an embed_dataset step.

Optionally reduces the dimensionality of the embeddings before clustering (using UMAP by default). This can make the data denser, counteracting the “curse of dimensionality”, and may make clusters easier to identify.
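To illustrate why high-dimensional data is harder to cluster, the following minimal sketch measures how the contrast between the nearest and the farthest neighbour shrinks as dimensionality grows. It uses uniformly random points and only the Python standard library; the function name `distance_contrast` is hypothetical, and the sketch does not perform UMAP or any other reduction.

```python
import math
import random


def distance_contrast(n_points: int, n_dims: int, seed: int = 0) -> float:
    """Relative spread of distances from one random point to all others.

    Returns (max_dist - min_dist) / min_dist. Values close to 0 mean that
    all points look roughly equally far away, which blurs dense regions.
    """
    rng = random.Random(seed)
    points = [[rng.random() for _ in range(n_dims)] for _ in range(n_points)]
    query, others = points[0], points[1:]
    dists = [math.dist(query, p) for p in others]
    return (max(dists) - min(dists)) / min(dists)


low_dim_contrast = distance_contrast(n_points=200, n_dims=2)
high_dim_contrast = distance_contrast(n_points=200, n_dims=500)
print(f"contrast in 2 dimensions:   {low_dim_contrast:.2f}")
print(f"contrast in 500 dimensions: {high_dim_contrast:.2f}")
```

In the high-dimensional case the contrast is much smaller, so density-based methods have less signal to separate clusters from background.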

The default clustering algorithm is HDBSCAN, which produces a column of positive cluster IDs, with -1 assigned to any data point considered noise (not belonging to any cluster).
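As a sketch of how the resulting column might be inspected downstream, the snippet below counts points per cluster and computes the share of noise points. The `labels` list is made-up example data and `summarize_clusters` is a hypothetical helper, not part of this step.

```python
from collections import Counter


def summarize_clusters(labels):
    """Count points per cluster and the share of points labelled as noise."""
    counts = Counter(labels)
    n_noise = counts.pop(-1, 0)  # -1 marks noise, not a cluster
    noise_fraction = n_noise / len(labels) if labels else 0.0
    return dict(counts), noise_fraction


# Hypothetical output column for eight data points.
labels = [1, 1, 2, -1, 2, 2, -1, 1]
cluster_sizes, noise_fraction = summarize_clusters(labels)
print(cluster_sizes)   # {1: 3, 2: 3}
print(noise_fraction)  # 0.25
```

A high noise fraction is often a hint that min_cluster_size or min_samples is too strict for the data.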

For further detail on HDBSCAN’s parameters see its documentation here (for usage) and here (for its API).

metric
string
default:
"euclidean"

The metric used to calculate the distance between data points.

Values must be one of the following:

  • euclidean
  • manhattan
  • chebyshev
  • minkowski
  • canberra
  • braycurtis
  • haversine
  • mahalanobis
  • wminkowski
  • seuclidean
  • cosine
  • correlation
  • hamming
  • jaccard
  • dice
  • russellrao
  • kulsinski
  • rogerstanimoto
  • sokalmichener
  • sokalsneath
  • yule
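The choice of metric changes which points count as close. The sketch below compares two of the listed options, euclidean and cosine, using hand-written standard-library implementations (the function names are illustrative, not this step's internals). Two vectors pointing in the same direction but with different lengths are far apart under euclidean distance and essentially identical under cosine distance.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance; sensitive to vector magnitude."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    """1 - cosine similarity; depends only on the angle between vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


v = [1.0, 2.0, 3.0]
w = [2.0, 4.0, 6.0]  # same direction as v, twice the length

print("euclidean:", euclidean_distance(v, w))
print("cosine:   ", cosine_distance(v, w))
```

For embeddings whose length carries little meaning, cosine is a common choice for this reason.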

algorithm
string
default:
"hdbscan"

Algorithm to use. The name of a supported clustering algorithm (currently allows "hdbscan" only).

Values must be one of the following:

  • hdbscan

min_cluster_size
integer
default:
120

Minimum cluster size. The minimum size for considering a region of dense data points a proper cluster.

Values must be in the following range:

1 ≤ min_cluster_size < inf

min_samples
integer
default:
15

The larger the value, the more conservative the clustering: more points are declared noise, and clusters are restricted to progressively denser areas.

Values must be in the following range:

1 ≤ min_samples < inf
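One way to see why larger values are more conservative: HDBSCAN estimates local density from a point's core distance, the distance to its k-th nearest neighbour, with k tied to min_samples. The toy sketch below computes that quantity on one-dimensional data. The helper `core_distance` is hypothetical, and whether the point itself is counted among its neighbours differs between implementations; here it is excluded.

```python
def core_distance(points, index, min_samples):
    """Distance from points[index] to its min_samples-th nearest neighbour."""
    target = points[index]
    neighbour_dists = sorted(
        abs(target - other) for i, other in enumerate(points) if i != index
    )
    return neighbour_dists[min_samples - 1]


# One-dimensional toy data: a tight group near 0 and a sparse tail.
points = [0.0, 0.1, 0.2, 0.3, 2.0, 4.0, 7.0]

for k in (1, 3, 5):
    cores = [core_distance(points, i, k) for i in range(len(points))]
    print(f"min_samples={k}:", [round(c, 1) for c in cores])
```

Core distances never shrink as k grows, so every point looks sparser with a larger min_samples, and points in the tail cross the noise threshold first.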
reduce
[object, null]

UMAP configuration: parameters for dimensionality reduction. See more here.
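The constraints documented above can be collected into a small validation sketch. This is not the step's own implementation: `validate_params` and the dictionary layout are assumptions made for illustration, while the allowed values, defaults, and ranges are the ones listed in this reference. Since the contents of reduce are not specified here, the sketch only checks that it is an object or null.

```python
ALLOWED_METRICS = {
    "euclidean", "manhattan", "chebyshev", "minkowski", "canberra",
    "braycurtis", "haversine", "mahalanobis", "wminkowski", "seuclidean",
    "cosine", "correlation", "hamming", "jaccard", "dice", "russellrao",
    "kulsinski", "rogerstanimoto", "sokalmichener", "sokalsneath", "yule",
}
ALLOWED_ALGORITHMS = {"hdbscan"}
DEFAULTS = {
    "metric": "euclidean",
    "algorithm": "hdbscan",
    "min_cluster_size": 120,
    "min_samples": 15,
}


def validate_params(params):
    """Merge params over the documented defaults and check each constraint."""
    merged = {**DEFAULTS, **params}
    if merged["metric"] not in ALLOWED_METRICS:
        raise ValueError(f"unsupported metric: {merged['metric']!r}")
    if merged["algorithm"] not in ALLOWED_ALGORITHMS:
        raise ValueError(f"unsupported algorithm: {merged['algorithm']!r}")
    for name in ("min_cluster_size", "min_samples"):
        value = merged[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int) or value < 1:
            raise ValueError(f"{name} must be an integer >= 1, got {value!r}")
    if "reduce" in merged and not isinstance(merged["reduce"], (dict, type(None))):
        raise ValueError("reduce must be an object or null")
    return merged


print(validate_params({"metric": "cosine", "min_samples": 5}))
```

Passing an out-of-range value such as `{"min_samples": 0}` raises a `ValueError` in this sketch.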