Density-based clustering with “HDBSCAN”

Generates a hierarchy of clusters, then automatically selects the best flat clustering based on the stability of clusters across a range of density thresholds. Roughly speaking, if a cluster’s subclusters persist over a larger range of the density parameter than the parent cluster itself, the subclusters are selected; otherwise the parent is. The main parameter influencing cluster selection is min_cluster_size.
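The selection rule described above can be illustrated with a small toy sketch. It assumes a hypothetical `Node` structure and made-up stability values. It is not the data structure or code used by Graphext or the HDBSCAN library, only an illustration of the "children versus parent" comparison.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Node:
    """A cluster in a toy hierarchy (hypothetical structure for illustration)."""
    name: str
    stability: float  # how long the cluster persists across density levels
    children: list[Node] = field(default_factory=list)


def select_clusters(node: Node) -> tuple[float, list[str]]:
    """Return (total stability, selected cluster names) for this subtree."""
    if not node.children:
        # A leaf has no subclusters, so it is always its own best choice.
        return node.stability, [node.name]

    child_total = 0.0
    child_names: list[str] = []
    for child in node.children:
        stability, names = select_clusters(child)
        child_total += stability
        child_names.extend(names)

    # Keep the subclusters only if together they are more stable than the parent.
    if child_total > node.stability:
        return child_total, child_names
    return node.stability, [node.name]


# Made-up stability values, chosen only to show both outcomes of the rule.
tree = Node("root", 10.0, [
    Node("A", 8.0, [Node("A1", 3.0), Node("A2", 2.0)]),
    Node("B", 5.0),
])

total, selected = select_clusters(tree)
print(selected)  # ['A', 'B']
```

In this toy tree, A is kept over A1 and A2 because 3 + 2 is less than 8, while the root is replaced by A and B because 8 + 5 exceeds 10.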

Can be used to predict the clusters of new data without changing the existing clustering.

encode_features
boolean
default: "true"

Toggle encoding of feature columns. When enabled, Graphext will automatically convert columns of any type to a numeric type before fitting the model. How this conversion is done can be configured using the feature_encoder option below.

If disabled, any model trained in this step will assume that input data is already in an appropriate format (e.g. numerical and not containing any missing values).

feature_encoder
[null, object]

Configures encoding of feature columns. By default (null), Graphext chooses automatically how to convert any column types the model may not understand natively to a numeric type.

A configuration object can be passed instead to override the default values of specific parameters.

include_text_features
boolean

Whether to include or ignore text columns during the processing of input data. Enabling this will convert texts to their TF-IDF representation. Each text will be converted to an N-dimensional vector in which each component measures how over-represented a specific word (or n-gram) is in that text relative to its overall frequency in the whole dataset. This is disabled by default because it will often be better to convert texts explicitly using a previous step, such as embed_text or embed_text_with_model.
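As a rough illustration of what a TF-IDF representation measures, here is a minimal sketch using only the Python standard library. It assumes whitespace tokenization, single words only and the plain `tf * log(N / df)` weighting. The encoder Graphext actually uses may tokenize, smooth or normalize differently, and the function name `tfidf` is hypothetical.

```python
import math
from collections import Counter


def tfidf(docs: list[str]) -> list[dict[str, float]]:
    """Return one {word: weight} mapping per document (sparse TF-IDF vectors)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each word occur?
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            # Term frequency in this text, scaled down for words common everywhere.
            word: (count / len(tokens)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return vectors


vectors = tfidf(["red apple", "red car"])
```

Here "red" occurs in every text, so its weight is zero, while "apple" and "car" get positive weights because each is specific to one text.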

params
object

Model parameters. See also the official HDBSCAN documentation for details.

seed
integer

Seed for the random number generator, ensuring reproducibility.

Values must be in the following range:

0 ≤ seed < inf
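The effect of a fixed seed can be shown with a generic sketch using Python's standard `random` module. It is not Graphext's internal code, and `sample_points` is a hypothetical helper. The same seed always yields the same sequence of pseudo-random numbers, so any step that depends on them produces the same result on every run.

```python
import random


def sample_points(seed: int, n: int = 3) -> list[float]:
    """Draw n pseudo-random numbers from a generator seeded explicitly."""
    rng = random.Random(seed)  # independent generator, unaffected by global state
    return [rng.random() for _ in range(n)]


first_run = sample_points(seed=42)
second_run = sample_points(seed=42)
other_seed = sample_points(seed=43)
```

Here `first_run` and `second_run` are identical, while `other_seed` holds a different sequence.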