Figure 1: k-NNG flowchart. First, each row is associated with no more than the k most similar rows in the dataset, creating a (still abstract) graph. A layout is then calculated that positions the graph nodes in two dimensions (x/y) such that similar, connected rows are positioned close to each other. In a final step, the graph is clustered so that closely related groups of rows can be identified more easily.
`link_similar_rows` creates the abstract graph. Given an input dataset, the step's output is a new dataset (`links`) in which each row is a link connecting one row to another. It has three columns: source, target and weight. The first two refer to the ids (row numbers) of the rows connected by the link, while the weight column contains the numerical similarity between the two rows.
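To make this concrete, here is a minimal sketch of what such a step might do conceptually, using scikit-learn's NearestNeighbors and pandas. The actual implementation of `link_similar_rows` is internal and may differ; the distance-to-weight conversion below is an arbitrary illustrative choice.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import NearestNeighbors

def link_similar_rows(X: np.ndarray, k: int = 5) -> pd.DataFrame:
    """Build a k-NN links table with source, target and weight columns."""
    # k + 1 because each row's nearest neighbour is itself
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    distances, indices = nn.kneighbors(X)

    links = []
    for source, (dists, targets) in enumerate(zip(distances, indices)):
        for dist, target in zip(dists, targets):
            if target == source:
                continue  # skip self-links
            # convert distance into a similarity-like weight (illustrative)
            links.append((source, target, 1.0 / (1.0 + dist)))
    return pd.DataFrame(links, columns=["source", "target", "weight"])
```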
To calculate node x/y positions we use the step `layout_network`. This takes the just-generated links dataset and adds x and y columns to the original rows, such that each row/node now has an associated position.
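Conceptually this is a graph drawing problem. A rough stand-in, assuming a force-directed layout via networkx's spring layout (the step's actual layout algorithm may differ):

```python
import networkx as nx
import pandas as pd

def layout_network(links: pd.DataFrame) -> pd.DataFrame:
    """Assign x/y positions so that strongly linked nodes end up close together."""
    G = nx.from_pandas_edgelist(links, "source", "target", edge_attr="weight")
    pos = nx.spring_layout(G, weight="weight", seed=42)  # force-directed layout
    return pd.DataFrame(
        [(node, x, y) for node, (x, y) in pos.items()],
        columns=["node", "x", "y"],
    )
```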
Finally, we detect clusters in the graph with `cluster_network`. This step also receives the links dataset as input, and returns a simple categorical column containing the cluster labels.
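Again as a sketch only: community detection on the same links table, here using networkx's Louvain implementation (whether `cluster_network` uses Louvain specifically is an assumption):

```python
import networkx as nx
import pandas as pd

def cluster_network(links: pd.DataFrame) -> pd.Series:
    """Return a categorical cluster label for each node in the links table."""
    G = nx.from_pandas_edgelist(links, "source", "target", edge_attr="weight")
    communities = nx.community.louvain_communities(G, weight="weight", seed=42)
    labels = {node: i for i, nodes in enumerate(communities) for node in nodes}
    return pd.Series(labels, name="cluster").sort_index().astype("category")
```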
The number of most similar rows each row is linked to (k) can be configured via the `n_similar_docs` parameter of the `link_similar_rows` step.
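For example, assuming the same JSON parameter notation used further below, one might pass `{"n_similar_docs": 3}` to link each row to at most 3 neighbours (the value 3 being an arbitrary illustration).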
Figure 2: Dimensionality reduction flowchart. A dataset is first reduced to two dimensions to generate a layout (x/y positions). It is then reduced a second time to ~10 dimensions. This latter dataset is then used to calculate a clustering (using e.g. HDBSCAN), as well as to identify each row's nearest neighbors, which allows linking the laid-out nodes into a connected graph.
`layout_network` is used to reduce the input dataset to 2 dimensions (x/y) using UMAP. Next, `embed_dataset` employs UMAP again to create 10-dimensional embeddings. In the resulting output column, each original dataset row is now represented by a 10-dimensional vector (a list of 10 numbers).
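Using the open-source umap-learn package, the two reductions might look roughly like this (the data and parameter values are illustrative only):

```python
import numpy as np
import umap  # pip install umap-learn

X = np.random.rand(500, 20)  # stand-in for the (numeric) input dataset

# 2-dimensional reduction for the layout (x/y positions)
xy = umap.UMAP(n_components=2, random_state=42).fit_transform(X)

# 10-dimensional embedding used for clustering and k-NN linking
embedding = umap.UMAP(n_components=10, random_state=42).fit_transform(X)

print(xy.shape, embedding.shape)  # (500, 2) (500, 10)
```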
Having embedded the dataset in 10 dimensions, we then use `cluster_embeddings`, which by default employs the HDBSCAN algorithm, to identify clusters in the dataset. Finally, `link_embeddings` calculates for each embedded row its k nearest neighbours. The output is again a dataset of links indicating which nodes should be connected, as well as a quantitative measure of how similar they are.
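A sketch of these two steps on the 10-dimensional `embedding` array from the previous snippet, using the hdbscan package and scikit-learn (a conceptual stand-in for the actual steps; `min_cluster_size` and `k` are illustrative):

```python
import pandas as pd
import hdbscan  # pip install hdbscan
from sklearn.neighbors import NearestNeighbors

# cluster_embeddings: HDBSCAN labels, with -1 marking noise points
labels = hdbscan.HDBSCAN(min_cluster_size=10).fit_predict(embedding)

# link_embeddings: k nearest neighbours in embedding space
k = 5
nn = NearestNeighbors(n_neighbors=k + 1).fit(embedding)
distances, indices = nn.kneighbors(embedding)

links = pd.DataFrame(
    [
        (src, tgt, 1.0 / (1.0 + dist))
        for src, (dists, tgts) in enumerate(zip(distances, indices))
        for dist, tgt in zip(dists, tgts)
        if tgt != src  # skip self-links
    ],
    columns=["source", "target", "weight"],
)
```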
To speed up UMAP, `random_state` can be set to null (`{"random_state": null}`). This enables parallel execution of the UMAP algorithm. Parallelism is still an experimental feature and so may or may not result in faster performance. It will mean, however, that repeated executions of the step, even with identical input data, may produce different results. Note that these differences are usually negligible, i.e. they will not change any of the general structures inherent in the data, and therefore won't usually affect any insights derived from it.

Another option is to set `{"init": null}`, which should be faster, at the cost of potentially being less precise.

A third lever is `n_epochs` (which is 200 by default for large datasets and 500 for smaller ones). Setting this to smaller values means the algorithm will have less time to push clusters apart, but also less time to converge on the "optimal" solution.
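These options map onto parameters of the umap-learn package; assuming that a null `init` in the step configuration corresponds to a non-spectral initialization, a faster configuration might look like this (values are illustrative):

```python
import umap

# Faster but non-deterministic and potentially less precise configuration
fast_umap = umap.UMAP(
    n_components=2,
    random_state=None,   # no fixed seed -> allows parallel execution
    init="random",       # cheaper than the default spectral initialization
    n_epochs=100,        # fewer optimization epochs than the 200/500 defaults
)
```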
Figure 3: The Iris dataset laid out with UMAP and nodes colored using the true label (species). One can easily see that almost all nodes of the same species fall into a particular region/cluster. The few nodes that get mixed up with other species correspond exactly to those examples with unusual characteristics that make them appear similar to another species.
Figure 4: Selected HDBSCAN clusters in the Iris project, and the number of nodes belonging to each species. Cluster 0 has only nodes of the “setosa” species, while cluster 2 almost only contains nodes with the “versicolor” label.
Example projects