Introduction
UMAP and our proprietary k-NN graph treat the relative balance between categorical and numeric data differently, leading most of the time to qualitatively different embeddings. The k-NNG method has a tendency to clearly separate the network into clusters such that each cluster corresponds to a combination of categories (across different variables). UMAP tends to not divide the network as sharply based on categorical columns, yet can be configured to give them more or less influence on the resulting embedding.Example 1 – Human resources data
As a first example, we use a human resources dataset with 10 numeric and 2 categorical columns. The categorical columns represent the salary of employees (3 levels) and their department (10 levels). We will also be showing two numeric columns: the number of projects (integer values from 2 to 7), and satisfaction level (continuous in [0, 1]).k-NNG embedding
The k-NNG method produces the following layout:



UMAP embedding
Using UMAP has the advantage of giving us more influence on the relative importance of categorical vs numeric variables. In Graphext, this can be done with thetype_weights parameter, e.g.
















Example 2 – Titanic dataset
As a another example we will look at (a subset of) the Titanic dataset, which for each passenger contains 3 categorical columns (sex, passenger class, and port of embarkment), and 4 numeric columns (age, number of siblings and spouses aboard, number of parents and children aboard, and fare price).k-NNG embedding
The k-NNG method produces the following layout (double-click on an image to see a bigger version):



UMAP embedding
For this dataset, using UMAP with default column weights seems to achieve a good balance between categorical and numerical variables out of the box (note that in the following figures we changed themin_dist, and n_epochs parameters only, as with default values the resulting layout is rather thinly spread):



