This step calculates an embedding for each category using the GloVe vectors provided by spaCy’s models. Because similar words have similar embeddings, the embeddings are used to cluster the categories, producing new categories that group the original ones.

distance_threshold
number
default: "0.1"

Determines which categories will be merged. After hierarchically clustering all categories, clusters of categories closer than this distance will be merged into one.

Also see details in scikit-learn’s Agglomerative Clustering.

Values must be in the following range:

0 ≤ distance_threshold ≤ 1
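
The effect of the threshold can be illustrated with a minimal, dependency-free sketch. It is not the step's actual implementation: the toy 2-dimensional vectors, `cosine_distance` and `merge_categories` are hypothetical names introduced here for illustration only, and real GloVe embeddings have many more dimensions. The sketch assumes "single" linkage, in which case cutting the hierarchy at a threshold is equivalent to grouping all categories connected by a chain of pairwise distances below that threshold.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def merge_categories(embeddings, distance_threshold=0.1):
    """Group categories linked by cosine distances below the threshold.

    Hypothetical helper: mimics cutting a single-linkage hierarchy at
    `distance_threshold` using a small union-find structure.
    """
    names = list(embeddings)
    parent = {name: name for name in names}

    def find(name):
        # Follow parent links to the representative of the group.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if cosine_distance(embeddings[a], embeddings[b]) < distance_threshold:
                parent[find(a)] = find(b)

    groups = {}
    for name in names:
        groups.setdefault(find(name), []).append(name)
    return sorted(sorted(group) for group in groups.values())


# Toy embeddings, invented for this example (not real GloVe vectors).
toy_embeddings = {
    "car": (1.0, 0.1),
    "automobile": (1.0, 0.2),
    "banana": (0.1, 1.0),
    "apple": (0.3, 1.0),
}

print(merge_categories(toy_embeddings, distance_threshold=0.1))
```

With these toy vectors, a threshold of 0.1 merges "car" with "automobile" and "apple" with "banana", a much smaller threshold leaves every category on its own, and a threshold of 1 merges everything into a single group.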

linkage
string
default: "single"

Which linkage criterion to use in the clustering. While the distance metric applied is always the cosine distance between category embeddings, this parameter determines how the distance between clusters of embeddings is calculated, e.g. by selecting the maximum distance between categories in two clusters (“complete”), the minimum (“single”), etc.

Also see details in scikit-learn’s Agglomerative Clustering.

Values must be one of the following:

  • single
  • ward
  • complete
  • average
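
As a rough illustration of how the linkage criterion changes the distance between two clusters, the sketch below computes the "single", "complete" and "average" distances from the same pairwise cosine distances. The function names and toy vectors are hypothetical and only meant to show the idea. "ward" is left out because it merges clusters by minimizing within-cluster variance rather than by aggregating pairwise distances.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def cluster_distance(cluster_a, cluster_b, linkage="single"):
    """Distance between two clusters of embeddings (hypothetical helper)."""
    pairwise = [cosine_distance(u, v) for u in cluster_a for v in cluster_b]
    if linkage == "single":
        return min(pairwise)  # closest pair of members
    if linkage == "complete":
        return max(pairwise)  # farthest pair of members
    if linkage == "average":
        return sum(pairwise) / len(pairwise)  # mean over all pairs
    raise ValueError(f"unsupported linkage in this sketch: {linkage!r}")


# Toy clusters, invented for this example.
cluster_a = [(1.0, 0.0), (1.0, 1.0)]
cluster_b = [(0.0, 1.0)]

for name in ("single", "average", "complete"):
    print(name, round(cluster_distance(cluster_a, cluster_b, name), 4))
```

For any pair of clusters the "single" distance is never larger than the "average" distance, which in turn is never larger than the "complete" distance, so at a fixed distance_threshold "single" tends to merge the most categories and "complete" the fewest.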