
Merge similar semantics

NLP · text · word2vec · GloVe · vectorize · model · clustering

Group categories with similar meanings.

This step calculates an embedding for each category using the GloVe vectors provided by spaCy's models. Because similar words have similar embeddings, the embeddings are used to cluster the categories, producing new categories that group the original ones.
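
To illustrate why embeddings can be used to find similar categories, here is a minimal sketch in plain Python. The three-dimensional vectors are hypothetical stand-ins invented for this example (real GloVe vectors have hundreds of dimensions), and cosine_distance is a helper defined here, not part of the step's API.

```python
import math

def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

# Hypothetical toy "embeddings", chosen so that the first two point
# in nearly the same direction and the third points elsewhere.
toy_embeddings = {
    "car": [0.9, 0.1, 0.0],
    "automobile": [0.85, 0.15, 0.05],
    "banana": [0.0, 0.2, 0.95],
}

d_similar = cosine_distance(toy_embeddings["car"], toy_embeddings["automobile"])
d_different = cosine_distance(toy_embeddings["car"], toy_embeddings["banana"])

print(f"car vs automobile: {d_similar:.3f}")
print(f"car vs banana:     {d_different:.3f}")
```

With these toy vectors, the first pair is much closer than the second, so a clustering step working on such distances would be expected to group "car" with "automobile" and keep "banana" separate.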

Example

The following configuration applies the algorithm with the default values:

merge_similar_semantics(ds.categories, ds.lang) -> (ds.new_categories)

Usage

The following are the step's expected inputs and outputs and their specific types.

merge_similar_semantics(
    col: category|text|list[category],
    language: category,
    {
        "param": value
    }
) -> (categories: column)

where the object {"param": value} is optional in most cases and, if present, may contain any of the parameters described in the Parameters section below.

Inputs


col: column:category|text|list[category]

Column with categories to merge.


language: column:category

Outputs


categories: column

Column containing merged categories.

Parameters


distance_threshold: number = 0.1

The categories are clustered hierarchically, and clusters that are closer to each other than this distance are merged into one. See also the parameters of scikit-learn's AgglomerativeClustering.

Range: 0 ≤ distance_threshold ≤ 1
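
The effect of the threshold can be sketched with a small agglomerative loop in plain Python. This is an illustration under stated assumptions, not the step's implementation: the agglomerate function, the distance matrix and the labels are all hypothetical, single linkage is hard-coded, and merging stops once the closest pair of clusters is at or above the threshold (the behaviour scikit-learn documents for its distance_threshold).

```python
def agglomerate(dist, distance_threshold):
    """Merge clusters until the closest pair is at or above the threshold.

    dist is a symmetric matrix of pairwise distances. Single linkage is
    used: the distance between two clusters is the minimum distance
    between any of their members.
    """
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > 1:
        best = None  # (distance, index_a, index_b) of the closest pair
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(dist[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        if best[0] >= distance_threshold:
            break  # nothing left that is close enough to merge
        _, a, b = best
        clusters[a] = sorted(clusters[a] + clusters[b])
        del clusters[b]
    return clusters

# Hypothetical cosine distances between four categories.
labels = ["car", "automobile", "truck", "banana"]
dist = [
    [0.00, 0.05, 0.20, 0.90],
    [0.05, 0.00, 0.25, 0.95],
    [0.20, 0.25, 0.00, 0.85],
    [0.90, 0.95, 0.85, 0.00],
]

for threshold in (0.1, 0.3):
    groups = [[labels[i] for i in c] for c in agglomerate(dist, threshold)]
    print(threshold, groups)
```

In this toy data, a threshold of 0.1 merges only "car" and "automobile", while 0.3 also pulls in "truck". Larger thresholds give fewer, broader merged categories.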


linkage: string = "single"

Which linkage criterion to use in the clustering. The distance metric between individual category embeddings is always the cosine distance; this parameter determines how the distance between two clusters of embeddings is calculated. For example, "complete" uses the maximum distance between categories in the two clusters, and "single" uses the minimum.

See also the parameters of scikit-learn's AgglomerativeClustering.

Must be one of: "single", "ward", "complete", "average"
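
The difference between the pairwise linkage criteria can be sketched as follows. The cluster_distance helper and the distance matrix are hypothetical and only illustrate the definitions of "single", "complete" and "average". "ward" is omitted because it is defined through the increase in within-cluster variance rather than directly through pairwise distances.

```python
def cluster_distance(dist, cluster_a, cluster_b, linkage):
    """Distance between two clusters of item indices under a linkage criterion."""
    pair_distances = [dist[i][j] for i in cluster_a for j in cluster_b]
    if linkage == "single":
        return min(pair_distances)  # closest pair of members
    if linkage == "complete":
        return max(pair_distances)  # farthest pair of members
    if linkage == "average":
        return sum(pair_distances) / len(pair_distances)  # mean over all pairs
    raise ValueError(f"unsupported linkage in this sketch: {linkage!r}")

# Hypothetical cosine distances between: car, automobile, truck, van.
dist = [
    [0.00, 0.05, 0.20, 0.30],
    [0.05, 0.00, 0.25, 0.35],
    [0.20, 0.25, 0.00, 0.10],
    [0.30, 0.35, 0.10, 0.00],
]

cars = [0, 1]    # {car, automobile}
trucks = [2, 3]  # {truck, van}

for linkage in ("single", "average", "complete"):
    d = cluster_distance(dist, cars, trucks, linkage)
    print(f"{linkage:>8}: {d:.3f}")
```

For the same two clusters, "single" always yields the smallest value and "complete" the largest, so with a fixed distance_threshold "single" tends to merge more readily than "complete".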