merge_similar_semantics

This step calculates an embedding for each category using the GloVe vectors provided by spaCy’s models. Because similar words have similar embeddings, the embeddings are used to cluster the categories, producing new categories that group the original ones.

Usage

The following example shows how the step can be used in a recipe.

Examples

The following configuration applies the algorithm with the default values:

merge_similar_semantics(ds.categories, ds.lang) -> (ds.new_categories)

General syntax for using the step in a recipe. It shows the inputs the step expects and the outputs it produces. For further details, see the sections below.

merge_similar_semantics(col: category|text|list[category], language: category, {
    "param": value,
    ...
}) -> (categories: column)

Inputs & Outputs

The following are the inputs expected by the step and the outputs it produces. These are generally columns (ds.first_name), datasets (ds or ds[["first_name", "last_name"]]) or models (referenced by name e.g. "churn-clf").

Inputs

Outputs

Configuration

The following parameters configure the behaviour of the step. Include them in a JSON object passed as the last “input” to the step, i.e. step(..., {"param": "value", ...}) -> (output).

Parameters

distance_threshold

number

default: 0.1

Determines which categories will be merged. After hierarchically clustering all categories, clusters of categories closer than this distance will be merged into one.

Also see details in scikit-learn’s Agglomerative Clustering.

Values must be in the following range:

0 ≤ distance_threshold ≤ 1
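
As a rough illustration of how the threshold affects merging, the following sketch groups a few categories by cosine distance under single linkage. It is not the step's actual implementation: the toy three-dimensional vectors stand in for real GloVe embeddings, and the function name merge_categories is hypothetical. With single linkage, merging everything closer than the threshold amounts to finding groups connected by chains of close pairs, which the sketch does with a small union-find.

```python
import math


def cosine_distance(u, v):
    """Cosine distance: 1 minus the cosine of the angle between u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def merge_categories(embeddings, distance_threshold=0.1):
    """Hypothetical helper: single-linkage merge of categories.

    Two categories end up in the same group when a chain of pairwise
    cosine distances below the threshold connects them.
    """
    names = list(embeddings)
    parent = {name: name for name in names}

    def find(name):
        # Follow parent links to the group's representative (path halving).
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if cosine_distance(embeddings[a], embeddings[b]) < distance_threshold:
                parent[find(b)] = find(a)

    groups = {}
    for name in names:
        groups.setdefault(find(name), []).append(name)
    return sorted(sorted(group) for group in groups.values())


# Toy vectors (assumed for illustration, not real GloVe embeddings).
toy = {
    "sofa": [1.0, 0.1, 0.0],
    "couch": [0.98, 0.15, 0.0],
    "armchair": [0.8, 0.5, 0.0],
    "banana": [0.0, 0.1, 1.0],
}

strict = merge_categories(toy, distance_threshold=0.01)
loose = merge_categories(toy, distance_threshold=0.2)
print(strict)
print(loose)
```

With the small threshold only "sofa" and "couch" are merged. With the larger one "armchair" joins them, while "banana" stays separate in both cases.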

linkage

string

default: "single"

Which linkage criterion to use in the clustering. The distance metric applied between individual category embeddings is always the cosine distance; this parameter determines how the distance between two clusters of embeddings is calculated, e.g. as the maximum distance between any two categories in the two clusters (“complete”), the minimum (“single”), etc.

Also see details in scikit-learn’s Agglomerative Clustering.

Values must be one of the following:

single
ward
complete
average
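
To make the difference between the criteria concrete, the sketch below computes the distance between two small clusters under "single", "complete" and "average" linkage. It is an illustration under assumptions, not the step's code: the two-dimensional vectors are made up and the helper cluster_distance is hypothetical. "ward" is left out because it is defined through the increase in within-cluster variance rather than through pairwise distances.

```python
import math
from itertools import product


def cosine_distance(u, v):
    """Cosine distance: 1 minus the cosine of the angle between u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def cluster_distance(cluster_a, cluster_b, linkage="single"):
    """Hypothetical helper: distance between two clusters of embeddings."""
    # All pairwise distances between a member of A and a member of B.
    pairwise = [cosine_distance(u, v) for u, v in product(cluster_a, cluster_b)]
    if linkage == "single":
        return min(pairwise)  # closest pair
    if linkage == "complete":
        return max(pairwise)  # farthest pair
    if linkage == "average":
        return sum(pairwise) / len(pairwise)  # mean over all pairs
    raise ValueError(f"unsupported linkage in this sketch: {linkage}")


# Made-up unit vectors standing in for category embeddings.
cluster_a = [[1.0, 0.0], [0.8, 0.6]]
cluster_b = [[0.0, 1.0], [0.6, 0.8]]

single = cluster_distance(cluster_a, cluster_b, "single")
complete = cluster_distance(cluster_a, cluster_b, "complete")
average = cluster_distance(cluster_a, cluster_b, "average")
print(single, complete, average)
```

For these vectors the single-linkage distance is about 0.04, the complete-linkage distance is 1.0 and the average is about 0.46, so the same distance_threshold can merge the two clusters under one criterion and keep them apart under another.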
