merge_similar_spellings

Real-world texts are full of abbreviations, typos and slang, so a single concept can often be written in many different ways. For example, asking people to indicate their job role as free-form text in a questionnaire may produce tens of different spellings of the same role (e.g. “Analista Programador”, “analista / programador”, “anilist y programmador”). This step attempts to clean up data of this type by merging categories with sufficiently similar spelling.

Behind the scenes the step uses Chars2vec, a library that employs recurrent neural networks to calculate character-based word embeddings. That is, it transforms words into numeric vectors (embeddings) whose distance from one another indicates how similar the words’ spellings are. The step first clusters these embeddings with Agglomerative Clustering to identify groups of categories with sufficiently similar spelling. The output of the step is then simply a new column in which each original category has been replaced by the most common spelling in its cluster.
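The cluster-then-replace idea described above can be sketched in plain Python. This is only an illustration under stated assumptions, not the step's implementation: character-bigram counts stand in for chars2vec embeddings, the clustering is a naive complete-linkage loop rather than scikit-learn's, and the names `embed`, `cluster`, `merge_spellings` and the threshold value are hypothetical.

```python
from collections import Counter
from itertools import combinations
import math


def embed(word):
    """Stand-in embedding: character-bigram counts (NOT chars2vec)."""
    padded = f" {word.lower()} "
    return Counter(padded[i:i + 2] for i in range(len(padded) - 1))


def distance(a, b):
    """Euclidean distance between two sparse bigram-count vectors."""
    keys = set(a) | set(b)
    return math.sqrt(sum((a[k] - b[k]) ** 2 for k in keys))


def cluster(words, threshold):
    """Naive complete-linkage agglomerative clustering of unique words."""
    vectors = {w: embed(w) for w in words}
    clusters = [[w] for w in vectors]

    def gap(c1, c2):
        # Complete linkage: distance between the two farthest members.
        return max(distance(vectors[x], vectors[y]) for x in c1 for y in c2)

    while len(clusters) > 1:
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda p: gap(clusters[p[0]], clusters[p[1]]))
        if gap(clusters[i], clusters[j]) > threshold:
            break  # the closest pair is already too far apart
        merged = clusters.pop(j)  # j > i, so index i stays valid
        clusters[i].extend(merged)
    return clusters


def merge_spellings(values, threshold=2.5):
    """Replace each value by the most common spelling in its cluster."""
    counts = Counter(values)
    mapping = {}
    for group in cluster(list(counts), threshold):
        canonical = max(group, key=lambda w: counts[w])
        mapping.update({w: canonical for w in group})
    return [mapping[v] for v in values]


roles = ["programmer", "programmer", "programer",
         "analyst", "analyst", "analist", "programmer"]
merged_roles = merge_spellings(roles)
```

With real chars2vec embeddings the distances, and therefore a sensible threshold, would differ. The sketch only shows the order of operations: embed, cluster, then map each label to the most frequent spelling in its cluster.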

Usage

The following example shows how the step can be used in a recipe.

Examples

Inputs & Outputs

The following are the inputs expected by the step and the outputs it produces. These are generally columns (ds.first_name), datasets (ds or ds[["first_name", "last_name"]]) or models (referenced by name e.g. "churn-clf").

Inputs

Outputs

Configuration

The following parameters can be used to configure the behaviour of the step by including them in a json object as the last “input” to the step, i.e. step(..., {"param": "value", ...}) -> (output).
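As a small illustration of the configuration object, the Python snippet below builds a set of parameters and serialises it to JSON. The parameter names are taken from this page; the values, the column names `ds.job_role` and `ds.job_role_merged`, and the exact shape of the rendered recipe line are assumptions made for the example.

```python
import json

# Parameter names come from this page; the values are illustrative only.
params = {
    "min_length": 5,
    "max_length": None,   # serialised as JSON null
    "penalty": True,      # serialised as JSON true
    "linkage": "average",
}

config = json.dumps(params)

# Hypothetical column names, following the step(..., {...}) -> (output) form.
recipe_line = (
    f"merge_similar_spellings(ds.job_role, {config}) -> (ds.job_role_merged)"
)
print(recipe_line)
```

Note that JSON uses lowercase `true`, `false` and `null`, which is why `max_length` can be given as `null` rather than a Python-style `None`.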

Parameters

min_length

integer

default:"4"

Only words/categories with more characters than this will potentially be merged. You may not want to merge short words, even if they’re very similar (e.g. cat and cut). Set this to 1 to potentially merge all categories.

Values must be in the following range:

1 ≤ min_length < inf

max_length

[integer, null]

default:"50"

Only words/categories with fewer characters than this will potentially be merged. Set this to null to include all categories in the algorithm.

Values must be in the following range:

1 ≤ max_length < inf
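The two length parameters can be read as a simple eligibility filter applied before clustering. The sketch below is an assumption about how such a filter might look, not the step's code: the function name is hypothetical, and the bounds are treated as exclusive because the descriptions say "more characters than" and "fewer characters than". Whether the actual step uses inclusive or exclusive bounds is not stated here.

```python
def eligible_by_length(label, min_length=4, max_length=50):
    """Return True if a label's length allows it to be considered for merging.

    Defaults mirror the documented ones; exclusive bounds are an assumption.
    """
    n = len(label)
    if n <= min_length:
        return False
    # max_length=None (JSON null) disables the upper bound.
    if max_length is not None and n >= max_length:
        return False
    return True


labels = ["cat", "cut", "analyst", "x" * 60]
candidates = [label for label in labels if eligible_by_length(label)]
```

Labels that fail the filter would simply keep their original spelling in the output column.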

alpha_numeric

boolean

default:"false"

Whether or not to remove/convert all non-alphanumeric characters from categories before attempting to merge.

numeric_threshold

number

default:"0.4"

Any category label with a greater proportion of numeric digits than this will be excluded from merging. Set to 1 to include all categories in the algorithm.

Values must be in the following range:

0 ≤ numeric_threshold ≤ 1
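To make the "proportion of numeric digits" concrete, here is a minimal Python sketch. The helper names are hypothetical, and counting digits with `str.isdigit` over all characters of the label (including spaces and punctuation) is an assumption about how the proportion is computed.

```python
def digit_ratio(label):
    """Fraction of characters in the label that are digits."""
    if not label:
        return 0.0
    return sum(ch.isdigit() for ch in label) / len(label)


def eligible_by_digits(label, numeric_threshold=0.4):
    """Labels with a digit ratio above the threshold are excluded."""
    return digit_ratio(label) <= numeric_threshold


# "room101" has 3 digits out of 7 characters (about 0.43), so it is
# excluded at the default threshold of 0.4, while "analyst2" (1 of 8) is kept.
kept = [s for s in ["room101", "analyst2"] if eligible_by_digits(s)]
```

With `numeric_threshold` set to 1 no label can exceed the threshold, which matches the note that this value includes all categories.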

separator

[string, null]

Split input texts into words at this character.

ignore_substr

[string, null]

A string or regular expression identifying parts of text to be ignored when deciding which categories to merge.
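One way to picture `ignore_substr` is as a pattern removed from each label before labels are compared. The Python sketch below assumes exactly that; the function name, the use of `re.sub`, and the example patterns are illustrative and not taken from the step's implementation.

```python
import re


def comparison_key(label, ignore_substr=None):
    """Return the text used for comparison, with ignored parts removed.

    Assumption: matched parts are deleted and surrounding whitespace trimmed.
    """
    if ignore_substr is None:
        return label
    return re.sub(ignore_substr, "", label).strip()


# Ignore anything in parentheses, e.g. a location suffix.
key_a = comparison_key("Analyst (remote)", r"\s*\(.*?\)")
# A plain string also works, since it is a valid regular expression.
key_b = comparison_key("Senior Analyst", "Senior")
```

Under this reading, "Analyst (remote)" and "Senior Analyst" would both be compared as "Analyst", while the output column would still be built from the original labels.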

penalty

boolean

default:"false"

Whether to penalize similarities depending on the length (number of characters) of category labels. The longer a category label, the less influence individual characters (and therefore small changes in spelling) have when comparing categories. As a result, longer category labels may be merged when they shouldn’t be. Setting "penalty": true makes this less likely.

distance_threshold

number

default:"1"

Maximum distance between groups of categories (embeddings) to be merged into the same cluster. Also see parameters in scikit-learn’s Agglomerative Clustering.

Values must be in the following range:

0 ≤ distance_threshold < inf

linkage

string

default:"complete"

Which linkage criterion to use in the clustering, i.e. how the distance between clusters of category embeddings is measured when deciding whether or not to merge them. Also see parameters in scikit-learn’s Agglomerative Clustering.

Values must be one of the following:

single
ward
complete
average
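The difference between the pairwise linkage criteria can be sketched in a few lines of Python. This is a simplified illustration, not scikit-learn's implementation: clusters are plain lists of 2-D points, the function names are hypothetical, and `ward` is left out because it is based on the increase in within-cluster variance rather than on a summary of pairwise distances.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def linkage_distance(cluster_a, cluster_b, linkage="complete"):
    """Distance between two clusters under a pairwise linkage criterion."""
    pairwise = [euclidean(p, q) for p in cluster_a for q in cluster_b]
    if linkage == "single":
        return min(pairwise)                  # closest pair of members
    if linkage == "complete":
        return max(pairwise)                  # farthest pair of members
    if linkage == "average":
        return sum(pairwise) / len(pairwise)  # mean over all pairs
    raise ValueError(f"unsupported linkage in this sketch: {linkage}")


a = [(0.0, 0.0), (1.0, 0.0)]
b = [(3.0, 0.0), (5.0, 0.0)]
distances = {name: linkage_distance(a, b, name)
             for name in ("single", "complete", "average")}
```

Two clusters are merged only while this distance stays within `distance_threshold`, so `complete` is the most conservative of the three and `single` the most permissive.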

affinity

string

default:"euclidean"

Metric used to compute the distance between category embeddings in the clustering. Also see parameters in scikit-learn’s Agglomerative Clustering.Values must be one of the following:

euclidean
l1
l2
manhattan
cosine
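For reference, the listed metrics can be written out in plain Python as below. This is a sketch for intuition, not the code the step runs. In scikit-learn's metric names, `l2` is the same as `euclidean` and `l1` the same as `manhattan`, so the five options correspond to three distinct distance functions. The `METRICS` lookup table is a hypothetical helper.

```python
import math


def euclidean(p, q):
    """Straight-line (L2) distance."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def manhattan(p, q):
    """Sum of absolute coordinate differences (L1 distance)."""
    return sum(abs(a - b) for a, b in zip(p, q))


def cosine(p, q):
    """Cosine distance: 1 minus the cosine of the angle between vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    return 1.0 - dot / (norm_p * norm_q)


# Hypothetical lookup table mapping the option names to functions.
METRICS = {
    "euclidean": euclidean,
    "l2": euclidean,
    "manhattan": manhattan,
    "l1": manhattan,
    "cosine": cosine,
}

d_euclidean = METRICS["euclidean"]((0.0, 0.0), (3.0, 4.0))
d_manhattan = METRICS["l1"]((0.0, 0.0), (3.0, 4.0))
```

Cosine distance ignores vector length and compares direction only, so a given `distance_threshold` means something different under each metric. Also note that scikit-learn's Agglomerative Clustering accepts `ward` linkage only together with the euclidean metric.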
