Real-world texts are full of abbreviations, typos and slang, so a single concept can often be written in many different ways. For example, asking people to use free-form text to indicate their job role in a questionnaire may result in tens of different ways to describe the same role (e.g., “Analista Programador”, “analista / programador”, “anilist y programmador” etc.). This step attempts to clean up data of this type by merging categories with sufficiently similar spelling.

Behind the scenes the step uses chars2vec, a library that employs recurrent neural networks to calculate character-based word embeddings. That is, it transforms words into numeric vectors (embeddings) whose distance from one another indicates how similar the words’ spellings are.

We use chars2vec embeddings here to first identify clusters of sufficiently similar categories (using Agglomerative Clustering). The output of the step is then simply a new column in which each original category has been replaced by the most common spelling in its cluster.
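The final replacement step can be sketched as follows. This is a minimal illustration, not the step's actual implementation: the function name and the `cluster_of` mapping are hypothetical, and the cluster ids are assumed to have already been produced by clustering the embeddings.

```python
from collections import Counter, defaultdict

def replace_with_most_common(values, cluster_of):
    """Map every value to the most frequent spelling within its cluster.

    `cluster_of` is a hypothetical dict {category: cluster id}, standing in
    for the result of clustering the category embeddings.
    """
    counts = Counter(values)
    members = defaultdict(list)
    for category, cluster_id in cluster_of.items():
        members[cluster_id].append(category)
    # Representative of each cluster: the spelling seen most often in the data.
    representative = {
        cid: max(cats, key=lambda c: counts[c]) for cid, cats in members.items()
    }
    # Values without a cluster (e.g. excluded from merging) are left unchanged.
    return [representative[cluster_of[v]] if v in cluster_of else v for v in values]

column = [
    "Analista Programador",
    "analista / programador",
    "Analista Programador",
    "Designer",
]
clusters = {"Analista Programador": 0, "analista / programador": 0, "Designer": 1}
merged = replace_with_most_common(column, clusters)
```

Here both spellings in cluster 0 end up as "Analista Programador", because that spelling occurs most often in the column.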

min_length
integer
default: "4"

Only words/categories with more characters than this will potentially be merged. You may not want to merge short words, even if they’re very similar (e.g. cat and cut). Set this to 1 to potentially merge all categories.

Values must be in the following range:

1 ≤ min_length < inf

max_length
[integer, null]
default: "50"

Only words/categories with fewer characters than this will potentially be merged. Set this to null to include all categories in the algorithm.

Values must be in the following range:

1 ≤ max_length < inf

alpha_numeric
boolean

Whether or not to remove/convert all non-alphanumeric characters from categories before attempting to merge.
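One possible reading of this option is sketched below. The function name is hypothetical, and replacing each run of non-alphanumeric characters with a single space is an assumption; the step may remove or convert such characters differently.

```python
import re

def keep_alphanumeric(label):
    # [\W_]+ matches runs of characters that are neither letters nor digits
    # (\W alone would keep underscores, so "_" is listed explicitly).
    cleaned = re.sub(r"[\W_]+", " ", label)
    return cleaned.strip()

cleaned_label = keep_alphanumeric("analista / programador")
```

With this kind of normalization, "analista / programador" and "analista programador" become identical before any embeddings are compared.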

numeric_threshold
number
default: "0.4"

Any category label with a greater proportion of numeric digits than this will be excluded from merging. Set to 1 to include all categories in the algorithm.

Values must be in the following range:

0 ≤ numeric_threshold ≤ 1
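
Taken together, min_length, max_length and numeric_threshold decide which categories are considered for merging at all. The sketch below follows the wording of the descriptions above literally (strictly more than min_length, strictly fewer than max_length, excluded when the digit proportion is greater than numeric_threshold). The function name is hypothetical and the exact handling of boundary values is an assumption.

```python
def is_merge_candidate(label, min_length=4, max_length=50, numeric_threshold=0.4):
    n = len(label)
    # Too short: must have more characters than min_length.
    if n == 0 or n <= min_length:
        return False
    # Too long: must have fewer characters than max_length (None means no limit).
    if max_length is not None and n >= max_length:
        return False
    # Mostly numeric labels (codes, IDs) are excluded.
    digit_ratio = sum(ch.isdigit() for ch in label) / n
    return digit_ratio <= numeric_threshold

checks = {
    "cat": is_merge_candidate("cat"),                      # too short
    "Analista Programador": is_merge_candidate("Analista Programador"),
    "A1234": is_merge_candidate("A1234"),                  # 80% digits
}
```
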
separator
[string, null]

Split input texts into words at this character.

ignore_substr
[string, null]

A string or regular expression identifying parts of text to be ignored when deciding which categories to merge.
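As an illustration, the pattern could be applied with Python's `re` module before categories are compared. This is only a sketch under that assumption: the function name is hypothetical, and a plain string is treated as a regular expression here, so special characters would need escaping.

```python
import re

def strip_ignored(label, ignore_substr=None):
    # With no pattern configured, the label is compared as-is.
    if ignore_substr is None:
        return label
    # Drop every match of the pattern, then trim leftover whitespace.
    return re.sub(ignore_substr, "", label).strip()

# Ignore anything in parentheses when comparing job titles.
compared = strip_ignored("Analista Programador (Jr.)", r"\(.*?\)")
```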

penalty
boolean

Whether to penalize similarities depending on the length (number of characters) of category labels. The longer a category label, the less influence individual characters (and therefore small changes in spelling) have when comparing categories. This may lead to longer category labels being merged when they shouldn’t be. Setting "penalty": true makes this less likely.

distance_threshold
number
default: "1"

Maximum distance between groups of categories (embeddings) to be merged into the same cluster. Also see parameters in scikit-learn’s Agglomerative Clustering.

Values must be in the following range:

0 ≤ distance_threshold < inf

linkage
string
default: "complete"

Which linkage criterion to use in the clustering, i.e. how the distance between two clusters of category embeddings is measured when deciding whether or not to merge them.

Also see parameters in scikit-learn’s Agglomerative Clustering.

Values must be one of the following:

  • single
  • ward
  • complete
  • average
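
To make the interplay of distance_threshold and linkage more concrete, here is a small pure-Python sketch of threshold-based agglomerative clustering. It is not the scikit-learn implementation the step relies on: the function names are hypothetical, only euclidean distance is used, ward linkage is left out, and the rule of stopping once the closest pair is at or above the threshold mirrors scikit-learn's documented behaviour.

```python
import math
from itertools import combinations

def cluster_distance(a, b, points, linkage):
    """Distance between clusters `a` and `b` (lists of indices into `points`)."""
    pairwise = [math.dist(points[i], points[j]) for i in a for j in b]
    if linkage == "single":      # closest pair of members
        return min(pairwise)
    if linkage == "complete":    # farthest pair of members
        return max(pairwise)
    if linkage == "average":     # mean over all pairs of members
        return sum(pairwise) / len(pairwise)
    raise ValueError(f"unsupported linkage in this sketch: {linkage}")

def agglomerate(points, distance_threshold=1.0, linkage="complete"):
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > 1:
        # Find the closest pair of clusters under the chosen linkage.
        (x, y), best = min(
            (((x, y), cluster_distance(clusters[x], clusters[y], points, linkage))
             for x, y in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        # Stop as soon as even the closest pair is too far apart.
        if best >= distance_threshold:
            break
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]  # safe because x < y
    return clusters

# Toy one-dimensional "embeddings"
points = [(0.0,), (0.5,), (1.2,), (5.0,)]
complete = agglomerate(points, distance_threshold=1.0, linkage="complete")
single = agglomerate(points, distance_threshold=1.0, linkage="single")
```

With these toy points, complete linkage keeps the third point separate (its farthest member of the first cluster is 1.2 away), while single linkage chains it into the first cluster (its closest member is 0.7 away). In general, single linkage merges more readily than complete linkage at the same threshold.
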
affinity
string
default: "euclidean"

Metric used to compute the distance between category embeddings in the clustering. Note that in scikit-learn the ward linkage only works with the euclidean metric. Also see parameters in scikit-learn’s Agglomerative Clustering.

Values must be one of the following:

  • euclidean
  • l1
  • l2
  • manhattan
  • cosine
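
For reference, the metrics listed above can be written out as follows. This is a plain-Python sketch for illustration only (the step itself delegates to scikit-learn). In scikit-learn, l1 is another name for manhattan and l2 another name for euclidean, so there are effectively three distinct choices.

```python
import math

def euclidean(u, v):  # also what "l2" refers to
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def manhattan(u, v):  # also what "l1" refers to
    return sum(abs(a - b) for a, b in zip(u, v))

def cosine(u, v):
    # Cosine distance = 1 - cosine similarity; it ignores vector length.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

u, v = (3.0, 0.0), (0.0, 4.0)
distances = {
    "euclidean": euclidean(u, v),
    "manhattan": manhattan(u, v),
    "cosine": cosine(u, v),
}
```

Because the metrics live on different scales, a distance_threshold that works well for euclidean distances will generally need to be adjusted when switching to manhattan or cosine.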