merge_similar_spellings
Group categories with similar spellings.
Real-world texts are full of abbreviations, typos and slang, so a single concept can often be written in many different ways. For example, asking people to describe their job role in free-form text in a questionnaire may produce dozens of spellings of the same role (e.g. “Analista Programador”, “analista / programador”, “anilist y programmador” etc.). This step attempts to clean up data of this type by merging categories with sufficiently similar spellings.
Behind the scenes the step uses Chars2vec, a library that employs recurrent neural networks to calculate character-based word embeddings. That is, it transforms words into numeric vectors (embeddings) whose distance from one another indicates how similar the words’ spellings are.
The step first uses the chars2vec embeddings to identify clusters of sufficiently similar categories (using Agglomerative Clustering). The output of the step is then simply a new column in which each original category has been replaced by the most common spelling in its cluster.
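The final replacement described above can be illustrated with a minimal Python sketch. It assumes the cluster assignments have already been computed; the function name `replace_with_most_common` and the example cluster ids are hypothetical and not part of the step itself.

```python
from collections import Counter, defaultdict

def replace_with_most_common(values, cluster_ids):
    """Map each value to the most frequent spelling within its cluster."""
    counts = defaultdict(Counter)
    for value, cid in zip(values, cluster_ids):
        counts[cid][value] += 1
    # For every cluster, pick the spelling that occurs most often
    canonical = {cid: c.most_common(1)[0][0] for cid, c in counts.items()}
    return [canonical[cid] for cid in cluster_ids]

# Hypothetical input: cluster ids would normally come from the clustering
roles = [
    "Analista Programador",
    "analista / programador",
    "Analista Programador",
    "Data Scientist",
]
clusters = [0, 0, 0, 1]
merged = replace_with_most_common(roles, clusters)
```

Here the three spellings in cluster 0 all collapse to the spelling that appears twice, while the single member of cluster 1 is left unchanged.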
Usage
The following example shows how the step can be used in a recipe.
The following configuration applies the algorithm with the default values:
Inputs & Outputs
The following are the inputs expected by the step and the outputs it produces. These are generally columns (ds.first_name), datasets (ds or ds[["first_name", "last_name"]]) or models (referenced by name, e.g. "churn-clf").
Configuration
The following parameters can be used to configure the behaviour of the step by including them in a JSON object as the last “input” to the step, i.e. step(..., {"param": "value", ...}) -> (output).
Only words/categories with more characters than this will potentially be merged. You may not want to merge short words, even if they’re very similar (e.g. cat and cut). Set this to 1 to potentially merge all categories.
Values must be in the following range:
Only words/categories with fewer characters than this will potentially be merged. Set this to null to include all categories in the algorithm.
Values must be in the following range:
Whether or not to remove/convert all non-alphanumeric characters from categories before attempting to merge.
Any category label with a greater proportion of numeric digits than this will be excluded from merging. Set to 1 to include all categories in the algorithm.
Values must be in the following range:
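The length and digit-proportion rules above amount to a filter that decides which categories take part in merging at all. The following Python sketch shows one way such a filter could look. The names `is_merge_candidate`, `min_length`, `max_length` and `max_digit_prop` are hypothetical, and whether the step treats the boundaries as inclusive or exclusive is an assumption based on the wording ("more than", "fewer than", "greater than").

```python
def digit_proportion(label):
    """Fraction of characters in `label` that are numeric digits."""
    if not label:
        return 0.0
    return sum(ch.isdigit() for ch in label) / len(label)

def is_merge_candidate(label, min_length, max_length=None, max_digit_prop=1.0):
    """Return True if `label` passes all three (assumed) eligibility rules."""
    n = len(label)
    if n <= min_length:  # needs MORE characters than min_length
        return False
    if max_length is not None and n >= max_length:  # needs FEWER than max_length
        return False
    # Excluded only if the digit proportion is GREATER than the limit
    return digit_proportion(label) <= max_digit_prop
```

With `max_digit_prop=1.0` no label can exceed the limit, which matches the note that setting the parameter to 1 includes all categories.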
Split input texts into words at this character.
A string or regular expression identifying parts of text to be ignored when deciding which categories to merge.
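The three text-preparation options above (ignoring parts of the text, splitting into words, and removing non-alphanumeric characters) could be combined roughly as in the sketch below. This is an illustration only: the function `normalise`, its parameter names and the order of operations are assumptions, not the step's documented behaviour.

```python
import re

def normalise(label, ignore=None, split_at=" ", strip_non_alnum=True):
    """Prepare a category label for comparison (hypothetical helper)."""
    text = label
    if ignore is not None:
        # Drop the parts of the text matching the ignore pattern
        text = re.sub(ignore, "", text)
    words = text.split(split_at)
    if strip_non_alnum:
        # Keep only letters and digits (accented letters are preserved)
        words = ["".join(ch for ch in w if ch.isalnum()) for w in words]
    # Discard words that ended up empty
    return [w for w in words if w]

plain = normalise("analista / programador")
without_suffix = normalise("Analista Programador (senior)", ignore=r"\(.*?\)")
```

In this sketch the slash in the first label disappears entirely, and the parenthesised suffix in the second label is ignored before the words are compared.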
Whether to penalize similarities depending on the length (number of characters) of category labels. The longer a category label, the less influence individual characters (and therefore small changes in spelling) will have when comparing categories. This may lead to longer category labels being merged when they shouldn’t. Setting "penalty": true will make this less likely.
Maximum distance between groups of categories (embeddings) to be merged into the same cluster. Also see parameters in scikit-learn’s Agglomerative Clustering.
Values must be in the following range:
Which linkage criterion to use in the clustering, i.e. how the distance between two clusters of category embeddings is measured when deciding whether or not to merge them. Also see parameters in scikit-learn’s Agglomerative Clustering.
Values must be one of the following:
single
ward
complete
average
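To make the difference between the linkage criteria concrete, the sketch below computes the distance between two small clusters under the single, complete and average criteria, using plain Euclidean distance. It is an illustration of the definitions only, not the step's implementation. The ward criterion is omitted because it is based on the increase in within-cluster variance rather than on pairwise distances. The function name `linkage_distance` and the example points are hypothetical.

```python
import math
from itertools import product

def linkage_distance(cluster_a, cluster_b, linkage="average"):
    """Distance between two clusters of points under a given linkage."""
    pair_dists = [math.dist(a, b) for a, b in product(cluster_a, cluster_b)]
    if linkage == "single":    # closest pair of points
        return min(pair_dists)
    if linkage == "complete":  # farthest pair of points
        return max(pair_dists)
    if linkage == "average":   # mean over all pairs
        return sum(pair_dists) / len(pair_dists)
    raise ValueError(f"unsupported linkage: {linkage}")

# Two hypothetical clusters of 2-D embeddings
a = [(0.0, 0.0), (1.0, 0.0)]
b = [(3.0, 0.0), (5.0, 0.0)]
single = linkage_distance(a, b, "single")      # 2.0
complete = linkage_distance(a, b, "complete")  # 5.0
average = linkage_distance(a, b, "average")    # 3.5
```

With a maximum merge distance of, say, 3, these two clusters would be merged under single linkage but not under complete or average linkage, which is why the two parameters should be tuned together.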
Metric used to compute the distance between category embeddings in the clustering. Also see parameters in scikit-learn’s Agglomerative Clustering.
Values must be one of the following:
euclidean
l1
l2
manhattan
cosine
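For reference, the listed metrics reduce to three distinct distance functions: in scikit-learn, l2 is another name for euclidean and l1 another name for manhattan. Note also that scikit-learn's ward linkage only accepts the euclidean metric. The following Python sketch spells the three functions out for plain sequences of numbers; the function names are chosen for illustration.

```python
import math

def euclidean(u, v):
    """Straight-line distance (also known as l2)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def manhattan(u, v):
    """Sum of absolute coordinate differences (also known as l1)."""
    return sum(abs(a - b) for a, b in zip(u, v))

def cosine(u, v):
    """One minus the cosine of the angle between u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)
```

Cosine distance ignores vector length and compares direction only, so it can behave quite differently from the other two on embeddings of varying magnitude.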