
Merge similar spellings

NLP · text · vectorize · chars2vec · model · clustering

Group categories with similar spellings.

Real-world texts are full of abbreviations, typos and slang, so a single concept can often be written in many different ways. For example, asking people to describe their job role in a free-form questionnaire field may result in dozens of different spellings of the same role (e.g. "Analista Programador", "analista / programador", "anilist y programmador" etc.). This step attempts to clean up data of this kind by merging categories with sufficiently similar spellings.

Behind the scenes the step uses chars2vec, a library that employs recurrent neural networks to calculate character-based word embeddings. That is, it transforms words into numeric vectors (embeddings) whose distances from one another indicate how similar the words' spellings are.

We use chars2vec embeddings here to first identify clusters of sufficiently similar categories (using agglomerative clustering). The output of the step is then simply a new column in which each original category has been replaced by the most common spelling in its cluster.
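The overall pipeline (embed, cluster, replace with the most common spelling) can be sketched as follows. The real step uses chars2vec embeddings and scikit-learn's AgglomerativeClustering; this stand-in substitutes difflib's character-level similarity and a naive complete-linkage merge loop so it runs with the standard library alone. All names and the threshold value are illustrative, not the step's actual internals.

```python
from collections import Counter
from difflib import SequenceMatcher

def merge_similar_spellings(labels, distance_threshold=0.35):
    """Toy stand-in: cluster labels by spelling distance, then map each
    label to the most common spelling in its cluster."""
    uniq = list(dict.fromkeys(labels))
    dist = lambda a, b: 1 - SequenceMatcher(None, a.lower(), b.lower()).ratio()

    # Naive complete-linkage agglomerative clustering over unique labels.
    clusters = [[u] for u in uniq]
    merged = True
    while merged:
        merged = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Complete linkage: the farthest pair must still be
                # within the threshold for the clusters to merge.
                d = max(dist(a, b) for a in clusters[i] for b in clusters[j])
                if d <= distance_threshold:
                    clusters[i] += clusters.pop(j)
                    merged = True
                    break
            if merged:
                break

    # Replace every label with its cluster's most frequent spelling.
    counts = Counter(labels)
    mapping = {}
    for cluster in clusters:
        canonical = max(cluster, key=lambda lbl: counts[lbl])
        for lbl in cluster:
            mapping[lbl] = canonical
    return [mapping[lbl] for lbl in labels]

labels = ["Analista Programador", "analista programador",
          "analista programador", "data engineer"]
print(merge_similar_spellings(labels))
# → ['analista programador', 'analista programador',
#    'analista programador', 'data engineer']
```

Note that the most common spelling wins within each cluster, which is why the capitalized variant is normalized to the lowercase one here.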

Example

The following configuration applies the algorithm with the default values:

merge_similar_spellings(ds.categories) -> (ds.merged_categories)

Usage

The following are the step's expected inputs and outputs and their specific types.

merge_similar_spellings(col: category|text|list[category]|list[text], {"param": value}) -> (categories: list[category])

where the object {"param": value} is optional in most cases and, if present, may contain any of the parameters described in the corresponding section below.

Inputs


col: column:category|text|list[category]|list[text]

Column with categories to merge.

Outputs


categories: column:list[category]

Column containing merged categories.

Parameters


min_length: integer = 4

Only words/categories with more characters than this will potentially be merged. You may not want to merge short words even if they are very similar (e.g. "cat" and "cut"). Set this to 1 to consider all categories for merging.

Range: 1 ≤ min_length < inf


max_length: integer | null = 50

Only words/categories with fewer characters than this will potentially be merged. Set this to null to include all categories in the algorithm.

Range: 1 ≤ max_length < inf
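The two length filters above can be read as a simple eligibility check. The sketch below is hypothetical: the exact boundary semantics (strict vs. inclusive comparison) are an assumption based on the wording "more/fewer characters than this".

```python
def eligible_for_merge(label, min_length=4, max_length=50):
    """Hypothetical length filter mirroring min_length / max_length.
    Only labels strictly longer than min_length and strictly shorter
    than max_length are considered for merging (boundary semantics
    assumed, not confirmed by the docs)."""
    n = len(label)
    if n <= min_length:
        return False
    if max_length is not None and n >= max_length:
        return False
    return True

print(eligible_for_merge("cat"))          # short word, never merged
print(eligible_for_merge("programador"))  # eligible
```

Passing max_length=None mirrors setting the parameter to null, which includes arbitrarily long categories.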


alpha_numeric: boolean = False

Whether to remove or convert all non-alphanumeric characters in categories before attempting to merge.
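One plausible reading of "remove/convert" is to replace runs of non-alphanumeric characters with a single space, as sketched below; the step's actual normalization may differ.

```python
import re

def to_alphanumeric(label):
    """Illustrative normalization for alpha_numeric=True: replace each
    run of non-alphanumeric characters with a single space and trim."""
    return re.sub(r"[^0-9a-zA-Z]+", " ", label).strip()

print(to_alphanumeric("analista / programador"))  # → analista programador
```

Under this normalization, "analista / programador" and "analista programador" become identical before any embedding is computed.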


numeric_threshold: number = 0.4

Any category label with a greater proportion of numeric digits than this will be excluded from merging. Set to 1 to include all categories in the algorithm.

Range: 0 ≤ numeric_threshold ≤ 1
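The digit-proportion check can be expressed as below. The function names are illustrative, and whether the comparison is strict is an assumption based on "greater proportion than this".

```python
def numeric_proportion(label):
    """Fraction of the label's characters that are decimal digits."""
    return sum(ch.isdigit() for ch in label) / len(label) if label else 0.0

def excluded_as_numeric(label, numeric_threshold=0.4):
    """A label is excluded from merging when its digit proportion
    exceeds the threshold (strictness assumed)."""
    return numeric_proportion(label) > numeric_threshold

print(excluded_as_numeric("A380"))    # 3/4 digits → excluded
print(excluded_as_numeric("room 5"))  # 1/6 digits → kept
```

This keeps mostly-numeric codes (serial numbers, IDs) from being merged with each other on the basis of shared digits.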


separator: string | null

Split input texts into words at this character.


ignore_substr: string | null

A string or regular expression identifying parts of text to be ignored when deciding which categories to merge.
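For example, one might strip parenthesized qualifiers before comparing labels. The pattern below is purely hypothetical (the parameter has no default), showing how an ignore_substr regex could be applied before similarity is computed.

```python
import re

def strip_ignored(label, ignore_substr=r"\(.*?\)"):
    """Remove parts of the label matching ignore_substr before
    comparison. The pattern shown (parenthesized qualifiers) is an
    example, not the step's default."""
    return re.sub(ignore_substr, "", label).strip()

print(strip_ignored("Analista (senior)"))  # → Analista
```

With this pattern, "Analista (senior)" and "Analista (junior)" would be compared as the same string and hence merged.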


penalty: boolean = False

Whether to penalize similarities depending on the length (number of characters) of category labels. The longer a category label, the less influence individual characters (and therefore small spelling differences) have when comparing categories. This may lead to longer category labels being merged when they shouldn't be. Setting "penalty": true makes this less likely.
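The exact penalty used by the step is not documented here. Purely as an illustration of the idea, one could inflate the raw embedding distance with the labels' length, so that long labels must be proportionally more similar before they fall under the merge threshold; the formula below is invented for this sketch.

```python
import math

def penalized_distance(raw_distance, label_a, label_b, penalty=True):
    """Hypothetical length penalty: scale the raw embedding distance
    by a factor that grows with the mean label length, making long
    labels harder to merge. NOT the step's actual formula."""
    if not penalty:
        return raw_distance
    mean_len = (len(label_a) + len(label_b)) / 2
    return raw_distance * (1 + math.log(max(mean_len, 1)) / 10)

short = penalized_distance(1.0, "ab", "cd")
long_ = penalized_distance(1.0, "a" * 40, "b" * 40)
print(long_ > short)  # longer labels get a larger penalized distance
```

Whatever the concrete formula, the effect is the same: with the penalty enabled, a pair of long labels needs a smaller raw distance to end up in the same cluster.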


distance_threshold: number = 1

Maximum distance between groups of categories (embeddings) to be merged into the same cluster. Also see parameters in scikit-learn's Agglomerative Clustering.

Range: 0 ≤ distance_threshold < inf


linkage: string = "complete"

Which linkage criterion to use in the clustering, i.e. how the distance between clusters of category embeddings is measured when deciding whether or not to merge them.

Also see parameters in scikit-learn's Agglomerative Clustering.

Must be one of: "single", "ward", "complete", "average"


affinity: string = "euclidean"

Metric used to compute the distance between category embeddings in the clustering. Also see parameters in scikit-learn's Agglomerative Clustering.

Must be one of: "euclidean", "l1", "l2", "manhattan", "cosine"
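The non-ward linkage criteria have simple definitions over pairwise distances, sketched below with a Euclidean metric ("ward" requires variance bookkeeping and is omitted; in the actual step these computations happen inside scikit-learn's AgglomerativeClustering, not in user code).

```python
import math

def cluster_distance(cluster_a, cluster_b, linkage="complete"):
    """Distance between two clusters of embedding vectors under the
    single / complete / average linkage criteria, using the default
    Euclidean metric."""
    pairs = [math.dist(a, b) for a in cluster_a for b in cluster_b]
    if linkage == "single":
        return min(pairs)              # closest pair of embeddings
    if linkage == "complete":
        return max(pairs)              # farthest pair of embeddings
    if linkage == "average":
        return sum(pairs) / len(pairs)  # mean over all pairs
    raise ValueError(f"unsupported linkage: {linkage}")

A = [(0.0, 0.0), (1.0, 0.0)]
B = [(2.0, 0.0), (4.0, 0.0)]
print(cluster_distance(A, B, "single"))    # → 1.0
print(cluster_distance(A, B, "complete"))  # → 4.0
```

"complete" (the default) is the most conservative choice: every pair of categories across the two clusters must be within distance_threshold, which keeps loosely related spellings from chaining into one large cluster.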