merge_similar_spellings
Group categories with similar spellings.
Real-world text is full of abbreviations, typos and slang, so a single concept can often be written in many different ways. For example, asking people to use free-form text to indicate their job role in a questionnaire may result in dozens of different ways to describe the same role (e.g. “Analista Programador”, “analista / programador”, “anilist y programmador” etc.). This step attempts to clean up data of this type by merging categories with sufficiently similar spelling.
Behind the scenes, the step uses chars2vec, a library that employs recurrent neural networks to calculate character-based word embeddings. That is, it transforms words into numeric vectors (embeddings) whose distance from one another indicates how similar the words are in spelling: the closer two vectors, the more alike the spellings.
The step first uses the chars2vec embeddings to group sufficiently similar categories into clusters (using agglomerative clustering). Its output is a new column in which each original category has been replaced by the most common spelling in its cluster.
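The embed, cluster and replace flow described above can be illustrated with a small, self-contained sketch. It is not the step's actual implementation. To run with the Python standard library only, it swaps the chars2vec embeddings for simple character-bigram count vectors and uses a hand-written average-linkage agglomerative clustering. The function names and the threshold value of 0.4 are hypothetical and chosen purely for illustration.

```python
from collections import Counter
from itertools import combinations
import math


def char_bigram_vector(text):
    """Stand-in embedding (assumption: NOT chars2vec): character-bigram counts."""
    cleaned = "".join(ch for ch in text.lower() if ch.isalnum())
    padded = f"^{cleaned}$"  # mark the start and end of the string
    return Counter(padded[i:i + 2] for i in range(len(padded) - 1))


def cosine_distance(a, b):
    """Cosine distance between two sparse count vectors (0 = same direction)."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def cluster_categories(categories, threshold=0.4):
    """Average-linkage agglomerative clustering with a distance cut-off.

    The threshold is a hypothetical value, not one taken from the step.
    """
    vectors = {c: char_bigram_vector(c) for c in categories}
    clusters = [[c] for c in categories]

    def linkage(x, y):
        total = sum(cosine_distance(vectors[a], vectors[b]) for a in x for b in y)
        return total / (len(x) * len(y))

    while len(clusters) > 1:
        # Find the closest pair of clusters; i < j always holds here.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        if linkage(clusters[i], clusters[j]) > threshold:
            break  # remaining clusters are too far apart to merge
        merged = clusters.pop(j)
        clusters[i].extend(merged)
    return clusters


def merge_similar_spellings(values, threshold=0.4):
    """Replace each value with the most common spelling in its cluster."""
    counts = Counter(values)
    mapping = {}
    for cluster in cluster_categories(list(counts), threshold):
        canonical = max(cluster, key=lambda c: counts[c])
        for category in cluster:
            mapping[category] = canonical
    return [mapping[v] for v in values]


roles = [
    "Analista Programador",
    "Analista Programador",
    "analista / programador",
    "anilist y programmador",
    "Contador",
    "Contador",
    "contador",
]
print(merge_similar_spellings(roles))
# → ['Analista Programador', 'Analista Programador', 'Analista Programador',
#    'Analista Programador', 'Contador', 'Contador', 'Contador']
```

In this sketch the cut-off decides how aggressively spellings are merged: a lower threshold keeps more categories separate, while a higher one risks merging distinct concepts. Learned embeddings such as those from chars2vec would be expected to capture spelling similarity more robustly than raw bigram counts.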