Real-world text is full of abbreviations, typos and slang, so a single concept can often be written in many different ways. For example, asking people to describe their job role as free-form text in a questionnaire may produce dozens of spellings of the same role (e.g. “Analista Programador”, “analista / programador”, “anilist y programmador”). This step attempts to clean up data of this type by merging categories with sufficiently similar spelling.

Behind the scenes, the step uses chars2vec, a library that employs recurrent neural networks to calculate character-based word embeddings. That is, it transforms words into numeric vectors (embeddings) whose distance from one another reflects how similar the words’ spellings are: the closer two vectors, the more alike the spellings.

We use chars2vec embeddings here to first identify clusters of sufficiently similar categories (using agglomerative clustering). The output of the step is then simply a new column in which each original category has been replaced by the most common spelling in its cluster.
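The embed, cluster and replace idea can be sketched in plain Python. The snippet below is only an illustration and not the step's implementation. To stay dependency-free it swaps the chars2vec embeddings for simple character-bigram count vectors compared by cosine distance, and uses a naive single-linkage agglomerative clustering. The function names and the `threshold` value are hypothetical and chosen for this example only.

```python
import math
from collections import Counter


def embed(word, n=2):
    """Stand-in embedding (NOT chars2vec): counts of character n-grams."""
    text = f" {word.lower()} "
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine_distance(a, b):
    """1 - cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * \
        math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - dot / norm if norm else 1.0


def cluster(categories, threshold=0.5):
    """Naive single-linkage agglomerative clustering.

    Repeatedly merges the two closest clusters until no pair is closer
    than `threshold` (a hypothetical cutoff for this sketch).
    """
    vectors = {c: embed(c) for c in categories}
    clusters = [[c] for c in categories]
    while True:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(cosine_distance(vectors[a], vectors[b])
                        for a in clusters[i] for b in clusters[j])
                if d <= threshold and (best is None or d < best[0]):
                    best = (d, i, j)
        if best is None:
            return clusters
        _, i, j = best
        clusters[i].extend(clusters.pop(j))


def merge_similar_spellings(values, threshold=0.5):
    """Replace each value with the most common spelling in its cluster."""
    counts = Counter(values)
    mapping = {}
    for members in cluster(list(counts), threshold):
        canonical = max(members, key=lambda m: counts[m])
        for member in members:
            mapping[member] = canonical
    return [mapping[v] for v in values]


values = (["Analista Programador"] * 3
          + ["analista / programador", "anilist y programmador"]
          + ["Contable", "contable", "Contable"])
merged = merge_similar_spellings(values)
print(sorted(set(merged)))
```

With this toy input, the three spellings of the programmer role collapse into the most frequent one, and the two spellings of "Contable" are merged separately. A learned embedding such as chars2vec would be expected to handle typos more robustly than raw bigram counts, which is why the sketch should be read as a picture of the pipeline rather than of its quality.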

Usage

The following example shows how the step can be used in a recipe.

Inputs & Outputs

The following are the inputs expected by the step and the outputs it produces. These are generally columns (ds.first_name), datasets (ds or ds[["first_name", "last_name"]]) or models (referenced by name e.g. "churn-clf").

Configuration

The following parameters can be used to configure the behaviour of the step by including them in a JSON object as the last “input” to the step, i.e. step(..., {"param": "value", ...}) -> (output).