This transformation is useful when, for example, you have free-form text from a survey in which people refer to the same concept in many different ways. It helps clean and tidy your data by giving each concept a consistent representation.

Check out the merge_similar_spellings step for more information.

Parameters

  • Column: the column to search and group terms in
  • Strength threshold: a factor in the [0, 1] range that makes the algorithm more or less sensitive. A value of 1 will merge all occurrences, while a value closer to 0 will require a stronger correlation between the terms, making the merging much stricter.
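
To make the effect of the strength threshold more concrete, here is a minimal Python sketch. It is not the algorithm this step actually uses: the function name `merge_similar_terms`, the greedy grouping, and the use of `difflib.SequenceMatcher` as the similarity measure are all assumptions chosen for illustration. It only mirrors the behaviour described above, where a strength of 1 merges everything and a strength of 0 merges only terms that match exactly after basic normalisation.

```python
from difflib import SequenceMatcher


def merge_similar_terms(values, strength=0.3):
    """Replace each value with the first earlier value it resembles.

    Hypothetical helper for illustration only. `strength` is in [0, 1]:
    1 merges every term into the first one, 0 merges only terms that are
    identical after trimming whitespace and lowercasing.
    """
    if not 0 <= strength <= 1:
        raise ValueError("strength must be in the [0, 1] range")

    # Assumption: a higher strength lowers the similarity required to merge.
    min_similarity = 1 - strength

    representatives = []  # first spelling seen for each group
    merged = []
    for value in values:
        key = value.strip().lower()
        for rep in representatives:
            score = SequenceMatcher(None, key, rep.strip().lower()).ratio()
            if score >= min_similarity:
                merged.append(rep)  # reuse the existing representative
                break
        else:
            # No group was similar enough, so this value starts a new one.
            representatives.append(value)
            merged.append(value)
    return merged


answers = ["New York", "new york ", "New Yrok", "Boston"]
strict = merge_similar_terms(answers, strength=0)
loose = merge_similar_terms(answers, strength=0.3)
everything = merge_similar_terms(answers, strength=1)
```

In this sketch, `strict` keeps "New Yrok" as its own term, `loose` folds it into "New York" while leaving "Boston" alone, and `everything` collapses all four answers into the first one.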