Usage
The following example shows how the step can be used in a recipe.
Examples
The following configuration applies the algorithm with the default values:
Inputs & Outputs
The following are the inputs expected by the step and the outputs it produces. These are generally columns (ds.first_name), datasets (ds or ds[["first_name", "last_name"]]) or models (referenced by name, e.g. "churn-clf").
Inputs
Column with categories to merge.
Outputs
Column containing merged categories.
Configuration
The following parameters can be used to configure the behaviour of the step by including them in a JSON object as the last “input” to the step, i.e. step(..., {"param": "value", ...}) -> (output).
Parameters
Only words/categories with more characters than this will potentially be merged.
You may not want to merge short words, even if they’re very similar (e.g. cat and cut).
Set this to 1 to potentially merge all categories.
Values must be in the following range:
Only words/categories with fewer characters than this will potentially be merged.
Set this to null to include all categories in the algorithm.
Values must be in the following range:
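The two length limits above can be pictured as a simple pre-filter on the category labels. The following is a minimal sketch, not the step's actual implementation: the names min_length and max_length are hypothetical, None stands in for JSON null, and strict comparisons are assumed because the descriptions say "more than" and "fewer than" (the real step may treat the bounds inclusively).

```python
from typing import Iterable, List, Optional


def eligible_by_length(
    categories: Iterable[str],
    min_length: int = 1,
    max_length: Optional[int] = None,
) -> List[str]:
    """Keep labels whose character count lies strictly between the bounds.

    min_length and max_length are hypothetical parameter names.
    max_length=None (JSON null) means no upper bound.
    """
    kept = []
    for label in categories:
        if len(label) <= min_length:
            continue  # too short to merge safely, e.g. "cat" vs "cut"
        if max_length is not None and len(label) >= max_length:
            continue  # too long, e.g. free-text answers
        kept.append(label)
    return kept


labels = ["cat", "cut", "London", "Londn", "a very long free-text answer"]
short_excluded = eligible_by_length(labels, min_length=3)
both_bounds = eligible_by_length(labels, min_length=3, max_length=10)
print(short_excluded)
print(both_bounds)
```

Labels that are filtered out here would simply pass through unchanged; only the remaining ones are candidates for merging.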
Whether or not to remove/convert all non-alphanumeric characters from categories before attempting to merge.
Any category label with a greater proportion of numeric digits than this will be excluded from merging.
Set to 1 to include all categories in the algorithm.
Values must be in the following range:
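As an illustration of the digit-proportion limit, the sketch below computes the share of numeric digits in each label and drops labels above the limit. The function and argument names (digit_proportion, max_digit_proportion) are made up for this example, and how the step counts characters internally is an assumption.

```python
from typing import Iterable, List


def digit_proportion(label: str) -> float:
    """Fraction of characters in the label that are numeric digits."""
    if not label:
        return 0.0
    return sum(ch.isdigit() for ch in label) / len(label)


def exclude_numeric_labels(
    categories: Iterable[str], max_digit_proportion: float = 1.0
) -> List[str]:
    """Keep labels whose digit proportion does not exceed the limit.

    max_digit_proportion is a hypothetical name for the parameter above.
    """
    return [c for c in categories if digit_proportion(c) <= max_digit_proportion]


labels = ["Room 101", "2023", "Madrid", "A1"]
kept = exclude_numeric_labels(labels, max_digit_proportion=0.4)
kept_all = exclude_numeric_labels(labels, max_digit_proportion=1.0)
print(kept)
print(kept_all)
```

With a limit of 1 no label can exceed the threshold, which matches the note that this setting includes all categories.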
Split input texts into words at this character.
A string or regular expression identifying parts of text to be ignored when deciding which categories to merge.
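One plausible way to picture the split character and the ignore pattern is as a preprocessing step applied before labels are compared. This is only a sketch under assumptions: the helper comparison_key and its arguments are hypothetical, the separator is treated as a literal string, and the step may handle ignored parts differently.

```python
import re
from typing import List, Optional


def comparison_key(
    text: str, ignore: Optional[str] = None, separator: str = " "
) -> List[str]:
    """Strip ignored parts, then split the label into words.

    `ignore` is a regular expression (a plain string works too, as long as
    it contains no regex metacharacters); `separator` is a literal string.
    """
    if ignore is not None:
        text = re.sub(ignore, "", text)
    # Drop empty pieces produced by repeated separators.
    return [word for word in text.split(separator) if word]


# Hypothetical pattern: ignore trailing company suffixes when comparing.
IGNORE = r"\s+(Inc\.|Ltd\.?)$"

print(comparison_key("Acme Inc.", ignore=IGNORE))
print(comparison_key("Acme Ltd", ignore=IGNORE))
print(comparison_key("New York City"))
```

In this sketch both "Acme Inc." and "Acme Ltd" reduce to the same word list, so the suffix no longer influences whether they are considered similar.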
Whether to penalize similarities depending on the length (number of characters) of category labels.
The longer a category label, the less influence individual characters (and therefore small changes in spelling)
will have when comparing categories. This may lead to longer category labels being merged when they shouldn’t.
Setting "penalty": true will make this less likely.
Maximum distance between groups of categories (embeddings) to be merged into the same cluster.
Also see parameters in scikit-learn’s Agglomerative Clustering.
Values must be in the following range:
Which linkage criterion to use in the clustering.
The distance measured between clusters of category embeddings to decide whether or not to merge them.
Also see parameters in scikit-learn’s Agglomerative Clustering.
Values must be one of the following:
single
ward
complete
average
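To make the difference between the linkage criteria concrete, the sketch below computes the distance between two small clusters of points under single, complete and average linkage, and compares each with a maximum merge distance. It is a plain-Python illustration using Euclidean distance, not the step's code; ward is left out because it is defined by the increase in within-cluster variance rather than by pairwise distances. The example points and the threshold of 3.5 are made up.

```python
import math
from itertools import product
from typing import Sequence, Tuple

Point = Tuple[float, ...]


def cluster_distance(
    a: Sequence[Point], b: Sequence[Point], linkage: str = "average"
) -> float:
    """Distance between two clusters under a pairwise linkage criterion."""
    pairwise = [math.dist(p, q) for p, q in product(a, b)]
    if linkage == "single":
        return min(pairwise)  # closest pair of members
    if linkage == "complete":
        return max(pairwise)  # farthest pair of members
    if linkage == "average":
        return sum(pairwise) / len(pairwise)  # mean over all pairs
    raise ValueError(f"unsupported linkage in this sketch: {linkage}")


left = [(0.0, 0.0), (0.0, 1.0)]
right = [(3.0, 0.0), (4.0, 0.0)]
threshold = 3.5  # illustrative maximum merge distance

for name in ("single", "average", "complete"):
    d = cluster_distance(left, right, name)
    print(name, round(d, 3), "merge" if d <= threshold else "keep apart")
```

For the same pair of clusters, single linkage always gives the smallest distance and complete linkage the largest, so the choice of criterion directly affects how readily groups of categories are merged at a given maximum distance.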
Metric used to compute the distance between category embeddings in the clustering.
Also see parameters in scikit-learn’s Agglomerative Clustering.
Values must be one of the following:
euclidean
l1
l2
manhattan
cosine
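The listed metrics reduce to three distinct distance functions, since in scikit-learn l2 is another name for euclidean and l1 another name for manhattan. The sketch below writes them out in plain Python for two example vectors; it is meant only to show what each option measures, and the vectors are arbitrary.

```python
import math
from typing import Callable, Dict, Sequence

Vector = Sequence[float]


def euclidean(u: Vector, v: Vector) -> float:
    """Straight-line distance (also known as l2)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def manhattan(u: Vector, v: Vector) -> float:
    """Sum of absolute coordinate differences (also known as l1)."""
    return sum(abs(a - b) for a, b in zip(u, v))


def cosine(u: Vector, v: Vector) -> float:
    """One minus the cosine similarity; ignores vector length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


METRICS: Dict[str, Callable[[Vector, Vector], float]] = {
    "euclidean": euclidean,
    "l2": euclidean,
    "manhattan": manhattan,
    "l1": manhattan,
    "cosine": cosine,
}

u, v, w = (1.0, 0.0), (0.0, 1.0), (2.0, 0.0)
for name, fn in METRICS.items():
    print(name, round(fn(u, v), 3), round(fn(u, w), 3))
```

Note that cosine distance treats u and w as identical because they point in the same direction, while the other metrics do not. In scikit-learn, ward linkage can only be combined with the euclidean metric.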