Trim frequencies¶
Remove values whose frequency (count) is above or below a given threshold.
Affected categories are replaced with the missing value (NaN).
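The thresholding behaviour can be illustrated outside of the step itself. The following is a minimal plain-Python sketch, not the step's actual implementation: the helper name trim_by_threshold is hypothetical, None stands in for NaN, and it assumes that missing values in the input are left as missing and do not count towards any frequency.

```python
from collections import Counter


def trim_by_threshold(values, freq_min=None, freq_max=None):
    """Hypothetical helper: replace values that are too rare or too common with None."""
    # Count each non-missing value once over the whole column.
    counts = Counter(v for v in values if v is not None)
    trimmed = []
    for v in values:
        n = counts.get(v, 0)
        # A value is removed if its count is lower than freq_min ...
        too_rare = freq_min is not None and n < freq_min
        # ... or higher than freq_max.
        too_common = freq_max is not None and n > freq_max
        trimmed.append(None if v is None or too_rare or too_common else v)
    return trimmed


column = ["a", "a", "a", "b", "b", "c"]
print(trim_by_threshold(column, freq_min=2))  # "c" occurs once, so it becomes None
print(trim_by_threshold(column, freq_max=2))  # "a" occurs three times, so it becomes None
```

Under these assumptions the output has the same length as the input; only the set of distinct categories shrinks.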
Usage¶
The following are the step's expected inputs and outputs and their specific types.
trim_frequencies(input: category|list[category], {"param": value}) -> (output: column)
where the object {"param": value} is optional in most cases and, if present, may contain any of the parameters described in the corresponding section below.
Example¶
To remove categories occurring fewer than 2 times in the column cat_col:
trim_frequencies(ds.cat_col, {"freq_min": 2}) -> (ds.cat_trimmed)
More examples
To only keep the 10 most frequent categories in column cat_col:
trim_frequencies(ds.cat_col, {"n_most_common": 10}) -> (ds.cat_top10)
Inputs¶
input: column:category|list[category]
A categorical column to trim.
Outputs¶
output: column
A categorical column with fewer categories than the input.
Parameters¶
n_most_common: integer | null
The number N of most common values to keep (ranked by frequency in descending order). All other values are removed.
Range: 0 ≤ n_most_common < inf
freq_min: integer | null
Values with a lower frequency (count) than this will be removed.
Range: 1 ≤ freq_min < inf
freq_max: integer | null
Values with a higher frequency (count) than this will be removed.
Range: 1 ≤ freq_max < inf
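To make the n_most_common parameter concrete, here is a rough plain-Python sketch that mirrors the "keep the 10 most frequent categories" example above. It is an illustration only, not the step's implementation: the helper name keep_most_common is hypothetical, None stands in for NaN, and tie-breaking between equally frequent values follows Python's Counter (first seen wins), which may differ from what the step does.

```python
from collections import Counter


def keep_most_common(values, n_most_common):
    """Hypothetical helper: keep only the N most frequent values, others become None."""
    # Missing values are assumed not to count as a category of their own.
    counts = Counter(v for v in values if v is not None)
    # most_common(n) returns the n highest-count (value, count) pairs.
    keep = {value for value, _ in counts.most_common(n_most_common)}
    return [v if v in keep else None for v in values]


column = ["a", "a", "a", "b", "b", "c"]
print(keep_most_common(column, 2))  # keeps "a" and "b", replaces "c" with None
```

Note that with n_most_common set to 0, the lower bound of the documented range, this sketch keeps nothing and every value becomes missing.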