Trim frequencies

Remove values whose frequency (count) is above or below a given threshold.

Affected categories are replaced with the missing value (NaN).

Example

To remove categories occurring fewer than 2 times in the column cat_col:

trim_frequencies(ds.cat_col, {"freq_min": 2}) -> (ds.cat_trimmed)

More examples

To only keep the 10 most frequent categories in column cat_col:

trim_frequencies(ds.cat_col, {"n_most_common": 10}) -> (ds.cat_top10)

Usage

The following are the step's expected inputs and outputs and their specific types.

trim_frequencies(input: category|list[category], {"param": value}) -> (output: column)

where the object {"param": value} is optional in most cases and, if present, may contain any of the parameters described in the Parameters section below.

Inputs


input: column:category|list[category]

A categorical column to trim.

Outputs


output: column

A categorical column with fewer categories than the input.

Parameters


n_most_common: integer | null

The number N of most common values to keep, ranked by descending frequency. All other values are removed.

Range: 0 ≤ n_most_common < inf
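
The effect of n_most_common can be illustrated outside the recipe language. The following is a minimal Python sketch of the described behaviour, not the step's actual implementation. The helper name keep_n_most_common is hypothetical, None stands in for the missing value (NaN), and the tie-breaking rule (first occurrence wins) is an assumption the description above does not specify.

```python
from collections import Counter


def keep_n_most_common(values, n):
    """Keep the n most frequent values; replace all others with None.

    Hypothetical helper: None stands in for the missing value (NaN).
    """
    # Count only non-missing values.
    counts = Counter(v for v in values if v is not None)
    # most_common(n) returns the n highest-count (value, count) pairs.
    # Equal counts keep first-occurrence order, which is an assumption
    # about tie-breaking, not documented behaviour of the step.
    keep = {value for value, _ in counts.most_common(n)}
    return [v if v in keep else None for v in values]


colors = ["red", "red", "red", "blue", "blue", "green"]
print(keep_n_most_common(colors, 2))
# ['red', 'red', 'red', 'blue', 'blue', None]
```

With n set to 0, the lower bound of the range, this sketch replaces every value with None.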


freq_min: integer | null

Values with a lower frequency (count) than this will be removed.

Range: 1 ≤ freq_min < inf


freq_max: integer | null

Values with a higher frequency (count) than this will be removed.

Range: 1 ≤ freq_max < inf
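
The two thresholds can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than the step's implementation: the function name trim_by_frequency is hypothetical, None stands in for the missing value (NaN), and the comparisons are strict, following the wording "lower than" and "higher than" above. Whether the step accepts both parameters at once is not stated here; the sketch simply allows it.

```python
from collections import Counter


def trim_by_frequency(values, freq_min=None, freq_max=None):
    """Replace values whose count is below freq_min or above freq_max with None.

    Hypothetical helper: None stands in for the missing value (NaN).
    """
    # Count only non-missing values.
    counts = Counter(v for v in values if v is not None)

    def keep(value):
        if value is None:
            return False
        count = counts[value]
        # Strict comparisons: a count equal to a threshold is kept.
        if freq_min is not None and count < freq_min:
            return False
        if freq_max is not None and count > freq_max:
            return False
        return True

    return [v if keep(v) else None for v in values]


colors = ["red", "red", "red", "blue", "blue", "green"]
print(trim_by_frequency(colors, freq_min=2))
# ['red', 'red', 'red', 'blue', 'blue', None]
print(trim_by_frequency(colors, freq_max=2))
# [None, None, None, 'blue', 'blue', 'green']
```

Note that a value occurring exactly freq_min or freq_max times survives in this sketch, since only strictly lower or strictly higher counts are removed.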