trim_frequencies

Affected categories are replaced with the missing value (NaN).

Usage

The following examples show how the step can be used in a recipe.

Examples

To remove categories occurring fewer than 2 times in the column cat_col:

trim_frequencies(ds.cat_col, {"freq_min": 2}) -> (ds.cat_trimmed)

To only keep the 10 most frequent categories in column cat_col:

trim_frequencies(ds.cat_col, {"n_most_common": 10}) -> (ds.cat_top10)
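
To make the effect of freq_min concrete, the following plain-Python sketch mimics the first example on a small list. It is an illustration only, not the step's implementation: the helper name trim_by_min_count is hypothetical, and None stands in for the missing value (NaN).

```python
from collections import Counter


def trim_by_min_count(values, freq_min):
    """Replace values seen fewer than freq_min times with None.

    Hypothetical helper for illustration; None stands in for NaN.
    """
    # Count each category, ignoring values that are already missing.
    counts = Counter(v for v in values if v is not None)
    return [
        v if v is not None and counts[v] >= freq_min else None
        for v in values
    ]


cat_col = ["a", "b", "a", "c", "a", "b"]
cat_trimmed = trim_by_min_count(cat_col, freq_min=2)
print(cat_trimmed)  # ['a', 'b', 'a', None, 'a', 'b']
```

Here "c" occurs only once, so it is the only category replaced; "a" (3 times) and "b" (2 times) are kept.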

General syntax for using the step in a recipe. It shows the inputs the step expects and the outputs it produces. For further details see the sections below.

trim_frequencies(input: category|list[category], {
    "param": value,
    ...
}) -> (output: column)

Inputs & Outputs

The following are the inputs expected by the step and the outputs it produces. These are generally columns (ds.first_name), datasets (ds or ds[["first_name", "last_name"]]) or models (referenced by name e.g. "churn-clf").

Inputs

Outputs

Configuration

The following parameters configure the behaviour of the step. Include them in a JSON object passed as the last “input” to the step, i.e. step(..., {"param": "value", ...}) -> (output).

Parameters

n_most_common

[integer, null]

The number N of most common values to keep, ranked by descending frequency. All other values are removed.

Values must be in the following range:

0 ≤ n_most_common < inf

freq_min

[integer, null]

Values with a lower frequency (count) than this will be removed.

Values must be in the following range:

1 ≤ freq_min < inf

freq_max

[integer, null]

Values with a higher frequency (count) than this will be removed.

Values must be in the following range:

1 ≤ freq_max < inf
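
The parameters n_most_common and freq_max can be sketched in plain Python as below. This is a rough illustration under stated assumptions, not the step's actual code: the helper names keep_most_common and trim_by_max_count are hypothetical, None stands in for NaN, and ties between equally frequent categories are broken by first appearance, which may differ from what the step does.

```python
from collections import Counter


def keep_most_common(values, n_most_common):
    """Keep only the n_most_common most frequent values; others become None.

    Hypothetical helper; tie-breaking follows Counter.most_common
    (first-seen order), which is an assumption about the real step.
    """
    counts = Counter(v for v in values if v is not None)
    keep = {v for v, _ in counts.most_common(n_most_common)}
    return [v if v in keep else None for v in values]


def trim_by_max_count(values, freq_max):
    """Replace values seen more than freq_max times with None."""
    counts = Counter(v for v in values if v is not None)
    return [
        v if v is not None and counts[v] <= freq_max else None
        for v in values
    ]


colours = ["red", "red", "red", "blue", "blue", "green"]
top2 = keep_most_common(colours, 2)
capped = trim_by_max_count(colours, 2)
print(top2)    # ['red', 'red', 'red', 'blue', 'blue', None]
print(capped)  # [None, None, None, 'blue', 'blue', 'green']
```

The sketch treats each parameter separately; how the step behaves when several parameters are set at once is not covered here.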
