This step is essentially an aggregation performed after “exploding” a column of lists so that each list item has its own row. By default it produces one row per unique list item, together with two columns: a count of how many times the item was encountered, and a column named rows recording the row numbers of the lists in which the item was found ([1,3,7] would mean the item was present in the lists of rows 1, 3 and 7). In addition, predefined functions can be used to add further aggregations of the grouped input dataset.

For example, if a dataset contains texts already separated into lists of individual words, this step will create a new dataset containing one row per word, a column containing each word’s frequency (count) across all texts, and another column of lists indicating in which rows the word was found.
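The default behaviour described above can be illustrated with a minimal pure-Python sketch. It is not the step's actual implementation: the helper name `aggregate_list_items`, the zero-based row numbering, and the choice to record each row number only once per item are all assumptions made for illustration.

```python
def aggregate_list_items(lists):
    """Hypothetical helper: explode a column of lists, then aggregate per item.

    Returns a dict mapping each unique item to its total count and to the
    (zero-based, assumed) numbers of the rows whose list contains it.
    """
    result = {}
    for row_number, items in enumerate(lists):
        for item in items:
            entry = result.setdefault(item, {"count": 0, "rows": []})
            entry["count"] += 1  # every occurrence is counted
            if row_number not in entry["rows"]:
                entry["rows"].append(row_number)  # each row recorded once (assumption)
    return result


# A toy "column" of texts already split into words, one list per row.
texts = [
    ["the", "cat", "sat"],
    ["the", "dog"],
    ["cat", "and", "dog"],
]

word_stats = aggregate_list_items(texts)
for word, stats in word_stats.items():
    print(word, stats["count"], stats["rows"])
```

Each key of `word_stats` corresponds to one row of the output dataset, with `count` and `rows` standing in for the two default columns.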

Optionally, a grouping column can be specified using the "by" parameter, in which case otherwise identical items belonging to different groups are counted separately. If the dataset contains texts in different languages, for example, one may not want to group all occurrences of the same word together irrespective of language: the German word “Angel” means fishing rod, “any” in Catalan means “year”, and “burro” means “butter” in Italian but “donkey” in Spanish. Using language as the grouping column would keep the word in each language as a separate group.
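The effect of a grouping column can be sketched in the same spirit. Again this is only an illustration under stated assumptions: the helper `aggregate_list_items_by`, the language codes, and the zero-based row numbers are hypothetical and not taken from the step itself.

```python
def aggregate_list_items_by(lists, groups):
    """Hypothetical helper: like a plain item count, but keyed by (group, item).

    `groups` holds one group value per row, playing the role of the column
    that would be passed via the "by" parameter.
    """
    result = {}
    for row_number, (items, group) in enumerate(zip(lists, groups)):
        for item in items:
            key = (group, item)  # identical items in different groups stay apart
            entry = result.setdefault(key, {"count": 0, "rows": []})
            entry["count"] += 1
            if row_number not in entry["rows"]:
                entry["rows"].append(row_number)
    return result


words = [
    ["burro", "pan"],     # Spanish
    ["burro", "grande"],  # Spanish
    ["pane", "burro"],    # Italian
]
languages = ["es", "es", "it"]

grouped = aggregate_list_items_by(words, languages)
print(grouped[("es", "burro")])
print(grouped[("it", "burro")])
```

Here the Spanish and Italian “burro” end up as two separate entries, whereas an ungrouped count would merge them into a single item with a count of 3.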

Usage

The following example shows how the step can be used in a recipe.

Inputs & Outputs

The following are the inputs expected by the step and the outputs it produces. These are generally columns (ds.first_name), datasets (ds or ds[["first_name", "last_name"]]) or models (referenced by name e.g. "churn-clf").

Configuration

The following parameters can be used to configure the behaviour of the step by including them in a JSON object as the last “input” to the step, i.e. step(..., {"param": "value", ...}) -> (output).