aggregate
Group and aggregate a dataset using any of a number of predefined functions.
After optionally sorting the dataset, it is grouped by the unique values (or combinations of unique values) in one or more columns. Each group’s rows are then aggregated using one or more predefined functions. A new dataset is thus created containing one column per selected aggregation function, and one row for each unique group.
Usage
The following example shows how the step can be used in a recipe.
Given an online retail dataset products, where rows represent items with id product_id that have been added to a shopping basket at time time_added, we can aggregate these items into a new dataset baskets containing one row per basket. The following configuration calculates this aggregation, creating a new dataset with three columns:

- products: a list of all items in a given basket, preserving the order they were added
- size: the number of items in the basket
- total: the total value of the basket
General syntax for using the step in a recipe. Shows the inputs and outputs the step is expected to receive and will produce, respectively. For further details see the sections below.
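In outline, and with ds_aggregated as a placeholder output name, the pattern is:

```
aggregate(ds, {"param": "value", ...}) -> (ds_aggregated)
```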
Inputs & Outputs
The following are the inputs expected by the step and the outputs it produces. These are generally columns (ds.first_name), datasets (ds or ds[["first_name", "last_name"]]) or models (referenced by name, e.g. "churn-clf").
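For instance, the step may receive a two-column projection of a dataset as input and produce a new dataset (a sketch; basket_id is a hypothetical column, as above):

```
aggregate(products[["basket_id", "price"]], {"param": "value"}) -> (baskets)
```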
Configuration
The following parameters can be used to configure the behaviour of the step by including them in a JSON object as the last “input” to the step, i.e. step(..., {"param": "value", ...}) -> (output).
Grouping column(s). The name(s) of column(s) whose unique values define the groups to aggregate.
Pre-aggregation row sorting. Sort the dataset rows before aggregating, e.g. when a particular aggregation function (such as list) depends on the order in which values are encountered.
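For example, to collect basket items in the order they were added (a sketch reusing the assumed parameter names and columns from the Usage example above):

```
aggregate(products, {
    "by": ["basket_id"],
    "presort": "time_added",
    "aggregations": {"product_id": {"products": "list"}}
}) -> (baskets)
```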
Definition of desired aggregations. A dictionary mapping original columns to new aggregated columns, specifying an aggregation function for each. Aggregations are functions that reduce all the values in a particular column of a single group to a single summary value for that group. E.g. a sum aggregation of column A calculates a single total by adding up all the values in A belonging to each group. Possible aggregation functions accepted as func parameters are:
- n, size or count: calculate the number of rows in the group
- sum: sum total of values
- mean: take the mean of values
- max: take the max of values
- min: take the min of values
- mode: find the most frequent value (returns the first mode if multiple exist)
- first: take the first item found
- last: take the last item found
- unique: collect a list of unique values
- n_unique: count the number of unique values
- list: collect a list of all values
- concatenate: convert all values to text and concatenate them into one long text
- concat_lists: concatenate the lists in all rows into a single larger list
- count_where: number of rows in which the column matches a value; needs parameter value with the value you want to count
- percent_where: percentage of the column where the column matches a value; needs parameter value with the value you want to count
Note that in the case of count_where and percent_where an additional value parameter is required.
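As a sketch, and again assuming the dictionary shape used in the examples above, a count_where aggregation with its required value parameter might be configured like this (the output name n_gift_cards and the value "gift-card" are hypothetical):

```
"aggregations": {
    "product_id": {
        "n_gift_cards": {"func": "count_where", "value": "gift-card"}
    }
}
```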
Whether to ignore missing values (NaNs) in group columns. If false (default), missing values (NaNs) will be grouped together in their own group. Otherwise, rows containing NaNs in the group column will be ignored.
Whether to sort groups by values in the grouping columns. This doesn’t affect the sorting of rows within groups, which is always maintained (and may depend on the presort parameter), but only the ordering amongst groups. If the order of groups is not important, leaving this off will usually result in faster execution (false by default).
Enforce use of Pandas aggregation. Normally, depending on dataset size, the step will automatically switch between Pandas and Dask aggregation, preferring whichever represents a better trade-off between execution time and memory usage. For very large datasets, Dask is the only viable method, but Dask has limitations when it comes to sorting. For intermediate dataset sizes, if you need to sort the dataset on more than one column before aggregation, you can try enforcing the use of Pandas if you otherwise see warnings or errors related to sorting.