group_by
Group data by specified columns and apply aggregation functions to each group.
Usage
The following examples show how the step can be used in a recipe.
This example groups the dataset by an exact match on the category column and a date component (month level) on the date column, and then aggregates the count of sales and the sum of revenue:
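The grouping described in this example can be sketched in plain Python to show what result to expect. This is not the recipe syntax of the step itself, only an illustration of the semantics. The sample rows and the function name are hypothetical, and the sketch assumes that "month level" means the calendar month number of the date and that "count of sales" means the number of rows in each group.

```python
from collections import defaultdict
from datetime import date

# Hypothetical sample rows; the column names mirror the example description.
rows = [
    {"category": "books", "date": date(2024, 1, 5), "sales": 3, "revenue": 30.0},
    {"category": "books", "date": date(2024, 1, 20), "sales": 1, "revenue": 12.5},
    {"category": "books", "date": date(2024, 2, 2), "sales": 2, "revenue": 20.0},
    {"category": "toys", "date": date(2024, 1, 9), "sales": 5, "revenue": 75.0},
]


def group_by_category_and_month(rows):
    """Group on the exact category value and on the month component of date."""
    groups = defaultdict(lambda: {"sales_count": 0, "revenue_sum": 0.0})
    for row in rows:
        # Assumption: the month component is the month number (1 to 12).
        key = (row["category"], row["date"].month)
        groups[key]["sales_count"] += 1               # count of sales rows
        groups[key]["revenue_sum"] += row["revenue"]  # sum of revenue
    return dict(groups)


result = group_by_category_and_month(rows)
```

Each key of result is a (category, month) pair, and each value holds the two aggregates for that group.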
This example uses the simplified by parameter to group by an exact match on category. The aggregation calculates the average of revenue for each group:
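As a rough illustration of the expected result, the same computation can be written in plain Python. This is a sketch of the semantics, not the step's recipe syntax. The sample rows and the function name are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical sample rows with the two columns used in the example.
rows = [
    {"category": "books", "revenue": 10.0},
    {"category": "books", "revenue": 20.0},
    {"category": "toys", "revenue": 75.0},
]


def average_revenue_by_category(rows):
    """Collect revenue values per exact category value, then average them."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[row["category"]].append(row["revenue"])
    return {category: mean(values) for category, values in buckets.items()}


averages = average_revenue_by_category(rows)
```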
General syntax for using the step in a recipe. It shows the inputs the step expects and the outputs it produces. For further details, see the sections below.
Inputs & Outputs
The following are the inputs expected by the step and the outputs it produces. These are generally columns (ds.first_name), datasets (ds or ds[["first_name", "last_name"]]) or models (referenced by name, e.g. "churn-clf").
Configuration
The following parameters can be used to configure the behaviour of the step by including them in a JSON object as the last “input” to the step, i.e. step(..., {"param": "value", ...}) -> (output).
Columns to group by.
An array specifying the columns used for grouping. The by parameter can be either:
- An array of column names (e.g., ["column1", "column2"]), which defaults to EXACT grouping.
- An array of objects with by, groupingType, optional name and optional param properties.
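The relationship between the two forms of by can be sketched as a small normalisation helper. The helper name normalize_by is hypothetical and is not part of the step. It only illustrates how a plain column name corresponds to an object with EXACT grouping, using the property names listed above.

```python
def normalize_by(by):
    """Expand plain column names into the object form with EXACT grouping.

    Hypothetical helper for illustration; entries already in object form
    are copied through unchanged.
    """
    normalized = []
    for item in by:
        if isinstance(item, str):
            # A bare column name defaults to EXACT grouping.
            normalized.append({"by": item, "groupingType": "EXACT"})
        else:
            normalized.append(dict(item))
    return normalized


simple = normalize_by(["column1", "column2"])
mixed = normalize_by(
    ["column1", {"by": "column2", "groupingType": "EXACT", "name": "col2"}]
)
```

Under this reading, ["column1", "column2"] is shorthand for two objects that each set groupingType to EXACT.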
Aggregation functions to apply.
An array specifying the aggregation functions to apply to each group. The array can be empty, in which case no aggregations are performed but the dataset is still grouped by the specified columns.
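One plausible reading of the empty-array case is that the output keeps one row per distinct combination of the grouping columns and nothing else. The plain-Python sketch below illustrates that reading. It is an assumption rather than confirmed behaviour of the step, and the function name and sample rows are hypothetical.

```python
def group_keys_only(rows, by):
    """Return one row per distinct combination of the grouping columns."""
    seen = {}
    for row in rows:
        key = tuple(row[col] for col in by)
        # Keep the first occurrence of each key, preserving input order.
        seen.setdefault(key, {col: row[col] for col in by})
    return list(seen.values())


# Hypothetical sample rows.
rows = [
    {"category": "books", "revenue": 10.0},
    {"category": "books", "revenue": 20.0},
    {"category": "toys", "revenue": 75.0},
]

grouped = group_keys_only(rows, ["category"])
```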
The Graphext advanced query used to select the rows to include before grouping.
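The order of operations matters here: rows are selected first, and only the selected rows are grouped. The sketch below shows that order in plain Python, with an ordinary predicate function standing in for the query. It does not use or imitate the Graphext query syntax, and all names in it are hypothetical.

```python
from collections import Counter


def filter_then_count(rows, predicate, column):
    """Select rows with the predicate, then count rows per group."""
    selected = [row for row in rows if predicate(row)]
    return Counter(row[column] for row in selected)


# Hypothetical sample rows.
rows = [
    {"category": "books", "revenue": 10.0},
    {"category": "books", "revenue": 20.0},
    {"category": "toys", "revenue": 75.0},
]

# Only rows with revenue above 15 take part in the grouping.
counts = filter_then_count(rows, lambda row: row["revenue"] > 15, "category")
```

Rows that fail the predicate never reach the grouping stage, so they do not contribute to any group's aggregates.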