Group data by specified columns and apply aggregation functions to each group.
The following examples show how the step can be used in a recipe.
Examples
This example groups the dataset by an exact match on the category column and a date component (month level) on the date column, and then aggregates the count of sales and the sum of revenue:
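A hedged sketch of how this example might be written. The step name group_dataset is a placeholder, and the aggregation item keys (name, column, type) and the date param key component are inferred from the parameter descriptions below rather than confirmed:

```
group_dataset(ds, {
  "by": [
    {"by": "category", "groupingType": "EXACT"},
    {"by": "date", "groupingType": "DATE_COMPONENT", "name": "month", "param": {"component": "MONTH"}}
  ],
  "aggregations": [
    {"name": "sales_count", "column": "sales", "type": "COUNT"},
    {"name": "revenue_sum", "column": "revenue", "type": "SUM"}
  ]
}) -> (ds_grouped)
```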
This example uses the simplified by parameter to group by an exact match on category. The aggregation calculates the average of revenue for each group:
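A hedged sketch of the simplified form, again using the placeholder step name group_dataset and assumed aggregation item keys:

```
group_dataset(ds, {
  "by": ["category"],
  "aggregations": [
    {"name": "avg_revenue", "column": "revenue", "type": "AVG"}
  ]
}) -> (ds_grouped)
```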
This example groups by category and creates a sorted list of values based on the sortCol column:
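A hedged sketch of this example. The field used to name the sort column is not documented in the extracted schema, so sortBy below is purely illustrative, as is the placeholder step name:

```
group_dataset(ds, {
  "by": ["category"],
  "aggregations": [
    {"name": "sorted_values", "column": "values", "type": "LIST", "sortBy": "sortCol"}
  ]
}) -> (ds_grouped)
```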
General syntax for using the step in a recipe. Shows the inputs and outputs the step is expected to receive and will produce respectively. For further details see the sections below.
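A minimal sketch of the general call shape, using group_dataset as a placeholder for the actual step name:

```
group_dataset(ds, {
  "by": [...],
  "aggregations": [...]
}) -> (ds_aggregated)
```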
The following are the inputs expected by the step and the outputs it produces. These are generally columns (ds.first_name), datasets (ds or ds[["first_name", "last_name"]]) or models (referenced by name, e.g. "churn-clf").
Inputs
The input dataset containing the columns to group by and apply aggregations on.
Outputs
A dataset containing the aggregated results based on the grouping operations.
The following parameters can be used to configure the behaviour of the step by including them in a JSON object as the last “input” to the step, i.e. step(..., {"param": "value", ...}) -> (output).
Parameters
Columns to group by.
An array specifying the columns used for grouping. The by parameter can be either:
A simple array of column names (e.g. ["column1", "column2"]), which defaults to EXACT grouping.
An array of objects, each with by, groupingType, optional name and optional param properties.
Array items
Column name to group by.
Column to group by.
Name of the output column. It is optional and defaults to the column name (by
parameter).
Type of grouping operation. You can group by exact value match, a date component, a range of numerical values, or quantiles.
Values must be one of the following:
EXACT
DATE_COMPONENT
RANGE
QUANTILES
Grouping parameters.
Options
Date component to group by.
Date component to group by (only for DATE_COMPONENT
grouping type).
Values must be one of the following:
MILLISECOND
SECOND
MINUTE
HOUR
YEAR_DAY
MONTH_DAY
WEEK_DAY
WEEK
WEEK_OF_YEAR
MONTH
QUARTER
YEAR
Timezone to use for date grouping. The timezone to apply when grouping by date component.
Date component to group by.
Date component to group by (only for DATE_COMPONENT
grouping type).
Values must be one of the following:
MILLISECOND
SECOND
MINUTE
HOUR
YEAR_DAY
MONTH_DAY
WEEK_DAY
WEEK
WEEK_OF_YEAR
MONTH
QUARTER
YEAR
Timezone to use for date grouping. The timezone to apply when grouping by date component.
Date interval to use. The interval unit to apply when grouping by range.
Values must be one of the following:
MILLISECOND
SECOND
MINUTE
HOUR
DAY
WEEK
MONTH
QUARTER
YEAR
Count of intervals in each range.
Number of intervals in each range when groupingType is RANGE.
Number of bins. Number of bins to divide the data into.
Values must be in the following range:
Use pretty bins. Whether to adjust bin edges to be more human-readable.
Range size as a number for RANGE grouping, or number of quantiles.
Specify a range size directly as a number when groupingType is RANGE, or the number of quantiles when groupingType is QUANTILES.
Null is accepted when no param is needed for the grouping type.
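To illustrate, the by items below sketch one possible param shape per grouping type. The key names component, timezone, interval and intervalCount are assumptions inferred from the option descriptions above, not confirmed field names, and the comments are for readability only:

```
// DATE_COMPONENT: group by a date part, optionally in a given timezone (assumed keys)
{"by": "date", "groupingType": "DATE_COMPONENT", "param": {"component": "MONTH", "timezone": "Europe/Madrid"}}

// RANGE on a date column: an interval unit and a count of intervals per range (assumed keys)
{"by": "date", "groupingType": "RANGE", "param": {"interval": "WEEK", "intervalCount": 2}}

// RANGE on a numeric column: the range size given directly as a number
{"by": "revenue", "groupingType": "RANGE", "param": 100}

// QUANTILES: the number of quantiles given directly as a number
{"by": "revenue", "groupingType": "QUANTILES", "param": 4}

// EXACT: no param is needed, so null is accepted
{"by": "category", "groupingType": "EXACT", "param": null}
```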
Aggregation functions to apply. An array specifying the aggregation functions to apply on each group. The array can be empty, in which case no aggregations are performed, but the dataset is still grouped by the specified columns.
Array items
Name of the output column.
Column on which the aggregation is applied. If null, the aggregation applies to the entire group.
Type of aggregation function.
The type of aggregation function to perform on the specified column. Includes support for standard aggregations (e.g. SUM, COUNT) as well as element-wise aggregations.
Notes:
PERCENT_OF_ROWS_WHERE: Computes the percentage of rows within each group where a condition is true.
PERCENT_OF_ROWS: Computes the percentage relative to the total number of rows across all groups.
Values must be one of the following:
COUNT
MIN
MAX
SUM
AVG
VARIANCE
STDEV
FIRST
LAST
P25
P50
P75
COUNT_WHERE
NUMBER_OF_ROWS
NUMBER_OF_ROWS_WHERE
PERCENT_OF_ROWS
PERCENT_OF_ROWS_WHERE
METRIC
MODE
UNIQUE_VALUES
LIST_UNIQUE
LIST
CONCATENATE
ELEMENT_COUNT
ELEMENT_MIN
ELEMENT_MAX
ELEMENT_SUM
ELEMENT_AVG
ELEMENT_VARIANCE
ELEMENT_STDEV
ELEMENT_FIRST
ELEMENT_LAST
The Graphext advanced query used to identify the rows to select prior to grouping.
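As a closing hedged illustration, an aggregations array mixing row counts, percentiles and list-style functions, assuming the item keys name, column and type described above:

```
"aggregations": [
  {"name": "n_rows", "column": null, "type": "NUMBER_OF_ROWS"},
  {"name": "revenue_p50", "column": "revenue", "type": "P50"},
  {"name": "unique_categories", "column": "category", "type": "LIST_UNIQUE"}
]
```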