Aggregate

group by

Group and aggregate a dataset using any of a number of predefined functions.

After optionally sorting the dataset, it is grouped by the unique values (or combinations of unique values) in one or more columns. Each group's rows are then aggregated using one or more predefined functions. A new dataset is thus created containing one column per selected aggregation function, and one row for each unique group.

Usage


The following are the step's expected inputs and outputs and their specific types.

Step signature
aggregate(ds_in: dataset, {
    "param": value
}) -> (ds_out: dataset)

where the object {"param": value} is optional in most cases and, if present, may contain any of the parameters described in the corresponding section below.

Example

Given an online retail dataset products, in which each row represents an item (identified by product_id, with value item_total) that was added to a shopping basket (order_id) at time time_added, we can aggregate these items into a new dataset baskets containing one row per basket. The following configuration calculates this aggregation, creating a new dataset with three columns:

  • products: a list of all items in a given basket, preserving the order in which they were added
  • size: the number of items in the basket
  • total: the total value of the basket
Example call (in recipe editor)
aggregate(products, {
  "by": "order_id",
  "presort": {
    "columns": "time_added"
  },
  "aggregations": {
    "product_id": {
      "products": {"func": "list"},
      "size": {"func": "count"}
    },
    "item_total": {
      "total": {"func": "sum"}}
    }
}) -> (baskets)

Inputs


ds_in: dataset

A dataset to group and aggregate.

Outputs


ds_out: dataset

The result of the aggregation.

Parameters


by: string | array

Grouping column(s). The name(s) of column(s) whose unique values define the groups to aggregate.

Example parameter values:

  • "order_id"
  • ["weekday", "hour"]

presort: object

Pre-aggregation row sorting. Sort the dataset rows before aggregating, e.g. when the order in which values are encountered matters for a particular aggregation function (such as list).

Items in presort

columns: null | string | array[string]

The sort column name(s). These column(s) will be used to sort the dataset before aggregating (if multiple, in specified order).

Example parameter values:

  • "date_added"
  • ["lastname", "firstname"]

ascending: boolean = True

Whether to sort in ascending order (or in descending order if false).

Example parameter values:

  • With a single column for sorting:

    "presort": {
      "columns": "date_added",
      "ascending": true
    }
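
  • With multiple columns for sorting, in descending order (a sketch reusing the lastname/firstname example columns above):

    "presort": {
      "columns": ["lastname", "firstname"],
      "ascending": false
    }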
    

aggregations: object

Definition of desired aggregations. A dictionary mapping original columns to new aggregated columns, specifying an aggregation function for each. Aggregations are functions that reduce all the values in a particular column of a single group to a single summary value for that group. E.g. a sum aggregation of column A calculates, for each group, a single total by adding up all the values of A belonging to that group.

Possible aggregation functions accepted as the func parameter are:

  • n, size or count: calculate number of rows in group
  • sum: sum total of values
  • mean: take mean of values
  • max: take max of values
  • min: take min of values
  • first: take first item found
  • last: take last item found
  • unique: collect a list of unique values
  • n_unique: count the number of unique values
  • list: collect a list of all values
  • concatenate: convert all values to text and concatenate them into one long text
  • concat_lists: concatenate lists in all rows into a single larger list
  • count_where: number of rows in which the column matches a value; requires an additional value parameter specifying the value to count
  • percent_where: percentage of rows in which the column matches a value; requires an additional value parameter specifying the value to count

Note that in the case of count_where and percent_where an additional value parameter is required.
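
For example, to count and measure the share of rows in the food category (extending the item_category example used further below; percent_food_items is an illustrative output column name):

"aggregations": {
  "item_category": {
    "num_food_items": {"func": "count_where", "value": "food"},
    "percent_food_items": {"func": "percent_where", "value": "food"}
  }
}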

Items in aggregations

input_aggregations: object

One item per input column. Each key should be the name of an input column, and each value an object defining one or more aggregations for that column. An individual aggregation consists of the name of a desired output column, mapped to a specific aggregation function. For example:

{
  "input_col": {
    "output_col": {"func": "sum"}
  }
}
Items in input_aggregations

aggregation_func: object

Object defining how to aggregate a single output column. Needs at least the "func" parameter. If the aggregation function accepts further arguments, like the "value" parameter in the case of count_where and percent_where, these also need to be provided. For example:

{
  "output_col": {"func": "count_where", "value": 2}
}
Items in aggregation_func

func: string

Aggregation function.

Must be one of: "n", "size", "count", "sum", "mean", "n_unique", "count_where", "percent_where", "concatenate", "max", "min", "first", "last", "concat_lists", "unique", "list"

Example parameter values:

  • Including an aggregation function with additional parameters:

    {
      "product_id": {
        "products": {"func": "list"},
        "size": {"func": "count"}
      },
      "item_total": {
        "total": {"func": "sum"},
      },
      "item_category": {
        "num_food_items": {"func": "count_where", "value": "food"}
      }
    }
    

drop_nan: boolean = False

Whether to ignore missing values (NaNs) in group columns. If false (default), missing values (NaNs) will be grouped together in their own group. Otherwise, rows containing NaNs in the group column will be ignored.


sort_groups: boolean = False

Whether to sort groups by values in the grouping columns. This doesn't affect the sorting of rows within groups, which is always maintained (and may depend on the presort parameter), but only the ordering amongst groups. If the order of groups is not important, leaving this disabled (the default) will usually result in faster execution.
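
For example, the following call (a sketch based on the products dataset from the example above) ignores rows with a missing order_id and returns the resulting basket groups sorted by order_id:

aggregate(products, {
  "by": "order_id",
  "drop_nan": true,
  "sort_groups": true,
  "aggregations": {
    "product_id": {
      "size": {"func": "count"}
    }
  }
}) -> (baskets)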


force_pandas: boolean = False

Enforce use of Pandas aggregation. Normally, depending on dataset size, the step automatically switches between Pandas and Dask aggregation, preferring whichever represents a better trade-off between execution time and memory usage. For very large datasets, Dask is the only viable method, but Dask has limitations when it comes to sorting. For intermediate dataset sizes, if you need to sort the dataset on more than a single column before aggregation, you can try enforcing the use of Pandas if you otherwise see warnings or errors related to sorting.
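
For example, the following call (again a sketch based on the products dataset from the example above) presorts on two columns and forces the use of Pandas:

aggregate(products, {
  "by": "order_id",
  "force_pandas": true,
  "presort": {
    "columns": ["time_added", "product_id"]
  },
  "aggregations": {
    "product_id": {
      "products": {"func": "list"}
    }
  }
}) -> (baskets)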