Converts raw timestamped events, or contiguous time series data, from their original frequency to daily, weekly, monthly, quarterly, yearly or other frequencies. Essentially, it groups each time or event series by the specified time period and applies the desired aggregations (count of events, total spend, etc.). The step accepts input data, and can generate output data, in both a tall and a wide format:
  • Tall format
    Each row represents a single event or observation, and the dataset contains scalar columns for the event’s timestamp as well as for identifying the series, customer or entity the event belongs to. This is the most common format for event data, and the most likely format for data imported into Graphext.
  • Wide format
    Each row represents a single entity (customer), and the dataset contains columns of lists containing the event timestamps and values for each series or observation. In this case, all lists in the same row must have the same length. This format is the most convenient for analysis in Graphext, as it allows for easy exploration of the time series data. You maintain one row per customer (entity), instead of duplicating the customer’s information for each event, yet you can still access, plot and generally work with all of the customer’s time series.
Note that both formats have the same number of columns. The difference is that in the “tall” format, each row represents a single event, while in the “wide” format, each row represents a single entity and contains all its events. You can think of the wide format as the result of aggregating by the time series identifier, and collecting the timestamps and values in parallel columns of lists. You can use the parameters below to configure the frequency to resample the data to, the format of the output dataset (tall vs wide), whether to fill gaps in the resampled data, etc.
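To make the tall/wide distinction concrete, here is a small pandas sketch (the data and column names are hypothetical, and this is an illustration, not what the step runs internally): the wide format is what you get by grouping the tall format on the entity identifier and collecting each remaining column into a list.

```python
import pandas as pd

# A "tall" dataset: one row per event (hypothetical columns)
tall = pd.DataFrame({
    "customer_id": ["a", "a", "b"],
    "timestamp": pd.to_datetime(["2024-01-01", "2024-01-08", "2024-01-03"]),
    "price": [10.0, 5.0, 7.5],
})

# The "wide" equivalent: one row per entity, with parallel list columns
# (timestamps and values stay aligned within each row)
wide = tall.groupby("customer_id", as_index=False).agg(list)
print(wide)
```

Note that each list column in a given row has the same length, as required by the wide format.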

Usage

The following example shows how the step can be used in a recipe.

Examples

  • Example 1
  • Signature
This example resamples a dataset ds of shopping events in “tall” format to a weekly frequency, calculating the number of events per customer per week, the weekly total and average spend, and the percentage of purchases in the category “pet food”. The original value columns to be aggregated are “price” and “category”. It also requests the output to be in “wide” format, which is the most convenient in Graphext. Setting fill_gaps to true ensures that the resampled data contains rows for all weeks between a customer’s first and last event, even those with no events.
resample(ds, {
  "id": "customer_id",
  "time": "timestamp",
  "freq": "W",
  "output": "wide",
  "fill_gaps": true,
  "aggregations": {
    "price": {
      "total_spend": {"func": "sum"},
      "avg_spend": {"func": "mean"}
    },
    "id": {
      "num_items_purchased": {"func": "count"}
    },
    "category": {
      "pct_pet_food": {"func": "percent_where", "value": "pet food"}
    }
  }
}) -> (weekly)
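For intuition only, the aggregations in this recipe correspond roughly to the following pandas sketch (hypothetical data; note this sketch produces “tall” output and does not fill gaps, unlike the recipe above):

```python
import pandas as pd

# Hypothetical shopping events in "tall" format
events = pd.DataFrame({
    "customer_id": ["a", "a", "a", "b"],
    "timestamp": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-16", "2024-01-02"]),
    "price": [10.0, 5.0, 20.0, 8.0],
    "category": ["pet food", "toys", "pet food", "pet food"],
})

# Group by customer and calendar week (weeks ending Sunday, as with freq "W"),
# then compute the same aggregations as in the recipe
weekly = (
    events
    .groupby(["customer_id", pd.Grouper(key="timestamp", freq="W")])
    .agg(
        total_spend=("price", "sum"),
        avg_spend=("price", "mean"),
        num_items_purchased=("price", "count"),
        pct_pet_food=("category", lambda s: (s == "pet food").mean() * 100),
    )
    .reset_index()
)
print(weekly)
```

Each output column name in the recipe’s aggregations object (total_spend, avg_spend, …) becomes a column in the resampled dataset, exactly as in the named aggregations above.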

Inputs & Outputs

The following are the inputs expected by the step and the outputs it produces. These are generally columns (ds.first_name), datasets (ds or ds[["first_name", "last_name"]]) or models (referenced by name e.g. "churn-clf").
ds_in
dataset
required
A dataset of events or time series to resample. Must have id, timestamp and at least one value column.
ds_out
dataset
required
The resampled dataset in the configured format. Will have id and timestamp columns, as well as one for each aggregation function specified in the aggregations parameter. The timestamp and aggregated value columns will have scalar values if the output was requested in “tall” format, or lists if in “wide” format.

Configuration

The following parameters can be used to configure the behaviour of the step by including them in a json object as the last “input” to the step, i.e. step(..., {"param": "value", ...}) -> (output).

Parameters

id
[string, array[string]]
required
Entity identifier(s). I.e. name of the column(s) containing the entity identifier. If there are multiple time series in your dataset, e.g. one per customer, this is the column that identifies the series. If there are multiple ID columns, the time series will be grouped by the unique combination of values in these columns.
Item
string (ds_in.column)
Each item in array.
  • customer_id
  • ["last_name", "first_name", "birthday"]
time
string (ds_in.column)
required
Timestamp. Name of the column containing the event timestamp.
freq
string
required
Frequency. Alias of the frequency to resample the data to. The following are the possible values for the freq parameter. Also see the corresponding Pandas documentation for more details on each frequency.
Alias        Description
B            Business day (weekday)
D            Calendar day (absolute)
W            Week, optionally anchored on a day of the week (W-SUN…)
ME           Calendar month end (last day of month)
SME          Semi-month end (15th and end of month)
BME          Last business day of month
MS           Calendar month start (first day of month)
SMS          Semi-month start (1st and 15th)
BMS          First business day of month
QE           Calendar quarter end
BQE          Business quarter end
QS           Calendar quarter start
BQS          Business quarter start
YE/A/Y       Calendar year end
BYE/BA/BY    Business year end
YS/AS        Calendar year start
BYS/BAS/BYS  Business year start
h/H          Hour
bh/BH        Business hour
min/T        Minute
s/S          Second
ms/L         Millisecond
us/U         Microsecond
ns/N         Nanosecond

Values must be one of the following: B D W M ME SM SME BM BME MS SMS BMS Q QE BQ BQE QS BQS A Y YE BA BY BYE AS BAS BYS h H bh BH T min S s L ms U us N ns
anchor
[string, null]
Anchor. The date to anchor the resampling on. For example, if the frequency is “W” (weekly) and the anchor is “WED”, the resampling will be done on periods between consecutive Wednesdays. If the frequency is “YS” (yearly) and the anchor is “JUL”, the resampling will be annual, with each period starting at the beginning of July. The anchor can only be used with weekly, quarterly and yearly frequencies.
Values must be one of the following: MON TUE WED THU FRI SAT SUN JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC
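In pandas, on which the frequency aliases are based, the same anchoring is expressed by suffixing the alias (e.g. “W-WED”). A small sketch with hypothetical data shows the effect: every resampled period ends on the anchor day.

```python
import pandas as pd

# Three events on Mon Jan 1, Wed Jan 3 and Thu Jan 4, 2024 (hypothetical)
ts = pd.Series(
    [1, 2, 3],
    index=pd.to_datetime(["2024-01-01", "2024-01-03", "2024-01-04"]),
)

# Weekly periods anchored on Wednesday: each bin ends on a Wednesday,
# so Mon + Wed fall in one period and Thu starts the next
anchored = ts.resample("W-WED").sum()
print(anchored)
```

The first two events sum into the period ending Wednesday Jan 3, while the Thursday event lands in the period ending the following Wednesday.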
output
string
default:"tall"
Output format. The format of the output dataset. In “tall” format, each row represents a single event or observation, and columns contain scalar values. In the “wide” format, each row represents a single entity, and the dataset contains columns of lists (of the same length within each row).
Values must be one of the following:
  • tall
  • wide
fill_gaps
boolean
default:"true"
Fill gaps. Whether to fill gaps in the resampled data with NaN/0 values. If set to false, the resampled data will only contain rows for which there are events in the original data. If set to true, the resampled data will contain rows for all periods in the resampled frequency, with NaN/0 values for periods with no events.
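The effect of fill_gaps can be sketched in pandas (hypothetical data): resampling with gap filling yields one row per period between the first and last event, with empty periods counted as 0, while without it only periods containing events remain.

```python
import pandas as pd

# Two events three weeks apart (hypothetical)
ts = pd.Series(
    [10.0, 20.0],
    index=pd.to_datetime(["2024-01-01", "2024-01-22"]),
)

# With gap filling: one row per week between first and last event;
# the two empty weeks in between get a count of 0
with_gaps_filled = ts.resample("W").count()
print(with_gaps_filled)

# Without gap filling: keep only the weeks that actually contain events
only_observed = with_gaps_filled[with_gaps_filled > 0]
print(only_observed)
```

Filling gaps is usually what you want for plotting or comparing series, since all entities then share a regular, contiguous time axis.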
aggregations
object
required
Definition of desired aggregations. A dictionary mapping original columns to new aggregated columns, specifying an aggregation function for each. Aggregations are functions that reduce all the values in a particular column of a single group to a single summary value for that group. E.g. a sum aggregation of column A calculates a single total by adding up all the values in A belonging to each group.
In contrast to the more generic aggregate and group_by steps, for time series resampling only functions returning scalar values are supported. Allowed options for the func parameter are:
  • n, size or count: calculate number of rows in group
  • sum: sum total of values
  • mean: take mean of values
  • max: take max of values
  • min: take min of values
  • first: take first item found
  • last: take last item found
  • n_unique: count the number of unique values
  • concatenate: convert all values to text and concatenate them into one long text
  • count_where: number of rows in which the column matches a value, needs parameter value with the value that you want to count
  • percent_where: percentage of the column where the column matches a value, needs parameter value with the value that you want to count
Note that in the case of count_where and percent_where an additional value parameter is required.
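To illustrate the semantics of count_where and percent_where (this is a pandas sketch with hypothetical data, not the step’s implementation): both test the column against the given value, then count or take the percentage of matching rows per group.

```python
import pandas as pd

# Hypothetical purchases per customer
df = pd.DataFrame({
    "customer_id": ["a", "a", "a", "b"],
    "category": ["pet food", "toys", "pet food", "toys"],
})

# count_where("pet food"): number of rows where category == "pet food"
# percent_where("pet food"): percentage of rows where category == "pet food"
summary = df.groupby("customer_id").agg(
    n_pet_food=("category", lambda s: (s == "pet food").sum()),
    pct_pet_food=("category", lambda s: (s == "pet food").mean() * 100),
)
print(summary)
```

Customer “a” has two “pet food” rows out of three (≈66.7%), while customer “b” has none.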
input_aggregations
object
One item per input column. Each key should be the name of an input column, and each value an object defining one or more aggregations for that column. An individual aggregation consists of the name of a desired output column, mapped to a specific aggregation function. For example:
{
  "input_col": {
    "output_col": {"func": "sum"}
  }
}
aggregation_func
object
Object defining how to aggregate a single output column. Needs at least the "func" parameter. If the aggregation function accepts further arguments, like the "value" parameter in case of count_where and percent_where, these need to be provided also. For example:
{
  "output_col": {"func": "count_where", "value": 2}
}
func
string
Aggregation function.
Values must be one of the following: n size count sum mean n_unique count_where percent_where concatenate max min first last list
  • Including an aggregation function with additional parameters:
{
  "product_id": {
    "products": {"func": "list"},
    "size": {"func": "count"}
  },
  "item_total": {
    "total": {"func": "sum"}
  },
  "item_category": {
    "num_food_items": {"func": "count_where", "value": "food"}
  }
}