Converts raw timestamped events, or contiguous time series data, from their original frequency to daily, weekly, monthly, quarterly, yearly or other frequencies. Essentially, it groups each time or event series by the specified time period and applies the desired aggregations (count of events, total spend, etc.). The step accepts input data, and can generate output data, in both a tall and a wide format:
  • Tall format
    Each row represents a single event or observation, and the dataset contains scalar columns for the event’s timestamp as well as for identifying the series, customer or entity the event belongs to. This is the most common format for event data, and the most likely format for data imported into Graphext.
  • Wide format
    Each row represents a single entity (customer), and the dataset contains columns of lists containing the event timestamps and values for each series or observation. In this case, all lists in the same row must have the same length. This format is the most convenient for analysis in Graphext, as it allows for easy exploration of the time series data. You maintain one row per customer (entity), instead of duplicating the customer’s information for each event, yet you can still access, plot and generally work with all of the customer’s time series.
Note that both formats have the same number of columns. The difference is that in the “tall” format, each row represents a single event, while in the “wide” format, each row represents a single entity and contains all its events. You can think of the wide format as the result of aggregating by the time series identifier, and collecting the timestamps and values in parallel columns of lists. You can use the parameters below to configure the frequency to resample the data to, the format of the output dataset (tall vs wide), whether to fill gaps in the resampled data, etc.
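To make the tall/wide distinction concrete, here is a small pandas sketch (the data and column names are hypothetical, and this is an illustration, not what the step runs internally): the wide format is what you get by grouping the tall format on the entity identifier and collecting each remaining column into a list.

```python
import pandas as pd

# A "tall" dataset: one row per event (hypothetical columns)
tall = pd.DataFrame({
    "customer_id": ["a", "a", "b"],
    "timestamp": pd.to_datetime(["2024-01-01", "2024-01-08", "2024-01-03"]),
    "price": [10.0, 5.0, 7.5],
})

# The "wide" equivalent: one row per entity, with parallel list columns
# (timestamps and values stay aligned within each row)
wide = tall.groupby("customer_id", as_index=False).agg(list)
print(wide)
```

Note that each list column in a given row has the same length, as required by the wide format.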

Usage

The following example shows how the step can be used in a recipe.

Examples

  • Example 1
  • Signature
This example resamples a dataset ds of shopping events in “tall” format to a weekly frequency, calculating the number of events per customer per week, the weekly total and average spend, and the percentage of purchases in the category “pet food”. The original value columns to be aggregated are “price” and “category”. It also requests the output to be in “wide” format, which is the most convenient in Graphext. Setting fill_gaps to true ensures that the resampled data contains rows for all weeks between a customer’s first and last event, even those with no events.
resample(ds, {
  "id": "customer_id",
  "time": "timestamp",
  "freq": "W",
  "output": "wide",
  "fill_gaps": true,
  "aggregations": {
    "price": {
      "total_spend": {"func": "sum"},
      "avg_spend": {"func": "mean"}
    },
    "id": {
      "num_items_purchased": {"func": "count"}
    },
    "category": {
      "pct_pet_food": {"func": "percent_where", "value": "pet food"}
    }
  }
}) -> (weekly)
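For intuition only, the aggregations in this recipe correspond roughly to the following pandas sketch (hypothetical data; note this sketch produces “tall” output and does not fill gaps, unlike the recipe above):

```python
import pandas as pd

# Hypothetical shopping events in "tall" format
events = pd.DataFrame({
    "customer_id": ["a", "a", "a", "b"],
    "timestamp": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-16", "2024-01-02"]),
    "price": [10.0, 5.0, 20.0, 8.0],
    "category": ["pet food", "toys", "pet food", "pet food"],
})

# Group by customer and calendar week (weeks ending Sunday, as with freq "W"),
# then compute the same aggregations as in the recipe
weekly = (
    events
    .groupby(["customer_id", pd.Grouper(key="timestamp", freq="W")])
    .agg(
        total_spend=("price", "sum"),
        avg_spend=("price", "mean"),
        num_items_purchased=("price", "count"),
        pct_pet_food=("category", lambda s: (s == "pet food").mean() * 100),
    )
    .reset_index()
)
print(weekly)
```

Each output column name in the recipe’s aggregations object (total_spend, avg_spend, …) becomes a column in the resampled dataset, exactly as in the named aggregations above.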

Inputs & Outputs

The following are the inputs expected by the step and the outputs it produces. These are generally columns (ds.first_name), datasets (ds or ds[["first_name", "last_name"]]) or models (referenced by name e.g. "churn-clf").
ds_in
dataset
required
A dataset of events or time series to resample. Must have id, timestamp and at least one value column.
ds_out
dataset
required
The resampled dataset in the configured format. Will have id and timestamp columns, as well as one for each aggregation function specified in the aggregations parameter. The timestamp and aggregated value columns will have scalar values if the output was requested in “tall” format, or lists if in “wide” format.

Configuration

The following parameters can be used to configure the behaviour of the step by including them in a json object as the last “input” to the step, i.e. step(..., {"param": "value", ...}) -> (output).

Parameters

id
[string, array[string]]
required
Entity identifier(s). I.e. name of the column(s) containing the entity identifier. If there are multiple time series in your dataset, e.g. one per customer, this is the column that identifies the series. If there are multiple ID columns, the time series will be grouped by the unique combination of values in these columns.
Item
string (ds_in.column)
Each item in array.
  • customer_id
  • ["last_name", "first_name", "birthday"]
time
string (ds_in.column)
required
Timestamp. Name of the column containing the event timestamp.
freq
string
required
Frequency. Alias of the frequency to resample the data to. The following are the possible values for the freq parameter. Also see the corresponding Pandas documentation for more details on each frequency.
Alias        Description
B            Business day (weekday)
D            Calendar day (absolute)
W            Week, optionally anchored on a day of the week (W-SUN…)
ME           Calendar month end (last day of month)
SME          Semi-month end (15th and end of month)
BME          Last business day of month
MS           Calendar month start (first day of month)
SMS          Semi-month start (1st and 15th)
BMS          First business day of month
QE           Calendar quarter end
BQE          Business quarter end
QS           Calendar quarter start
BQS          Business quarter start
YE/A/Y       Calendar year end
BYE/BA/BY    Business year end
YS/AS        Calendar year start
BYS/BAS/BYS  Business year start
h/H          Hour
bh/BH        Business hour
min/T        Minute
s/S          Second
ms/L         Millisecond
us/U         Microsecond
ns/N         Nanosecond

Values must be one of the following: B D W M ME SM SME BM BME MS SMS BMS Q QE BQ BQE QS BQS A Y YE BA BY BYE AS BAS BYS h H bh BH T min S s L ms U us N ns
anchor
[string, null]
Anchor. The date to anchor the resampling on. For example, if the frequency is “W” (weekly) and the anchor is “WED”, the resampling will be done on periods between consecutive Wednesdays. If the frequency is “YS” (yearly) and the anchor is “JUL”, the resampling will be annual, with each period starting at the beginning of July. The anchor can only be used with weekly, quarterly and yearly frequencies.
Values must be one of the following: MON TUE WED THU FRI SAT SUN JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC
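In pandas, on which the frequency aliases are based, the same anchoring is expressed by suffixing the alias (e.g. “W-WED”). A small sketch with hypothetical data shows the effect: every resampled period ends on the anchor day.

```python
import pandas as pd

# Three events on Mon Jan 1, Wed Jan 3 and Thu Jan 4, 2024 (hypothetical)
ts = pd.Series(
    [1, 2, 3],
    index=pd.to_datetime(["2024-01-01", "2024-01-03", "2024-01-04"]),
)

# Weekly periods anchored on Wednesday: each bin ends on a Wednesday,
# so Mon + Wed fall in one period and Thu starts the next
anchored = ts.resample("W-WED").sum()
print(anchored)
```

The first two events sum into the period ending Wednesday Jan 3, while the Thursday event lands in the period ending the following Wednesday.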
output
string
default:"tall"
Output format. The format of the output dataset. In “tall” format, each row represents a single event or observation, and columns contain scalar values. In the “wide” format, each row represents a single entity, and the dataset contains columns of lists (of the same length within each row).
Values must be one of the following:
  • tall
  • wide
fill_gaps
boolean
default:"true"
Fill gaps. Whether to fill gaps in the resampled data with NaN/0 values. If set to false, the resampled data will only contain rows for which there are events in the original data. If set to true, the resampled data will contain rows for all periods in the resampled frequency, with NaN/0 values for periods with no events.
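The effect of fill_gaps can be sketched in pandas (hypothetical data): resampling with gap filling yields one row per period between the first and last event, with empty periods counted as 0, while without it only periods containing events remain.

```python
import pandas as pd

# Two events three weeks apart (hypothetical)
ts = pd.Series(
    [10.0, 20.0],
    index=pd.to_datetime(["2024-01-01", "2024-01-22"]),
)

# With gap filling: one row per week between first and last event;
# the two empty weeks in between get a count of 0
with_gaps_filled = ts.resample("W").count()
print(with_gaps_filled)

# Without gap filling: keep only the weeks that actually contain events
only_observed = with_gaps_filled[with_gaps_filled > 0]
print(only_observed)
```

Filling gaps is usually what you want for plotting or comparing series, since all entities then share a regular, contiguous time axis.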
aggregations
object
required
Definition of desired aggregations. A dictionary mapping original columns to new aggregated columns, specifying an aggregation function for each. Aggregations are functions that reduce all the values in a particular column of a single group to a single summary value for that group. E.g. a sum aggregation of column A calculates a single total by adding up all the values in A belonging to each group.
In contrast to the more generic aggregate and group_by steps, for time series resampling only functions returning scalar values are supported. Allowed options for the func parameter are:
  • n, size or count: calculate number of rows in group
  • sum: sum total of values
  • mean: take mean of values
  • max: take max of values
  • min: take min of values
  • first: take first item found
  • last: take last item found
  • n_unique: count the number of unique values
  • concatenate: convert all values to text and concatenate them into one long text
  • count_where: number of rows in which the column matches a value, needs parameter value with the value that you want to count
  • percent_where: percentage of the column where the column matches a value, needs parameter value with the value that you want to count
Note that in the case of count_where and percent_where an additional value parameter is required.
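To illustrate the semantics of count_where and percent_where (this is a pandas sketch with hypothetical data, not the step’s implementation): both test the column against the given value, then count or take the percentage of matching rows per group.

```python
import pandas as pd

# Hypothetical purchases per customer
df = pd.DataFrame({
    "customer_id": ["a", "a", "a", "b"],
    "category": ["pet food", "toys", "pet food", "toys"],
})

# count_where("pet food"): number of rows where category == "pet food"
# percent_where("pet food"): percentage of rows where category == "pet food"
summary = df.groupby("customer_id").agg(
    n_pet_food=("category", lambda s: (s == "pet food").sum()),
    pct_pet_food=("category", lambda s: (s == "pet food").mean() * 100),
)
print(summary)
```

Customer “a” has two “pet food” rows out of three (≈66.7%), while customer “b” has none.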
input_aggregations
object
One item per input column. Each key should be the name of an input column, and each value an object defining one or more aggregations for that column. An individual aggregation consists of the name of a desired output column, mapped to a specific aggregation function. For example:
{
  "input_col": {
    "output_col": {"func": "sum"}
  }
}
aggregation_func
object
Object defining how to aggregate a single output column. Needs at least the "func" parameter. If the aggregation function accepts further arguments, like the "value" parameter in case of count_where and percent_where, these need to be provided also. For example:
{
  "output_col": {"func": "count_where", "value": 2}
}
func
string
Aggregation function.
Values must be one of the following: n size count sum mean n_unique count_where percent_where concatenate max min first last list
  • Including an aggregation function with additional parameters:
{
  "product_id": {
    "products": {"func": "list"},
    "size": {"func": "count"}
  },
  "item_total": {
    "total": {"func": "sum"}
  },
  "item_category": {
    "num_food_items": {"func": "count_where", "value": "food"}
  }
}