resample
Resamples a dataset of events or time series to the desired frequency.
Converts raw timestamped events, or contiguous time series data, from their original frequency to daily, weekly, monthly, quarterly, yearly or other frequencies. Essentially, groups and aggregates each time or event series by the specified time period and applies desired aggregations (count of events, total spend etc.).
The step accepts input data, and can generate output data, in both a tall
and a wide
format:
-
Tall format
Each row represents a single event or observation, and the dataset contains scalar columns for the event’s timestamp as well as for identifying the series, customer or entity the event belongs to. This is the most common format for event data and the most probable to have been imported in Graphext. -
Wide format
Each row represents a single entity (customer), and the dataset contains columns of lists containing the event timestamps and values for each series or observation. In this case, all lists in the same row must have the same length. This format is the most convenient for analysis in Graphext, as it allows for easy exploration of the time series data. You maintain one row per customer (entity), instead of duplicating the customer’s information for each event, yet you can still access, plot and generally work with all of the customer’s time series.
Note that both formats have the same number of columns. The difference is that in the “tall” format, each row represents a single event, while in the “wide” format, each row represents a single entity and contains all its events. You can think of the wide format as the result of aggregating by the time series identifier, and collecting the timestamps and values in parallel columns of lists.
You can use the parameters below to configure the frequency to resample the data to, the format of the output dataset (tall vs wide), whether to fill gaps in the resampled data etc.
Usage
The following example shows how the step can be used in a recipe.
This examples resamples a dataset ds
of shopping events in “tall” format to a weekly frequency, calculating the number of events per customer per week, weekly total and average spend, and the percentage of purchases in the category “pet food”. The original value columns to be aggredated are “price” and “category”. It also requests the output to be in “wide” format, which is the most convenient in Graphext. Setting fill_gaps
to true
ensures that the resampled data contains rows for all weeks between a customer’s first and last event, even those with no events.
Inputs & Outputs
The following are the inputs expected by the step and the outputs it produces. These are generally
columns (ds.first_name
), datasets (ds
or ds[["first_name", "last_name"]]
) or models (referenced
by name e.g. "churn-clf"
).
Configuration
The following parameters can be used to configure the behaviour of the step by including them in
a json object as the last “input” to the step, i.e. step(..., {"param": "value", ...}) -> (output)
.
Entity identifier(s). I.e. name of the column(s) containing the entity identifier. If there are multiple time series in your dataset, e.g. one per customer, this is the column that identifies the series. If there are multiple ID columns, the time series will be grouped by the unique combination of values in these columns.
Timestamp. Name of the column containing the event timestamp.
Frequency.
Alias of the frequency to resample the data to. The following are the possible values for the freq
parameter.
Also see the corresponding Pandas documentation
for more details on each frequency.
Alias | Description | |
---|---|---|
B | Business day (weekday) | |
D | Calendar day (absolute) | |
W | Week, optionally anchored on a day of the week (W-SUN…) | |
ME | Calendar month end (last day of month) | |
SME | Semi-month end (15th and end of month) | |
BME | Last business day of month | |
MS | Calendar month start (first day mof month) | |
SMS | Semi-month start (1st and 15th) | |
BMS | First business day of month | |
QE | Calendar quarter end | |
BQE | Business quarter end | |
QS | Calendar Quarter start | |
BQS | Business quarter start | |
YE/A/Y | Calendar year end | |
BYE/BA/BY | Business year end | |
YS/AS/YS | Calendar year start | |
BYS/BAS/BYS | Business year start | |
h/H | Hour | |
bh/BH | Business hour | |
min/T | Minute | |
s/S | Second | |
ms/L | Millisecond | |
us/U | Microsecond | |
ns/N | Nanosecond | . |
Values must be one of the following:
B
D
W
M
ME
SM
SME
BM
BME
MS
SMS
BMS
Q
QE
BQ
BQE
QS
BQS
A
Y
YE
BA
BY
BYE
AS
BAS
BYS
h
H
bh
BH
T
min
S
s
L
ms
U
us
N
ns
Anchor. The date to anchor the resampling on. For example, if the frequency is “W” (weekly) and the anchor is “WED”, the resampling will be done on periods between consecutive Wednesdays. If the frequency is “YS” (yearly) and the anchor is “JUL”, the resampling will be annually with each period ending at the end of July. The anchor can only be used with weekly, quarterly and yearly frequencies.
Values must be one of the following:
MON
TUE
WED
THU
FRI
SAT
SUN
JAN
FEB
MAR
APR
MAY
JUN
JUL
AUG
SEP
OCT
NOV
DEC
Output format. The format of the output dataset. In “tall” format, each row represents a single event or observation, and columns contain scalar values. In the “wide” format, each row represents a single entity, and the dataset contains columns of lists (of the same length within each row).
Values must be one of the following:
tall
wide
Fill gaps.
Whether to fill gaps in the resampled data with NaN
/0 values. If set to false
, the resampled data will
only contain rows for which there are events in the original data. If set to true
, the resampled data will
contain rows for all periods in the resampled frequency, with NaN
/0 values for periods with no events.
Definition of desired aggregations.
A dictionary mapping original columns to new aggregated columns, specifying an aggregation function for each.
Aggregations are functions that reduce all the values in a particular column of a single group to a single summary value of that group.
E.g. a sum
aggregation of column A calculates a single total by adding up all the values in A belonging to each group.
In contrast to the more generic aggregate
and group_by
steps, for time series resampling, only functions returning scalar values
are supported. Allowed options for the func
parameters are:
n
,size
orcount
: calculate number of rows in groupsum
: sum total of valuesmean
: take mean of valuesmax
: take max of valuesmin
: take min of valuesfirst
: take first item foundlast
: take last item foundn_unique
: count the number of unique valuesconcatenate
: convert all values to text and concatenate them into one long textcount_where
: number of rows in which the column matches a value, needs parametervalue
with the value that you want to countpercent_where
: percentage of the column where the column matches a value, needs parametervalue
with the value that you want to count
Note that in the case of count_where
and percent_where
an additional value
parameter is required.
Was this page helpful?