featurize_time_series
Summarizes time series data into aggregate metrics.
Extracts features from time series data for machine learning or analysis. Supports three feature sets:
- catch22: 22 time series features, plus optional mean and standard deviation (24 total). See details about each feature here.
- tsfeatures: Statistical features including trend, seasonality, autocorrelation, etc. See details about each feature here.
- growth: Simple, average, compound, and linear growth metrics.
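The page does not define the four growth metrics, so the following is only a minimal sketch under assumed textbook definitions: total relative change, mean period-over-period change, constant compound rate, and least-squares slope. The function name `growth_metrics` and the output keys are hypothetical and may not match what the step computes or how it names its columns.

```python
from statistics import mean


def growth_metrics(values):
    """Return four illustrative growth metrics for one series.

    All four definitions are assumptions, not the step's documented formulas.
    """
    if len(values) < 2 or any(v <= 0 for v in values):
        raise ValueError("need at least two strictly positive values")
    first, last = values[0], values[-1]
    periods = len(values) - 1

    # Simple growth: total relative change from first to last observation.
    simple = (last - first) / first

    # Average growth: mean of the period-over-period relative changes.
    average = mean((b - a) / a for a, b in zip(values, values[1:]))

    # Compound growth: constant per-period rate that turns `first` into `last`.
    compound = (last / first) ** (1 / periods) - 1

    # Linear growth: least-squares slope of value against time index 0..n-1.
    xs = range(len(values))
    x_bar, y_bar = mean(xs), mean(values)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, values)) / sum(
        (x - x_bar) ** 2 for x in xs
    )

    return {"simple": simple, "average": average, "compound": compound, "linear": slope}


metrics = growth_metrics([100.0, 110.0, 121.0])
```

For a series that grows by 10% per step, the average and compound rates coincide under these definitions, while the simple growth reports the total change over the whole series.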
The step takes a dataset with time series data in “tall” format (one row per time point) or “wide” format (time points in columns), and produces a dataset with the calculated features.
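To make the two input layouts concrete, the sketch below builds a small “tall” table and reshapes it into the “wide” layout using plain Python. The column names `id`, `time` and `value` and the helper `tall_to_wide` are illustrative assumptions, not names the step requires.

```python
from collections import defaultdict

# Tall format: one row per (series, time point).
tall_rows = [
    {"id": "a", "time": "2024-01", "value": 10.0},
    {"id": "a", "time": "2024-02", "value": 12.0},
    {"id": "b", "time": "2024-01", "value": 5.0},
    {"id": "b", "time": "2024-02", "value": 4.0},
]


def tall_to_wide(rows, id_col="id", time_col="time", value_col="value"):
    """Reshape tall rows into one row per series, with time points as columns."""
    wide = defaultdict(dict)
    for row in rows:
        wide[row[id_col]][row[time_col]] = row[value_col]
    return [{id_col: key, **points} for key, points in wide.items()]


# Wide format: one row per series, one column per time point.
wide_rows = tall_to_wide(tall_rows)
```

Both layouts carry the same information; the tall layout scales more naturally when series have different lengths or irregular timestamps.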
Usage
The following example shows how the step can be used in a recipe.
To calculate all “growth” metrics:
The general syntax for using the step in a recipe shows the inputs the step expects and the outputs it produces. For further details, see the sections below.
Inputs & Outputs
The following are the inputs expected by the step and the outputs it produces. These are generally columns (ds.first_name), datasets (ds or ds[["first_name", "last_name"]]) or models (referenced by name, e.g. "churn-clf").
Configuration
The following parameters can be used to configure the behaviour of the step by including them in a json object as the last “input” to the step, i.e. step(..., {"param": "value", ...}) -> (output).
Column name containing the time series identifier.
Column name containing the timestamps.
Column name containing the values to featurize.
Feature sets to include.
E.g. “catch22” or “tsfeatures”. If no individual features are configured using the features parameter, all features from the selected set will be computed. If multiple sets are selected, all features from each set will be computed. If all is selected, all features from all sets will be computed.
Values must be one of the following:
catch22
tsfeatures
growth
all
Custom features to compute from each feature set.
Frequency to use for features in the tsfeatures set. This is the number of observations in a single cycle, and is used by features that are based on seasonality (for now only in the tsfeatures set). When a string is provided, it is interpreted as the natural frequency of the time series and translated to the number of observations per cycle using the following mapping:
- ‘H’: 24 (hourly)
- ‘D’: 1 (daily)
- ‘M’: 12 (monthly)
- ‘Q’: 4 (quarterly)
- ‘W’: 1 (weekly)
- ‘Y’: 1 (yearly)
E.g. if the natural frequency of the time series is monthly (‘M’), the step will analyze seasonality with a period of 12 observations (months in a year). If a number is provided, it is interpreted directly as the number of observations per cycle. If null, the step attempts to infer the frequency automatically.
Also see this post by the author of the original tsfeatures package for more details on seasonality and the frequency parameter.
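The resolution rules for the frequency parameter can be sketched as a small lookup. The mapping values are taken from the list above. The function name `resolve_frequency` is hypothetical, and automatic inference for null is not implemented here, since this page does not describe how it works.

```python
# Observations per cycle for each natural frequency, as listed above.
FREQUENCY_MAP = {"H": 24, "D": 1, "M": 12, "Q": 4, "W": 1, "Y": 1}


def resolve_frequency(freq):
    """Translate a frequency setting into observations per cycle.

    Returns None when freq is None, leaving inference to the caller
    (the inference logic itself is outside this sketch).
    """
    if freq is None:
        return None
    if isinstance(freq, bool):
        # bool is a subclass of int in Python; reject it explicitly.
        raise TypeError("frequency must be a string, a number or None")
    if isinstance(freq, int):
        if freq < 1:
            raise ValueError("frequency must be a positive number of observations")
        return freq
    if isinstance(freq, str):
        if freq not in FREQUENCY_MAP:
            raise ValueError(f"unknown frequency code: {freq!r}")
        return FREQUENCY_MAP[freq]
    raise TypeError("frequency must be a string, a number or None")


monthly = resolve_frequency("M")  # string code, looked up in the mapping
custom = resolve_frequency(7)     # number, used directly
```

Note that ‘D’ and ‘W’ both map to 1 in the mapping above, so daily data with a weekly pattern would need the explicit number 7 rather than a string code.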
Temporal unit to use. Only required for converting the time column to timestamps when it is numeric. Y=years, M=months, W=weeks, D=days, h=hours, m=minutes, s=seconds, ms=milliseconds, us=microseconds, ns=nanoseconds.
Values must be one of the following:
Y
M
W
D
h
m
s
ms
us
ns
Output format. The format of the output dataset. The following options are supported:
- “wide”: One row per time series, with features as multivalue (list) columns.
- “tall”: Features joined to the original data, preserving all rows.
Values must be one of the following:
wide
tall
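The difference between the two output formats can be illustrated with a toy featurizer that computes only a mean and a standard deviation per series. This is a sketch under assumptions: the column names and helper functions are hypothetical, and it uses one scalar column per feature for simplicity rather than reproducing the list-valued columns described for the “wide” format.

```python
from statistics import mean, pstdev

rows = [
    {"id": "a", "t": 1, "value": 10.0},
    {"id": "a", "t": 2, "value": 14.0},
    {"id": "b", "t": 1, "value": 5.0},
    {"id": "b", "t": 2, "value": 3.0},
]


def features_by_series(rows):
    """Group values by series id and compute two toy features per series."""
    grouped = {}
    for row in rows:
        grouped.setdefault(row["id"], []).append(row["value"])
    return {key: {"mean": mean(vals), "std": pstdev(vals)} for key, vals in grouped.items()}


def to_wide(rows):
    """One output row per time series."""
    return [{"id": key, **feats} for key, feats in features_by_series(rows).items()]


def to_tall(rows):
    """Features joined back onto every original row, preserving all rows."""
    feats = features_by_series(rows)
    return [{**row, **feats[row["id"]]} for row in rows]


wide_out = to_wide(rows)
tall_out = to_tall(rows)
```

The wide result has as many rows as there are series, while the tall result keeps the original row count and repeats each series' features on all of its rows.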
Number of parallel jobs. If -1, all processors are used. If 1, no parallel computing code is used at all, which is useful for debugging. Using multiple processes with a large dataset may cause memory issues.
Values must be in the following range: