featurize_time_series
Summarizes time series data into aggregate metrics.
Extracts features from time series data for machine learning or analysis. Supports three feature sets:
- catch22: 22 time series features, plus optional mean and standard deviation (24 total). See details about each feature here.
- tsfeatures: Statistical features including trend, seasonality, autocorrelation, etc. See details about each feature here.
- growth: Simple, average, compound, and linear growth metrics.
The step takes a dataset with time series data in “tall” format (one row per time point) or “wide” format (time points in columns), and produces a dataset with the calculated features.
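To make the two input layouts concrete, here is a minimal Python sketch of the same two series in "wide" and "tall" form. The column names (`series_id`, `time`, `value`) and the helper `wide_to_tall` are hypothetical and only illustrate the shapes; they are not part of the step's API.

```python
# Illustrative only: column names and the helper are assumptions, not the step's API.

# "Wide" layout: one row per time series, one column per time point.
wide_rows = [
    {"series_id": "a", "2024-01": 10, "2024-02": 12, "2024-03": 15},
    {"series_id": "b", "2024-01": 7, "2024-02": 7, "2024-03": 6},
]


def wide_to_tall(rows, id_column):
    """Reshape wide rows into the "tall" layout: one row per time point."""
    tall = []
    for row in rows:
        for column, value in row.items():
            if column == id_column:
                continue  # the identifier is repeated on every tall row instead
            tall.append({id_column: row[id_column], "time": column, "value": value})
    return tall


tall_rows = wide_to_tall(wide_rows, "series_id")
```

In the tall layout each series contributes as many rows as it has time points, so the two series above become six rows.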
Usage
The following example shows how the step can be used in a recipe.
Examples
To calculate all “growth” metrics:
General syntax for using the step in a recipe, showing the inputs the step expects and the outputs it produces. For further details, see the sections below.
Inputs & Outputs
The following are the inputs expected by the step and the outputs it produces. These are generally columns (ds.first_name), datasets (ds or ds[["first_name", "last_name"]]) or models (referenced by name, e.g. "churn-clf").
Inputs
A dataset containing time series.
Outputs
A dataset containing time series features.
Configuration
The following parameters can be used to configure the behaviour of the step by including them in a json object as the last “input” to the step, i.e. step(..., {"param": "value", ...}) -> (output).
Parameters
Column name containing the time series identifier.
Column name containing the timestamps.
Column name containing the values to featurize.
Feature sets to include.
E.g. “catch22” or “tsfeatures”. If no individual features are configured using the features parameter, all features from the selected set will be computed. If multiple sets are selected, all features from each set will be computed. If all is selected, all features from all sets will be computed.
Values must be one of the following:
catch22
tsfeatures
growth
all
Array items
Each item in array.
Values must be one of the following:
catch22
tsfeatures
growth
all
Custom features to compute from each feature set.
Properties
Catch22 features to compute. See here for detailed information about each possible feature.
Array items
Each item should be a name of a Catch22 feature.
Values must be one of the following:
mode_5
mode_10
acf_timescale
acf_first_min
ami2
trev
high_fluctuation
stretch_high
transition_matrix
periodicity
embedding_dist
ami_timescale
whiten_timescale
outlier_timing_pos
outlier_timing_neg
centroid_freq
stretch_decreasing
entropy_pairs
rs_range
dfa
low_freq_power
forecast_error
mean
SD
TSFeatures features to compute. See here for detailed information about each possible feature.
Array items
Each item should be the name of a TSFeatures feature.
Values must be one of the following:
acf_features
arch_stat
crossing_points
entropy
flat_spots
heterogeneity
holt_parameters
lumpiness
nonlinearity
pacf_features
stl_features
stability
hw_parameters
unitroot_kpss
unitroot_pp
series_length
hurst
Growth features to compute. The different growth features are calculated as follows, where $x_n$ is the final value, $x_0$ is the initial value, and $n$ is the number of periods in a time series.
"simple"
Fractional change between first and last value. Maintains direction of growth by dividing the change by the absolute value of the initial value: $(x_n - x_0) / |x_0|$
"average"
The average fractional change between consecutive values. Also maintains direction, unlike e.g. the pandas pct_change function: $\frac{1}{n} \sum_{t=1}^{n} (x_t - x_{t-1}) / |x_{t-1}|$
"compound"
Analogous to CAGR (Compound Annual Growth Rate). The average growth rate over the entire period, assuming the growth is compounded: $(x_n / x_0)^{1/n} - 1$
"linear"
Fits a linear regression to the time series and returns the slope of the line.
Array items
Each item in array.
Values must be one of the following:
simple
average
compound
linear
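The four growth metrics described above can be sketched in plain Python. This is an illustration under stated assumptions, not the step's actual implementation: the function names are hypothetical, the number of periods is taken to be the number of observations minus one, and the linear slope is fitted against evenly spaced integer time indices (0, 1, 2, ...).

```python
# Hypothetical helper names; a sketch of the described metrics, not the step's code.

def simple_growth(values):
    """Fractional change between first and last value, relative to |first|."""
    first, last = values[0], values[-1]
    return (last - first) / abs(first)


def average_growth(values):
    """Mean fractional change between consecutive values, relative to |previous|."""
    changes = [(curr - prev) / abs(prev) for prev, curr in zip(values, values[1:])]
    return sum(changes) / len(changes)


def compound_growth(values):
    """CAGR-style rate: the constant per-period rate that turns first into last."""
    periods = len(values) - 1  # assumption: periods = observations - 1
    return (values[-1] / values[0]) ** (1 / periods) - 1


def linear_growth(values):
    """Ordinary least-squares slope of values against the indices 0..n-1."""
    n = len(values)
    x_mean = (n - 1) / 2
    y_mean = sum(values) / n
    cov = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(values))
    var = sum((x - x_mean) ** 2 for x in range(n))
    return cov / var


series = [100, 110, 121]
growth = {
    "simple": simple_growth(series),
    "average": average_growth(series),
    "compound": compound_growth(series),
    "linear": linear_growth(series),
}
```

Dividing by the absolute value is what preserves direction: for the series [-100, -50] the simple growth in this sketch is 0.5 (the value moved upwards), whereas dividing by the signed initial value would give -0.5.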
Examples
- E.g. deriving two features from catch22 and growth sets each:
Frequency used by features in the TSFeatures set, i.e. the number of observations in a single cycle. Used by features that are based on seasonality (for now only in the tsfeatures set). When a string is provided, it is interpreted as the natural frequency of the time series and translated to the number of observations per cycle using the following mapping:
- ‘H’: 24 (hourly)
- ‘D’: 1 (daily)
- ‘M’: 12 (monthly)
- ‘Q’: 4 (quarterly)
- ‘W’: 1 (weekly)
- ‘Y’: 1 (yearly)
E.g. if the natural frequency of the time series is monthly (‘M’), the step will analyze seasonality with a period of 12 observations (months in a year). If a number is provided, it is interpreted directly as the number of observations per cycle. If null, the step attempts to infer the frequency automatically.
Also see this post by the author of the original tsfeatures package for more details on seasonality and the frequency parameter.
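The way a frequency value is resolved to a number of observations per cycle can be sketched as follows. The mapping is copied from the list above; the function name `resolve_frequency` is hypothetical, and automatic inference is not modelled here (the sketch simply returns `None` to signal that inference would be needed).

```python
# Mapping taken from the documentation above; the helper itself is an assumption.
FREQ_TO_PERIOD = {"H": 24, "D": 1, "M": 12, "Q": 4, "W": 1, "Y": 1}


def resolve_frequency(freq):
    """Return observations per cycle, or None if the frequency must be inferred."""
    if freq is None:
        return None  # inference is left to the caller in this sketch
    if isinstance(freq, str):
        if freq not in FREQ_TO_PERIOD:
            raise ValueError(f"Unknown frequency code: {freq!r}")
        return FREQ_TO_PERIOD[freq]
    return int(freq)  # a number is used directly as the cycle length


monthly_period = resolve_frequency("M")  # string code looked up in the mapping
custom_period = resolve_frequency(7)     # e.g. daily data with a weekly cycle
```

Passing a number is the way to express cycles the string codes do not cover, such as a 7-observation weekly cycle in daily data.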
Temporal unit to use. Only required for converting the time column to timestamps when it is numeric. Y=years, M=months, W=weeks, D=days, h=hours, m=minutes, s=seconds, ms=milliseconds, us=microseconds, ns=nanoseconds.
Values must be one of the following:
Y
M
W
D
h
m
s
ms
us
ns
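As a rough illustration of what a temporal unit means for a numeric time column, the sketch below converts integers to timestamps using only the Python standard library. It assumes the numbers count units elapsed since the Unix epoch (1970-01-01 UTC), which is how NumPy's `datetime64` interprets the same unit codes; whether the step uses the same reference point is an assumption. The helper name `to_timestamp` is hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Assumption: numeric times are offsets from the Unix epoch in the given unit.
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

# Units with a fixed duration can be expressed directly as timedeltas.
FIXED_UNITS = {
    "W": timedelta(weeks=1),
    "D": timedelta(days=1),
    "h": timedelta(hours=1),
    "m": timedelta(minutes=1),
    "s": timedelta(seconds=1),
    "ms": timedelta(milliseconds=1),
    "us": timedelta(microseconds=1),
}


def to_timestamp(value: int, unit: str) -> datetime:
    """Convert an integer count of `unit` since the epoch to a UTC datetime."""
    if unit == "Y":  # calendar years have no fixed length
        return EPOCH.replace(year=1970 + value)
    if unit == "M":  # calendar months have no fixed length
        years, months = divmod(value, 12)
        return EPOCH.replace(year=1970 + years, month=1 + months)
    if unit == "ns":  # datetime only resolves microseconds, so truncate
        return EPOCH + timedelta(microseconds=value // 1000)
    return EPOCH + FIXED_UNITS[unit] * value


second_day = to_timestamp(1, "D")
march_1971 = to_timestamp(14, "M")
```

Years and months are handled with calendar arithmetic because their length varies, while the remaining units are plain multiples of a fixed duration.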
Output format. The format of the output dataset. The following options are supported:
- “wide”: One row per time series with features as multivalues (list) columns
- “tall”: Features joined to the original data, preserving all rows.
Values must be one of the following:
wide
tall
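The difference between the two output formats can be sketched with plain Python dictionaries. The column names, feature values, and the helpers `to_wide` and `to_tall` are hypothetical; the sketch only shows the row structure (one row per series versus features repeated on every original row) and does not model how the step stores features within columns.

```python
# Hypothetical data and helpers; this illustrates row structure only.
original_rows = [
    {"series_id": "a", "time": 1, "value": 10},
    {"series_id": "a", "time": 2, "value": 12},
    {"series_id": "b", "time": 1, "value": 7},
]

# Assume these features were already computed, one set per series.
features_by_series = {
    "a": {"growth_simple": 0.2},
    "b": {"growth_simple": 0.0},
}


def to_wide(features, id_column):
    """One output row per time series."""
    return [{id_column: sid, **feats} for sid, feats in features.items()]


def to_tall(rows, features, id_column):
    """Join each series' features onto every original row, keeping all rows."""
    return [{**row, **features[row[id_column]]} for row in rows]


wide_output = to_wide(features_by_series, "series_id")
tall_output = to_tall(original_rows, features_by_series, "series_id")
```

The tall output keeps the original row count, which is convenient when the features need to sit next to the raw observations; the wide output is more compact when only per-series features are needed.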
Number of parallel jobs. If -1, all processors are used. If 1, no parallel computing code is used at all, which is useful for debugging. Using multiple processes with a large dataset may cause memory issues.
Values must be in the following range: