Extracts features from time series data for machine learning or analysis. Supports three feature sets:
  • catch22: 22 time series features, plus optional mean and standard deviation (24 total). See details about each feature here.
  • tsfeatures: Statistical features including trend, seasonality, autocorrelation, etc. See details about each feature here.
  • growth: Simple, average, compound, and linear growth metrics.
The step takes a dataset with time series data in “tall” format (one row per time point) or “wide” format (time points in columns), and produces a dataset with the calculated features.
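As a rough illustration of what "tall" format means, the sketch below groups hypothetical tall rows (one row per time point) into one time-ordered list of values per series. The rows, the column names and the `group_series` helper are assumptions made for illustration only; this is not how the step itself is implemented.

```python
from collections import defaultdict

# Hypothetical tall-format rows: one row per (series id, time point).
tall_rows = [
    {"product_id": "A", "time_added": 2, "item_total": 12.0},
    {"product_id": "A", "time_added": 1, "item_total": 10.0},
    {"product_id": "B", "time_added": 1, "item_total": 5.0},
    {"product_id": "B", "time_added": 2, "item_total": 4.0},
]


def group_series(rows, id_col, time_col, value_col):
    """Collect tall rows into one time-ordered list of values per series id."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row[id_col]].append((row[time_col], row[value_col]))
    # Sort each series by time so features see values in chronological order.
    return {key: [v for _, v in sorted(pairs)] for key, pairs in grouped.items()}


series = group_series(tall_rows, "product_id", "time_added", "item_total")
```

Each resulting list corresponds to one time series, identified by the column that would be passed as `id`.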

Usage

The following example shows how the step can be used in a recipe.

Examples

  • Example 1
To calculate all “growth” metrics:
featurize_time_series(ds, {
  "id": "product_id",
  "time": "time_added",
  "value": "item_total",
  "sets": ["growth"]
}) -> (features)

Inputs & Outputs

The following are the inputs expected by the step and the outputs it produces. These are generally columns (ds.first_name), datasets (ds or ds[["first_name", "last_name"]]) or models (referenced by name e.g. "churn-clf").
ds
dataset
required
A dataset containing time series.
features
dataset
required
A dataset containing time series features.

Configuration

The following parameters configure the behaviour of the step. Include them in a JSON object as the last “input” to the step, i.e. step(..., {"param": "value", ...}) -> (output).

Parameters

id
string (ds.column)
required
Column name containing the time series identifier.
time
string (ds.column)
required
Column name containing the timestamps.
value
string (ds.column)
required
Column name containing the values to featurize.
sets
[string, array[string]]
default:"catch22"
Feature sets to include. E.g. “catch22” or “tsfeatures”. If no individual features are configured using the features parameter, all features from the selected set will be computed. If multiple sets are selected, all features from each set will be computed. If all is selected, all features from all sets will be computed. Values must be one of the following:
  • catch22
  • tsfeatures
  • growth
  • all
Item
string
Each item in array. Values must be one of the following:
  • catch22
  • tsfeatures
  • growth
  • all
features
object
Custom features to compute from each feature set.
catch22
[array[string], string]
Catch22 features to compute. See here for detailed information about each possible feature.
Item
string
Each item should be the name of a Catch22 feature. Values must be one of the following: mode_5 mode_10 acf_timescale acf_first_min ami2 trev high_fluctuation stretch_high transition_matrix periodicity embedding_dist ami_timescale whiten_timescale outlier_timing_pos outlier_timing_neg centroid_freq stretch_decreasing entropy_pairs rs_range dfa low_freq_power forecast_error mean SD
tsfeatures
[array[string], string]
TSFeatures features to compute. See here for detailed information about each possible feature.
Item
string
Each item should be the name of a TSFeatures feature. Values must be one of the following: acf_features arch_stat crossing_points entropy flat_spots heterogeneity holt_parameters lumpiness nonlinearity pacf_features stl_features stability hw_parameters unitroot_kpss unitroot_pp series_length hurst
growth
[array[string], string]
Growth features to compute. The different growth features are calculated as follows, where $x_f$ is the final value, $x_0$ is the initial value, and $n$ is the number of periods in a time series.
  • "simple": Fractional change between the first and last value. Maintains the direction of growth by dividing the change by the absolute value of the initial value: $g = \frac{x_f - x_0}{|x_0|}$
  • "average": The average fractional change between consecutive values. Also maintains direction, unlike e.g. the pandas pct_change function: $g = \frac{1}{n} \sum_{i=1}^{n} \frac{x_i - x_{i-1}}{|x_{i-1}|}$
  • "compound": Analogous to CAGR (Compound Annual Growth Rate). The average growth rate over the entire period, assuming the growth is compounded: $g = \left( \frac{x_f}{x_0} \right)^{\frac{1}{n}} - 1$
  • "linear": Fits a linear regression to the time series and returns the slope of the line.
Item
string
Each item in array. Values must be one of the following:
  • simple
  • average
  • compound
  • linear
E.g. deriving two features each from the catch22 and growth sets:
{
  "catch22": ["mode_5", "acf_timescale"],
  "growth": ["simple", "linear"]
}
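The four growth formulas described for the growth set can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the step's implementation: the `growth_features` helper is hypothetical, $n$ is taken to be the number of values minus one, and the "linear" slope is fitted against the observation index rather than actual timestamps.

```python
def growth_features(values):
    """Return simple, average, compound and linear growth for one series.

    Assumes `values` is ordered in time, has at least two points, and that
    n (the number of periods) equals len(values) - 1.
    """
    x0, xf = values[0], values[-1]
    n = len(values) - 1

    # Fractional change from first to last value, keeping the sign of the change.
    simple = (xf - x0) / abs(x0)

    # Mean fractional change between consecutive values.
    average = sum(
        (curr - prev) / abs(prev) for prev, curr in zip(values, values[1:])
    ) / n

    # Per-period rate that compounds from x0 to xf over n periods.
    compound = (xf / x0) ** (1 / n) - 1

    # Least-squares slope against the observation index 0..n (an assumption;
    # the step may regress against actual timestamps instead).
    t_mean = n / 2
    v_mean = sum(values) / len(values)
    slope = sum((t - t_mean) * (v - v_mean) for t, v in enumerate(values)) / sum(
        (t - t_mean) ** 2 for t in range(len(values))
    )

    return {"simple": simple, "average": average, "compound": compound, "linear": slope}


g = growth_features([100.0, 110.0, 121.0])
```

All but the linear slope divide by earlier values, so they are undefined when one of those values is zero, and the compound formula only yields a real number when $x_f / x_0$ is positive.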
freq
[null, string, integer]
Frequency used by features in the TSFeatures set: the number of observations in a single cycle. Used by features that are based on seasonality (for now only in the tsfeatures set). When a string is provided, it is interpreted as the natural frequency of the time series and translated to the number of observations per cycle using the following mapping:
  • ‘H’: 24 (hourly)
  • ‘D’: 1 (daily)
  • ‘M’: 12 (monthly)
  • ‘Q’: 4 (quarterly)
  • ‘W’: 1 (weekly)
  • ‘Y’: 1 (yearly)
E.g. if the natural frequency of the time series is monthly (‘M’), the step will analyze seasonality with a period of 12 observations (months in a year). If a number is provided, it is interpreted directly as the number of observations per cycle. If null, the step attempts to infer the frequency automatically. Also see this post by the author of the original tsfeatures package for more details on seasonality and the frequency parameter.
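The translation from a freq setting to observations per cycle can be sketched as a simple lookup. The mapping values are the ones listed above; the `resolve_freq` helper and its error handling are hypothetical, and automatic inference for null is left out.

```python
# Mapping as documented for the `freq` parameter.
FREQ_TO_PERIOD = {"H": 24, "D": 1, "M": 12, "Q": 4, "W": 1, "Y": 1}


def resolve_freq(freq):
    """Translate a `freq` setting into the number of observations per cycle.

    Returns None when freq is None, leaving inference to the caller.
    """
    if freq is None:
        return None
    if isinstance(freq, int):
        # Numbers are used directly as observations per cycle.
        return freq
    try:
        return FREQ_TO_PERIOD[freq]
    except KeyError:
        raise ValueError(f"Unsupported frequency string: {freq!r}") from None
```

For example, `resolve_freq("M")` yields 12, while `resolve_freq(7)` passes the number through unchanged, which is how a weekly cycle in daily data could be requested.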
unit
string
default:"D"
Temporal unit to use. Only required for converting the time column to timestamps when it is numeric. Y=years, M=months, W=weeks, D=days, h=hours, m=minutes, s=seconds, ms=milliseconds, us=microseconds, ns=nanoseconds. Values must be one of the following: Y M W D h m s ms us ns
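To make the role of unit concrete, here is a minimal sketch of turning a numeric time value into a timestamp. It assumes the numbers count units elapsed since the Unix epoch, which this page does not state, and the `to_timestamp` helper is hypothetical. Only fixed-length units are covered: "Y" and "M" vary in length, and "ns" is finer than the standard library's microsecond resolution.

```python
from datetime import datetime, timedelta, timezone

# Fixed-length units only; "Y", "M" and "ns" are left out of this sketch.
UNIT_TO_TIMEDELTA = {
    "W": timedelta(weeks=1),
    "D": timedelta(days=1),
    "h": timedelta(hours=1),
    "m": timedelta(minutes=1),
    "s": timedelta(seconds=1),
    "ms": timedelta(milliseconds=1),
    "us": timedelta(microseconds=1),
}

# Assumed origin for numeric time values (not confirmed by the documentation).
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_timestamp(value, unit="D"):
    """Interpret `value` as a count of `unit`s elapsed since EPOCH."""
    return EPOCH + value * UNIT_TO_TIMEDELTA[unit]
```

With the default unit "D", a time value of 3 would map to three days after the assumed origin; the same value with unit "h" would map to three hours after it.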
output
string
default:"wide"
Output format. The format of the output dataset. The following options are supported:
  • “wide”: One row per time series with features as multivalues (list) columns
  • “tall”: Features joined to the original data, preserving all rows.
Values must be one of the following:
  • wide
  • tall
n_jobs
integer
default:"-1"
Number of parallel jobs. If -1, all processors are used. If 1, no parallel computing code is used at all, which is useful for debugging. Using multiple processes with a large dataset may cause memory issues. Values must be in the following range:
-1 ≤ n_jobs < inf