This step calculates the duration between a start date and an end date and determines whether an event was observed. The output consists of two columns:

  • duration: The time interval between the start and end dates in the specified unit (default: days).
  • observed: A boolean column indicating whether the event was observed (i.e., whether the end date falls on or before the end of the observation period).

This is particularly useful for preparing input data for survival analysis, such as Kaplan-Meier curves, where the event observation (censoring) status and duration are key inputs.

  • If either start_date or end_date is missing (null), observed will be false and duration will be null.
  • Otherwise, duration is calculated as the interval between start_date and end_date.
  • When both dates are present, observed will be true if end_date <= observation_end and false otherwise.
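
The rules above can be sketched in plain Python for a single row. This is only an illustration under stated assumptions, not the step's actual implementation: the function name time_to_event is hypothetical, dates are assumed to be datetime.date values, and the duration is assumed to be measured in whole days (the documented default unit).

```python
from datetime import date
from typing import Optional, Tuple


def time_to_event(
    start_date: Optional[date],
    end_date: Optional[date],
    observation_end: date,
) -> Tuple[Optional[int], bool]:
    """Return (duration in days, observed) for one row.

    Hypothetical helper mirroring the documented rules; not the step's API.
    """
    # Rule 1: a missing start or end date gives a null duration
    # and an unobserved event.
    if start_date is None or end_date is None:
        return None, False

    # Rule 2: duration is the interval between start and end (whole days here).
    duration = (end_date - start_date).days

    # Rule 3: the event counts as observed only if it ended on or before
    # the end of the observation period.
    observed = end_date <= observation_end
    return duration, observed


obs_end = date(2024, 6, 30)
print(time_to_event(date(2024, 1, 1), date(2024, 1, 31), obs_end))   # (30, True)
print(time_to_event(date(2024, 1, 1), None, obs_end))                # (None, False)
print(time_to_event(date(2024, 1, 1), date(2024, 12, 31), obs_end))  # (365, False)
```

The resulting pairs correspond to the duration and observed columns that survival analysis methods such as Kaplan-Meier estimation take as input.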

Usage

The following examples show how the step can be used in a recipe.

Inputs & Outputs

The following are the inputs expected by the step and the outputs it produces. These are generally columns (ds.first_name), datasets (ds or ds[["first_name", "last_name"]]) or models (referenced by name, e.g. "churn-clf").

Configuration

The following parameters can be used to configure the behaviour of the step by including them in a JSON object as the last “input” to the step, i.e. step(..., {"param": "value", ...}) -> (output).