Usage

The following example shows how the step can be used in a recipe.

Examples

To keep only the first row in each set of duplicates, where duplicates are identified by the values in the columns “address” and “name”:
filter_duplicates(ds, {"columns": ["address", "name"], "keep": "first"}) -> (ds_filtered)
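
To make the effect of this call concrete, here is a minimal plain-Python sketch of the same selection. It is an illustration only, not the step's implementation: the dataset is modelled as a list of dicts, and the helper name `filter_duplicates_first` and the sample rows are hypothetical.

```python
def filter_duplicates_first(rows, columns):
    """Keep the first row of each duplicate set, comparing only `columns`."""
    seen = set()
    kept = []
    for row in rows:
        # Two rows are duplicates when they agree on every inspected column.
        key = tuple(row[c] for c in columns)
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept


# Hypothetical sample data: rows 1 and 2 share both address and name.
ds = [
    {"name": "Ana", "address": "1 Main St", "phone": "111"},
    {"name": "Ana", "address": "1 Main St", "phone": "222"},
    {"name": "Ana", "address": "9 Oak Ave", "phone": "333"},
    {"name": "Bo", "address": "1 Main St", "phone": "444"},
]

ds_filtered = filter_duplicates_first(ds, ["address", "name"])
for row in ds_filtered:
    print(row)
```

Only the second row is dropped: it matches the first on both inspected columns, while the other rows differ in at least one of them. The "phone" column plays no part in the comparison but is still present in the result.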

Inputs & Outputs

The following are the inputs expected by the step and the outputs it produces. These are generally columns (ds.first_name), datasets (ds or ds[["first_name", "last_name"]]) or models (referenced by name e.g. "churn-clf").
ds_in
dataset
required
An input dataset to filter.
ds_out
dataset
required
A dataset with the same columns as the input dataset, containing either the rows that remain after de-duplication or, if exclude is true, only the duplicate rows.

Configuration

The following parameters can be used to configure the behaviour of the step by including them in a JSON object as the last “input” to the step, i.e. step(..., {"param": "value", ...}) -> (output).

Parameters

columns
[array[string], null]
Names of the columns used to detect and filter rows containing duplicate values. If not provided, all columns are inspected. Note that multivalued columns, i.e. those containing lists of values, are always ignored when searching for duplicates (but are included in the result).
Item
string (ds_in.column)
Each item in array.
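
The column-selection rule above can be sketched in plain Python as follows. This is a hedged illustration, not the step's code: the helper `comparable_columns` is hypothetical, a dataset is modelled as a non-empty list of dicts, and a column is assumed to count as multivalued when any of its values is a list.

```python
def comparable_columns(rows, columns=None):
    """Return the columns that take part in duplicate detection.

    Assumption: `rows` is a non-empty list of dicts sharing the same keys.
    """
    # No explicit selection means every column is a candidate.
    names = list(rows[0]) if columns is None else list(columns)
    # Multivalued (list-valued) columns are skipped during comparison.
    return [c for c in names if not any(isinstance(r[c], list) for r in rows)]


rows = [
    {"name": "Ana", "city": "Lima", "tags": ["a", "b"]},
    {"name": "Ana", "city": "Lima", "tags": ["c"]},
    {"name": "Bo", "city": "Lima", "tags": ["a", "b"]},
]

cols = comparable_columns(rows)
keys = [tuple(r[c] for c in cols) for r in rows]
print(cols)
print(keys)
```

Under these assumptions the first two rows count as duplicates even though their "tags" values differ, because "tags" is left out of the comparison.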
keep
string
default:"first"
Which row in each set of duplicates to keep in the result: the first or the last. Values must be one of the following:
  • first
  • last
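
The difference between the two values can be illustrated with a short plain-Python sketch. The helper `dedupe` and the sample rows are hypothetical, and the sketch assumes that the kept rows stay in their original relative order, which the text above does not state.

```python
def dedupe(rows, columns, keep="first"):
    """Keep one row per duplicate set: the first or the last occurrence."""
    if keep not in ("first", "last"):
        raise ValueError('keep must be "first" or "last"')
    chosen = {}  # duplicate key -> index of the row to keep
    for i, row in enumerate(rows):
        key = tuple(row[c] for c in columns)
        # "first": record only the first index seen; "last": keep overwriting.
        if keep == "last" or key not in chosen:
            chosen[key] = i
    # Assumption: output preserves the input order of the kept rows.
    return [rows[i] for i in sorted(chosen.values())]


rows = [
    {"id": 1, "email": "a@x.com"},
    {"id": 2, "email": "b@x.com"},
    {"id": 3, "email": "a@x.com"},
]

kept_first = dedupe(rows, ["email"], keep="first")
kept_last = dedupe(rows, ["email"], keep="last")
print([r["id"] for r in kept_first])
print([r["id"] for r in kept_last])
```

With "first", the row with id 1 represents the duplicated email; with "last", the row with id 3 does.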
exclude
boolean
default:"false"
If true, inverts the row selection, i.e. only the rows identified as duplicates (in the selected columns) are included in the resulting dataset.
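
One way to read the inverted selection is sketched below in plain Python. It assumes that "inverts" means returning exactly the rows that would otherwise have been dropped, so the row chosen by keep is itself left out; this interpretation, the helper `filter_duplicates`, and the sample rows are assumptions for illustration only.

```python
def filter_duplicates(rows, columns, keep="first", exclude=False):
    """Return the de-duplicated rows, or the dropped ones when exclude is true."""
    keys = [tuple(r[c] for c in columns) for r in rows]
    # Walk forwards for "first" and backwards for "last".
    if keep == "first":
        order = range(len(rows))
    else:
        order = range(len(rows) - 1, -1, -1)
    seen, retained = set(), set()
    for i in order:
        if keys[i] not in seen:
            seen.add(keys[i])
            retained.add(i)
    # exclude=False keeps the retained rows; exclude=True keeps the rest.
    return [r for i, r in enumerate(rows) if (i in retained) != exclude]


rows = [
    {"id": 1, "email": "a@x.com"},
    {"id": 2, "email": "b@x.com"},
    {"id": 3, "email": "a@x.com"},
    {"id": 4, "email": "a@x.com"},
]

normal = filter_duplicates(rows, ["email"])
inverted = filter_duplicates(rows, ["email"], exclude=True)
print([r["id"] for r in normal])
print([r["id"] for r in inverted])
```

Under this reading the two results partition the input: every row appears in exactly one of them.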
presort
object
Row sorting to apply before de-duplication. Useful when which row counts as the first or last duplicate should depend on the values of other columns. If not configured, no sorting is performed.
by
[null, string, array]
Sort column name(s). These column(s) will be used to sort the dataset before de-duplication (if multiple, in specified order).
ascending
[boolean, array[boolean]]
default:"true"
Whether to sort in ascending order (or in descending order if false). If an array, must have the same length as by and specify the sort order for each column.
Item
boolean
Each item in array.
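
The interaction between presort and keep can be sketched in plain Python as below. This is an illustration under stated assumptions rather than the step's implementation: the helpers `presort` and `keep_first`, the "updated" column, and the sample rows are hypothetical, and the result is assumed to stay in the sorted order.

```python
def presort(rows, by=None, ascending=True):
    """Sort rows by one or more columns before de-duplication."""
    if by is None:
        return list(rows)  # no sorting configured
    cols = [by] if isinstance(by, str) else list(by)
    if isinstance(ascending, bool):
        orders = [ascending] * len(cols)
    else:
        orders = list(ascending)
    if len(orders) != len(cols):
        raise ValueError("ascending must have the same length as by")
    result = list(rows)
    # Stable sorts applied from the least to the most significant column.
    for col, asc in reversed(list(zip(cols, orders))):
        result.sort(key=lambda r: r[col], reverse=not asc)
    return result


def keep_first(rows, columns):
    """Keep the first row of each duplicate set."""
    seen, kept = set(), []
    for row in rows:
        key = tuple(row[c] for c in columns)
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept


rows = [
    {"email": "a@x.com", "updated": "2023-01-01"},
    {"email": "b@x.com", "updated": "2023-03-01"},
    {"email": "a@x.com", "updated": "2023-06-01"},
]

# Newest first, so keeping the "first" duplicate retains the latest record.
latest = keep_first(presort(rows, by="updated", ascending=False), ["email"])
print(latest)
```

Sorting in descending order by a timestamp and keeping the first duplicate is one way to retain the most recent record per key; sorting ascending and keeping the last duplicate would select the same rows.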