Filter duplicates

duplicates, deduplicate

Filter duplicate rows, keeping only the first or last row of each set of duplicates found.

Usage


The following are the step's expected inputs and outputs and their specific types.

Step signature
filter_duplicates(ds_in: dataset, {
    "param": value
}) -> (ds_out: dataset)

where the object {"param": value} is optional in most cases and, if present, may contain any of the parameters described in the corresponding section below.

Example

To keep only the first row amongst a set of duplicates, identifying duplicates by inspecting values in columns "address" and "name":

Example call (in recipe editor)
filter_duplicates(ds, {"columns": ["address", "name"], "keep": "first"}) -> (ds_filtered)

Inputs


ds_in: dataset

An input dataset to filter.

Outputs


ds_out: dataset

A dataset containing the same columns as the input dataset, with the duplicate rows removed by default, or containing only the duplicate rows if exclude is true.

Parameters


columns: array | null

Names of columns used to detect and filter rows containing duplicate values. If not provided, all columns are inspected. Note that multivalued columns, i.e. those containing lists of values, are always ignored when searching for duplicates (but are included in the result).
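As a rough illustration of the column selection described above, the following Python sketch picks the columns that would take part in the comparison. It is not the step's implementation: the helper name comparison_columns, the representation of rows as dictionaries, and the rule that a column is multivalued if any of its cells is a Python list are all assumptions made for this example.

```python
def comparison_columns(rows, columns=None):
    """Hypothetical helper: columns used to detect duplicates."""
    if not rows:
        return []
    # With no explicit selection, start from every column of the first row.
    candidates = list(rows[0]) if columns is None else list(columns)
    # Assumption: a column is multivalued if any of its cells is a list.
    # Such columns are skipped for comparison but stay in the rows.
    return [
        c for c in candidates
        if not any(isinstance(row[c], list) for row in rows)
    ]


rows = [
    {"name": "Ann", "address": "1 Main St", "tags": ["vip", "new"]},
    {"name": "Ann", "address": "1 Main St", "tags": ["vip"]},
]
used = comparison_columns(rows)
```

Under these assumptions the "tags" column is left out of the comparison, so the two rows above would count as duplicates of each other even though their tag lists differ.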


keep: string = "first"

Which of a duplicate set of rows to keep in the result. Specifically, whether to keep the first or last row amongst the duplicates.

Must be one of: "first", "last"
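The difference between the two values can be sketched in plain Python. This is only a model of the behaviour described here, not the step's actual code: the function below is a hypothetical stand-in, rows are represented as dictionaries with hashable values, and preserving the original row order in the result is an assumption.

```python
def filter_duplicates(rows, columns=None, keep="first"):
    """Illustrative sketch only; not the step's implementation."""
    if keep not in ("first", "last"):
        raise ValueError('keep must be "first" or "last"')
    # Walk the rows backwards when the last occurrence should win.
    indices = list(range(len(rows)))
    if keep == "last":
        indices.reverse()
    seen = set()
    kept = []
    for i in indices:
        row = rows[i]
        cols = columns if columns is not None else sorted(row)
        key = tuple(row[c] for c in cols)
        if key not in seen:
            seen.add(key)
            kept.append(i)
    # Assumption: the result keeps the input's row order.
    return [rows[i] for i in sorted(kept)]


rows = [
    {"name": "Ann", "address": "1 Main St", "visits": 1},
    {"name": "Bob", "address": "2 Oak Ave", "visits": 2},
    {"name": "Ann", "address": "1 Main St", "visits": 3},
]
first = filter_duplicates(rows, ["address", "name"], keep="first")
last = filter_duplicates(rows, ["address", "name"], keep="last")
```

With keep set to "first" the sketch retains the rows with visits 1 and 2; with "last" it retains the rows with visits 2 and 3.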


exclude: boolean = False

If true, inverts the row selection, i.e. only rows that are duplicates (in the selected columns) will be included in the resulting dataset.
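One way to picture the inversion is as a boolean mask over the rows. The Python sketch below is a model under stated assumptions, not the step's implementation: duplicate_mask and select_rows are hypothetical names, rows are dictionaries, and "duplicates" is read as the rows that would otherwise have been dropped (every member of a duplicate set except the one chosen by keep).

```python
def duplicate_mask(rows, columns, keep="first"):
    """True for each row that would be dropped as a duplicate."""
    indices = list(range(len(rows)))
    if keep == "last":
        indices.reverse()
    seen = set()
    mask = [False] * len(rows)
    for i in indices:
        key = tuple(rows[i][c] for c in columns)
        mask[i] = key in seen  # already seen means this row is a duplicate
        seen.add(key)
    return mask


def select_rows(rows, columns, keep="first", exclude=False):
    """Keep non-duplicates by default; keep only duplicates if exclude."""
    mask = duplicate_mask(rows, columns, keep)
    return [row for row, is_dup in zip(rows, mask) if is_dup == exclude]


rows = [
    {"name": "Ann", "visits": 1},
    {"name": "Bob", "visits": 2},
    {"name": "Ann", "visits": 3},
]
only_duplicates = select_rows(rows, ["name"], keep="first", exclude=True)
```

In this reading, the default call and the call with exclude set to true partition the input: every row lands in exactly one of the two results.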