filter_duplicates
Filter duplicate rows, keeping only the first or last row of each set of duplicates.
An input dataset to filter.
A dataset containing the same columns as the input dataset, with the duplicate rows either excluded or, if the selection is inverted, included exclusively.
Names of columns used to detect and filter rows containing duplicate values. If not provided, all columns are inspected. Note that multivalued columns, i.e. those containing lists of values, are always ignored when searching for duplicates (but are included in the result).
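As a rough illustration of the column selection described above, here is a minimal Python sketch. It is not the step's implementation: the helper name `key_columns`, the representation of rows as dictionaries, and the rule that a column counts as multivalued when any row holds a list in it are all assumptions made for the example.

```python
def key_columns(rows, columns=None):
    """Return the column names used to compare rows for duplicates."""
    # Default to all columns, in first-seen order, when none are given.
    if columns is None:
        columns = []
        for row in rows:
            for name in row:
                if name not in columns:
                    columns.append(name)
    # Assumption: a column is multivalued if any row holds a list in it.
    multivalued = {
        name
        for row in rows
        for name, value in row.items()
        if isinstance(value, list)
    }
    # Multivalued columns are skipped for comparison only; rows keep them.
    return [name for name in columns if name not in multivalued]


rows = [
    {"id": 1, "name": "ann", "tags": ["a", "b"]},
    {"id": 1, "name": "ann", "tags": ["c"]},
    {"id": 2, "name": "bob", "tags": []},
]
compare_on = key_columns(rows)
# One comparison key per row; the first two rows share a key despite
# having different "tags" values.
keys = [tuple(row[name] for name in compare_on) for row in rows]
```

In this sketch the first two rows count as duplicates of each other because the list-valued `tags` column plays no part in the comparison.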
Which of a duplicate set of rows to keep in the result. Specifically, whether to keep the first or last row amongst the duplicates.
Values must be one of the following:
first
last
If true, inverts the row selection, i.e. only rows that are duplicates (in the selected columns) will be included in the resulting dataset.
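The interaction between the first/last choice and the inverted selection can be sketched in plain Python. This is an illustration under stated assumptions, not the step's actual code: the parameter names `columns`, `keep` and `invert` are hypothetical, and inversion is assumed to return exactly the rows that would otherwise have been dropped.

```python
def filter_duplicates(rows, columns, keep="first", invert=False):
    """Sketch of duplicate filtering on a list of dict rows."""
    if keep not in ("first", "last"):
        raise ValueError("keep must be 'first' or 'last'")
    # Map each duplicate key to the index of the row to retain.
    retained = {}
    for index, row in enumerate(rows):
        key = tuple(row[name] for name in columns)
        # For "last", later rows overwrite earlier ones; for "first",
        # only the first occurrence of a key is recorded.
        if keep == "last" or key not in retained:
            retained[key] = index
    kept = set(retained.values())
    # Assumption: inverting returns the rows that would otherwise be dropped.
    return [row for i, row in enumerate(rows) if (i in kept) != invert]


rows = [
    {"id": 1, "v": "a"},
    {"id": 2, "v": "b"},
    {"id": 1, "v": "c"},
]
first = filter_duplicates(rows, ["id"], keep="first")
last = filter_duplicates(rows, ["id"], keep="last")
dupes = filter_duplicates(rows, ["id"], keep="first", invert=True)
```

Here `first` retains the rows with `v` equal to `a` and `b`, `last` retains `b` and `c`, and `dupes` contains only the row with `v` equal to `c`. The original row order is preserved in every case.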
Optional sorting of rows before de-duplication, e.g. when which duplicate counts as first or last depends on other variables. If not configured, no sorting is performed.
Sort column name(s). These column(s) will be used to sort the dataset before de-duplication (if multiple, in specified order).
Whether to sort in ascending order (or in descending order if false).
If an array, it must have the same length as by and specify the sort order for each column.
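The sorting options can be sketched as follows. This is only an illustration: `by` is the name used above, while the function name `sort_rows`, the parameter name `ascending`, and its default of true are assumptions made for the example.

```python
def sort_rows(rows, by, ascending=True):
    """Sort dict rows by one or more columns, each with its own direction."""
    if isinstance(by, str):
        by = [by]
    # A single boolean applies to every sort column.
    if isinstance(ascending, bool):
        ascending = [ascending] * len(by)
    if len(ascending) != len(by):
        raise ValueError("ascending must have the same length as by")
    result = list(rows)
    # Python's sort is stable, so sorting from the last column to the first
    # yields an ordering by the first column, then the second, and so on.
    for name, asc in reversed(list(zip(by, ascending))):
        result.sort(key=lambda row: row[name], reverse=not asc)
    return result


rows = [
    {"id": 1, "ts": 3},
    {"id": 2, "ts": 1},
    {"id": 1, "ts": 5},
]
# Ascending by id, then descending by ts within each id.
ordered = sort_rows(rows, by=["id", "ts"], ascending=[True, False])
```

After such a sort, keeping the first row of each set of duplicates on `id` would retain the row with the largest `ts` for each `id`.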