Filter containing¶
Filter rows containing any or all of a number of specified values.
Includes or excludes rows of the input datset based on the values of a selected text or list column. Depending on the configuration, if the column contains any or all of the specified values, the corresponding rows will be kept or dropped in the output dataset.
"Containment" here means texts in a text column containing one or more specified substrings (words), or lists in a list column containing one or more elements matching the specified values. See below for illustrative examples.
Usage¶
The following are the step's expected inputs and outputs and their specific types.
filter_containing(ds_in: dataset, {"param": value}) -> (ds_out: dataset)
where the object {"param": value}
is optional in most cases and if present may contain any of the parameters described in the
corresponding section below.
Example¶
E.g., to keep only those rows whose values in the "address" column contain the text string "Madrid":
filter_containing(ds, {"column": "address", "values": ["Madrid"]}) -> (ds_filtered)
More examples
Or, given a dataset with the column "jobs", containing lists of one or more job categories in each row, to keep only those rows where the list includes the word "journalist" (ignoring the letter case, i.e. upper or lower case):
filter_containing(ds, {
"column": "jobs",
"values": ["journalist"],
"case_sensitive": false
}) -> (ds_filtered)
Inputs¶
ds_in: dataset
An input dataset to filter.
Outputs¶
ds_out: dataset
A dataset containing the same columns as the input dataset, but including or excluding the matched rows.
Parameters¶
column: string
Name of column to be matched against the specified values
values: number | string | array[number | string]
Values to be matched in each row to decide its inclusion or exclusion. May be a single value or a list of values to be matched.
Example parameter values:
"the"
["the", "cat"]
2
[2, 3]
exclude: boolean = False
If true
, matching rows will be excluded from the output dataset. I.e., only rows not containing the specified values will be returned.
contains_all: boolean = False
Rows must contain all specified value to pass filter, rather than any.
case_sensitive: boolean = True
Text values must match case to pass filter.