Filter containing

Filter rows containing any or all of a number of specified values.

Includes or excludes rows of the input dataset based on the values of a selected text or list column. Depending on the configuration, rows whose column value contains any or all of the specified values are either kept in or dropped from the output dataset.

"Containment" here means texts in a text column containing one or more specified substrings (words), or lists in a list column containing one or more elements matching the specified values. See below for illustrative examples.

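As a rough illustration of the two kinds of containment, here is a minimal Python sketch. It is not the step's actual implementation: the helper name `contains` is hypothetical, and the sketch assumes plain substring matching for text and exact equality for list elements.

```python
def contains(cell, value):
    """Return True if `cell` contains `value` (assumed semantics, see lead-in)."""
    if isinstance(cell, str):
        # Text column: assume a plain substring test.
        return str(value) in cell
    # List column: assume an element must equal the value exactly.
    return value in cell


print(contains("Gran Via 1, Madrid", "Madrid"))          # True
print(contains(["journalist", "editor"], "journalist"))  # True
print(contains(["journalist", "editor"], "journal"))     # False
```

Under these assumptions a partial word matches inside a text, but not against a list element.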
Usage


The following are the step's expected inputs and outputs and their specific types.

Step signature
filter_containing(ds_in: dataset, {
    "param": value
}) -> (ds_out: dataset)

where the object {"param": value} is optional in most cases and if present may contain any of the parameters described in the corresponding section below.

Example

E.g., to keep only those rows whose values in the "address" column contain the text string "Madrid":

Example call (in recipe editor)
filter_containing(ds, {"column": "address", "values": ["Madrid"]}) -> (ds_filtered)
More examples

Or, given a dataset with the column "jobs", containing lists of one or more job categories in each row, to keep only those rows where the list includes the word "journalist" (ignoring the letter case, i.e. upper or lower case):

Example call (in recipe editor)
filter_containing(ds, {
  "column": "jobs",
  "values": ["journalist"],
  "case_sensitive": false
}) -> (ds_filtered)

Inputs


ds_in: dataset

An input dataset to filter.

Outputs


ds_out: dataset

A dataset with the same columns as the input dataset, containing only the rows that pass the filter: the matching rows by default, or the non-matching rows if exclude is true.

Parameters


column: string

Name of the text or list column to be matched against the specified values.


values: number | string | array

Values to be matched in each row to decide its inclusion or exclusion. May be a single value or a list of values to be matched.

Example parameter values:

  • "the"
  • ["the", "cat"]
  • 2
  • [2, 3]
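Since the parameter accepts either a single value or a list, a filter would typically normalize it to a list first. The following Python sketch shows one way to do that; the helper name `as_value_list` is hypothetical and not part of the step's interface.

```python
def as_value_list(values):
    """Wrap a single value in a list; return a copy of an existing list."""
    if isinstance(values, (list, tuple)):
        return list(values)
    return [values]


print(as_value_list("the"))           # ['the']
print(as_value_list(["the", "cat"]))  # ['the', 'cat']
print(as_value_list(2))               # [2]
```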

exclude: boolean = False

If true, matching rows will be excluded from the output dataset. I.e., only rows not containing the specified values will be returned.


contains_all: boolean = False

If true, rows must contain all of the specified values to pass the filter, rather than any one of them.
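The difference between "any" and "all" can be sketched in Python as follows. This is an illustration under assumptions, not the step's implementation: `row_matches` is a hypothetical helper, `values` is assumed to be a list already, and for a text cell the values are assumed to be strings.

```python
def row_matches(cell, values, contains_all=False):
    """Check a text or list cell against a list of values."""
    # `in` is a substring test for strings and a membership test for lists.
    hits = [v in cell for v in values]
    return all(hits) if contains_all else any(hits)


text = "the cat sat on the mat"
print(row_matches(text, ["the", "dog"]))                     # True
print(row_matches(text, ["the", "dog"], contains_all=True))  # False
print(row_matches(text, ["the", "cat"], contains_all=True))  # True
```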


case_sensitive: boolean = True

If true, text values must match in letter case (upper or lower) to pass the filter. Set to false to ignore case.
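To show how the parameters interact, here is a small Python stand-in for the step operating on a list of dictionaries. It is a sketch only: the function name `filter_containing_sketch` and the row representation are assumptions, text matching is assumed to be a plain substring test, and case-insensitive matching is assumed to work by lower-casing both sides.

```python
def filter_containing_sketch(rows, column, values, exclude=False,
                             contains_all=False, case_sensitive=True):
    """Keep (or drop) rows whose `column` contains any/all of `values`."""
    if not isinstance(values, (list, tuple)):
        values = [values]  # a single value is treated as a one-element list

    def norm(x):
        # Lower-case strings only when case is to be ignored.
        return x.lower() if isinstance(x, str) and not case_sensitive else x

    wanted = [norm(v) for v in values]
    kept = []
    for row in rows:
        cell = row[column]
        if isinstance(cell, str):
            haystack = norm(cell)
            hits = [str(v) in haystack for v in wanted]  # substring test
        else:
            items = [norm(x) for x in cell]
            hits = [v in items for v in wanted]          # element equality
        matched = all(hits) if contains_all else any(hits)
        if matched != exclude:  # keep matches, or non-matches when excluding
            kept.append(row)
    return kept


rows = [{"jobs": ["Journalist", "Editor"]}, {"jobs": ["Chef"]}]
print(filter_containing_sketch(rows, "jobs", ["journalist"], case_sensitive=False))
print(filter_containing_sketch(rows, "jobs", ["journalist"], exclude=True))
```

With the default case-sensitive matching, "journalist" does not match "Journalist", so the second call with exclude set keeps both rows.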