
Filter sample

Randomly sample the dataset, optionally within groups (can be used to balance a dataset).

If you request more rows than the dataset contains, the step returns the original dataset unchanged.

Usage


The following are the step's expected inputs and outputs and their specific types.

Step signature
filter_sample(ds_in: dataset, {
    "param": value
}) -> (ds_out: dataset)

where the object {"param": value} is optional in most cases and, if present, may contain any of the parameters described in the corresponding section below.

Example

This draws a sample of 12,000 random rows from the original dataset:

Example call (in recipe editor)
filter_sample(ds, {"n_samples": 12000}) -> (ds_sampled)
More examples

In the next example we keep only a random half of the dataset:

Example call (in recipe editor)
filter_sample(ds, {"n_samples": 0.5}) -> (ds_sampled)

And this draws a sample of 500 rows from each department identified in the original dataset (or all of a department's rows if it has fewer than 500):

Example call (in recipe editor)
filter_sample(ds, {"n_samples": 500, "by": "department"}) -> (ds_sampled)

Inputs


ds_in: dataset

An input dataset to filter.

Outputs


ds_out: dataset

A new dataset containing a random sample of the original rows.

Parameters


n_samples: number | integer

Number of rows to sample. How many random rows to pick from the original dataset (without replacement). A value greater than 1 is interpreted as a count of desired rows. A value smaller than 1 is interpreted as a proportion of the entire dataset.
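
To make the count-versus-proportion rule concrete, here is a minimal Python sketch of the behaviour described above. It is an illustration only, not the step's actual implementation: the helper names `resolve_sample_size` and `sample_rows` are hypothetical, rounding a proportion down is an assumption, and so is treating a value of exactly 1 as a count (the description does not cover that case).

```python
import random


def resolve_sample_size(n_samples, n_rows):
    """Translate n_samples into a concrete row count for a dataset of n_rows rows."""
    if n_samples < 1:
        # Proportion of the dataset; rounding down is an assumption of this sketch.
        return int(n_rows * n_samples)
    # Count of rows. Treating exactly 1 as a count is an assumption: the
    # description only covers values greater than or smaller than 1.
    return min(int(n_samples), n_rows)


def sample_rows(rows, n_samples, seed=None):
    """Pick rows at random without replacement."""
    k = resolve_sample_size(n_samples, len(rows))
    if k >= len(rows):
        # Asking for more rows than exist returns the original data unchanged.
        return list(rows)
    return random.Random(seed).sample(rows, k)


rows = list(range(100))
half = sample_rows(rows, 0.5, seed=1)      # proportion: 50 of the 100 rows
everything = sample_rows(rows, 12000)      # count larger than the data: all rows
print(len(half), len(everything))          # 50 100
```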


by: string

Sample independently in these groups. If a column is specified here, the sampling will be applied separately within each group defined by the unique values in this column. Combining this with a count of rows to pick (rather than a proportion) allows this step to balance the dataset, leading to an (approximately) equal number of rows within each group.
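
The balancing effect can be sketched in plain Python as follows. This is an illustration of the described semantics under stated assumptions, not the step's implementation: the function `sample_by_group`, the list-of-dicts row format, and the example data are all hypothetical.

```python
import random
from collections import defaultdict


def sample_by_group(rows, by, n, seed=None):
    """Sample up to n rows (a count) independently within each group of rows[by]."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[row[by]].append(row)
    sampled = []
    for group_rows in groups.values():
        # Groups smaller than n are kept whole, which is why the result is
        # only approximately balanced.
        k = min(n, len(group_rows))
        sampled.extend(rng.sample(group_rows, k))
    return sampled


# Hypothetical data: 1000 rows for one department, 300 for another.
rows = (
    [{"department": "sales", "id": i} for i in range(1000)]
    + [{"department": "legal", "id": i} for i in range(300)]
)
balanced = sample_by_group(rows, by="department", n=500, seed=0)

counts = {}
for row in balanced:
    counts[row["department"]] = counts.get(row["department"], 0) + 1
print(counts)  # {'sales': 500, 'legal': 300}
```

With a proportion instead of a count, each group would shrink by the same factor, so the original imbalance between groups would remain.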


seed: number | null

A value used to initialize the random number generator, making it deterministic (reproducible).
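
As a rough illustration of what a fixed seed buys you, the following Python sketch uses the standard library's `random.Random`. It assumes, without confirmation from the text above, that a null seed means the generator is initialized unpredictably, so repeated runs may return different samples.

```python
import random

rows = list(range(1000))

# Same seed: the generator starts from the same state, so the sample repeats.
first = random.Random(42).sample(rows, 10)
second = random.Random(42).sample(rows, 10)
print(first == second)  # True

# No seed (None): Python seeds the generator from system entropy or the clock,
# so the sample will generally differ from run to run.
unseeded = random.Random(None).sample(rows, 10)
```

Fixing the seed is useful when a downstream result must be reproducible, for example when comparing two versions of a recipe on the same sample.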