Filter sample¶
Randomly sample the dataset, optionally within groups (can be used to balance a dataset).
If you request more rows than the dataset contains, the step returns the original dataset instead.
Usage¶
The following are the step's expected inputs and outputs and their specific types.
filter_sample(ds_in: dataset, {"param": value}) -> (ds_out: dataset)
where the object {"param": value} is optional in most cases and, if present, may contain any of the parameters described in the Parameters section below.
Example¶
This draws a sample of 12,000 random rows from the original dataset:
filter_sample(ds, {"n_samples": 12000}) -> (ds_sampled)
More examples
In the next example we keep only a random half of the dataset:
filter_sample(ds, {"n_samples": 0.5}) -> (ds_sampled)
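The count and proportion forms of n_samples can be pictured with a small Python sketch. This is not the step's actual implementation: the helper names resolve_sample_size and sample_rows are hypothetical, and both the rounding of proportions and the handling of a value of exactly 1 are assumptions made only for illustration.

```python
import random

def resolve_sample_size(n_samples, n_rows):
    """Turn n_samples into a concrete row count (hypothetical helper)."""
    if n_samples < 1:
        # Interpreted as a proportion; rounding down is an assumption.
        count = int(n_rows * n_samples)
    else:
        # Interpreted as a count; treating exactly 1 as a count is an assumption.
        count = int(n_samples)
    # A request larger than the dataset falls back to all rows.
    return min(count, n_rows)

def sample_rows(rows, n_samples, seed=None):
    """Pick rows at random, without replacement."""
    rng = random.Random(seed)
    k = resolve_sample_size(n_samples, len(rows))
    if k == len(rows):
        return list(rows)  # nothing to drop: return the rows unchanged
    return rng.sample(rows, k)
```

Under these assumptions, resolve_sample_size(12000, 20000) gives 12000 and resolve_sample_size(0.5, 1000) gives 500, while resolve_sample_size(5000, 1000) is capped at 1000.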
And this draws a sample of 500 rows from each department identified in the original dataset (or all of a department's rows if it has fewer than 500):
filter_sample(ds, {"n_samples": 500, "by": "department"}) -> (ds_sampled)
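To see why a per-group count balances a dataset, the following Python sketch mimics the behaviour on a list of dictionaries. It is an illustration under assumptions, not the step's code: sample_by_group is a hypothetical helper, the toy rows are made up, and only the count form of n_samples is handled.

```python
import random
from collections import defaultdict

def sample_by_group(rows, by, n_samples, seed=None):
    """Sample up to n_samples rows from each group (hypothetical helper)."""
    rng = random.Random(seed)
    # Collect rows under the value they have in the grouping column.
    groups = defaultdict(list)
    for row in rows:
        groups[row[by]].append(row)
    sampled = []
    for members in groups.values():
        # Small groups contribute all of their rows.
        k = min(n_samples, len(members))
        sampled.extend(rng.sample(members, k))
    return sampled

# Toy data: one large and one small department.
rows = (
    [{"department": "sales"} for _ in range(900)]
    + [{"department": "legal"} for _ in range(120)]
)
balanced = sample_by_group(rows, "department", 500, seed=0)
```

Here the large group is cut down to 500 rows while the small group keeps all 120, which is why the resulting balance is only approximate.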
Inputs¶
ds_in: dataset
An input dataset to filter.
Outputs¶
ds_out: dataset
A new dataset containing a random sample of the original rows.
Parameters¶
n_samples: number | integer
Number of rows to sample. How many random rows to pick from the original dataset (without replacement). A value greater than 1 is interpreted as the number of rows to pick; a value smaller than 1 is interpreted as a proportion of the entire dataset.
by: string
Sample independently within groups. If a column is specified here, the sampling will be applied separately within each group defined by the unique values in this column. Combining this with a count of rows to pick (rather than a proportion) allows this step to balance the dataset, leading to an (approximately) equal number of rows within each group.
seed: number | null
A value used to initialize the random number generator, making it deterministic (reproducible).
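The effect of a fixed seed can be illustrated with Python's standard random module. This only demonstrates the general idea of seeding a random number generator; which generator the step uses internally is not stated here, and the seed value 42 is arbitrary.

```python
import random

rows = list(range(1000))

# Two generators initialised with the same seed draw the same sample.
first = random.Random(42).sample(rows, 5)
second = random.Random(42).sample(rows, 5)

# Without a seed, the generator is initialised from a varying source,
# so repeated runs will generally produce different samples.
unseeded = random.Random().sample(rows, 5)

print(first == second)  # True
```

Setting seed is therefore useful whenever a sampled dataset needs to be reproduced exactly, for example when re-running the same analysis.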