filter_sample
Randomly sample the dataset, optionally within groups (can be used to balance a dataset).
If you request a number of rows greater than the dataframe length, it will return the original dataframe instead.
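The size rule above can be sketched in plain Python. This is only an illustration under stated assumptions, not the step's actual implementation: the helper name `sample_rows` and the use of a plain list to stand in for the dataset are hypothetical.

```python
import random

def sample_rows(rows, n, seed=None):
    """Hypothetical helper: pick n random rows without replacement."""
    if n > len(rows):
        # Asking for more rows than exist returns the original data unchanged.
        return rows
    rng = random.Random(seed)
    return rng.sample(rows, n)

rows = list(range(10))                 # stand-in for a 10-row dataset
small = sample_rows(rows, 3, seed=0)   # 3 distinct rows
everything = sample_rows(rows, 50)     # request exceeds length: original returned
```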
Usage
The following examples show how the step can be used in a recipe.
Examples
This draws a sample of 12,000 random rows from the original dataset:
In the next example we keep only a random half of the dataset:
And this draws a sample of 500 rows from each department identified in the original dataset (or the maximum if there are fewer than 500 rows for a particular department):
General syntax for using the step in a recipe. It shows the inputs the step expects and the outputs it produces. For further details see the sections below.
Inputs & Outputs
The following are the inputs expected by the step and the outputs it produces. These are generally columns (ds.first_name), datasets (ds or ds[["first_name", "last_name"]]) or models (referenced by name, e.g. "churn-clf").
Inputs
An input dataset to filter.
Outputs
A new dataset containing a random sample of the original rows.
Configuration
The following parameters can be used to configure the behaviour of the step by including them in a json object as the last “input” to the step, i.e. step(..., {"param": "value", ...}) -> (output).
Parameters
Number of rows to sample. How many random rows to pick from the original dataset (without replacement). If the value is greater than 1, it will be interpreted as a count of desired rows. If it is smaller than 1, it will be interpreted as a proportion of the entire dataset.
Sample independently in these groups. If a column is specified here, the sampling will be applied separately within each group defined by the unique values in this column. Combining this with a count of rows to pick (rather than a proportion) allows this step to balance the dataset, leading to an (approximately) equal number of rows within each group. Groups with fewer rows than the requested count keep all of their rows.
A value used to initialize the random number generator, making it deterministic (reproducible).
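How the three parameters interact can be sketched in plain Python. This is a minimal illustration, not the step's implementation. The names `resolve_size` and `sample_by_group` are hypothetical, and several details are assumptions the text above does not settle: a value of exactly 1 is treated as a count, proportions are truncated to whole rows, and a proportion combined with groups is applied within each group.

```python
import random
from collections import defaultdict

def resolve_size(n, total):
    """Interpret n as a proportion if below 1, otherwise as a row count."""
    if n < 1:
        return int(total * n)      # assumption: truncate to whole rows
    return min(int(n), total)      # never more rows than are available

def sample_by_group(rows, n, group_key=None, seed=None):
    """Hypothetical helper: sample rows, optionally within each group."""
    rng = random.Random(seed)      # a fixed seed makes the result reproducible
    if group_key is None:
        return rng.sample(rows, resolve_size(n, len(rows)))
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(row)
    sampled = []
    for members in groups.values():
        # Sample each group separately, without replacement.
        sampled.extend(rng.sample(members, resolve_size(n, len(members))))
    return sampled

rows = [{"department": "sales", "id": i} for i in range(8)]
rows += [{"department": "hr", "id": i} for i in range(3)]

balanced = sample_by_group(rows, 5, group_key="department", seed=42)
half = sample_by_group(rows, 0.5, seed=42)
```

Here `balanced` holds 5 rows from "sales" and all 3 rows from "hr", since that group has fewer rows than requested, while `half` holds 5 of the 11 rows. Calling either function again with the same seed yields the same sample.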