filter_sample
Randomly sample the dataset, optionally within groups (can be used to balance a dataset).
If you request a number of rows greater than the dataframe length, it will return the original dataframe instead.
Usage
The following examples show how the step can be used in a recipe.
This draws a sample of 12.000 random rows from the original dataset:
Inputs & Outputs
The following are the inputs expected by the step and the outputs it produces. These are generally
columns (ds.first_name
), datasets (ds
or ds[["first_name", "last_name"]]
) or models (referenced
by name e.g. "churn-clf"
).
Configuration
The following parameters can be used to configure the behaviour of the step by including them in
a json object as the last “input” to the step, i.e. step(..., {"param": "value", ...}) -> (output)
.
Number of rows to sample. How many random rows to pick from the original dataset (without replacement). If the value is greater than 1, it will be interpreted as a count of desired rows. If it is smaller than 1, it will be interpreted as a proportion of the entire dataset.
Sample independently in these groups. If a column is specified here, the sampling will be applied separately within each group defined by the unique values in this column. Combining this with a count of rows to pick (rather than a proportion), allows this step to balance the dataset, leading to an (approximately) equal number of rows within each group.
A value used to initialize the random number generator, making it deterministic (reproducible).
Was this page helpful?