filter_sample

Draws a random sample of rows from a dataset. If you request more rows than the dataset contains, the step returns the original dataset instead.

Usage

The following examples show how the step can be used in a recipe.

Examples

This draws a sample of 12,000 random rows from the original dataset:

filter_sample(ds, {"n_samples": 12000}) -> (ds_sampled)

In the next example we keep only a random half of the dataset:

filter_sample(ds, {"n_samples": 0.5}) -> (ds_sampled)

And this draws a sample of 500 rows from each department identified in the original dataset (or all available rows if a particular department has fewer than 500):

filter_sample(ds, {"n_samples": 500, "by": "department"}) -> (ds_sampled)

General syntax for using the step in a recipe. It shows the inputs the step expects to receive and the outputs it will produce. For further details see the sections below.

filter_sample(ds_in: dataset, {
    "param": value,
    ...
}) -> (ds_out: dataset)

Inputs & Outputs

The following are the inputs expected by the step and the outputs it produces. These are generally columns (ds.first_name), datasets (ds or ds[["first_name", "last_name"]]) or models (referenced by name e.g. "churn-clf").

Inputs

Outputs

Configuration

The following parameters can be used to configure the behaviour of the step by including them in a JSON object as the last “input” to the step, i.e. step(..., {"param": "value", ...}) -> (output).

Parameters

n_samples

[number, integer]

required

Number of rows to sample. How many random rows to pick from the original dataset (without replacement). If the value is greater than 1, it will be interpreted as a count of desired rows. If it is smaller than 1, it will be interpreted as a proportion of the entire dataset.

Options

number

Values must be in the following range:

0 < n_samples < 1
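The rule for interpreting n_samples can be sketched in plain Python. This is only an illustration of the behaviour described above, not the step's actual implementation: the helper names resolve_sample_size and sample_rows are hypothetical, the rounding of proportions is an assumption, and a value of exactly 1 (which the description does not cover) is treated here as a count.

```python
import random


def resolve_sample_size(n_samples, n_rows):
    """Translate n_samples into a concrete row count (hypothetical helper)."""
    if n_samples <= 0:
        raise ValueError("n_samples must be positive")
    if n_samples < 1:
        # Proportion of the dataset. Rounding to the nearest row is an assumption.
        return round(n_samples * n_rows)
    # Count of rows. Exactly 1 is treated as a count here (assumption).
    return min(int(n_samples), n_rows)


def sample_rows(rows, n_samples, seed=None):
    """Pick random rows without replacement (illustrative sketch)."""
    k = resolve_sample_size(n_samples, len(rows))
    if k >= len(rows):
        # Asking for more rows than exist returns the original data unchanged.
        return list(rows)
    return random.Random(seed).sample(rows, k)


rows = list(range(100))
half = sample_rows(rows, 0.5)          # proportion: 50 rows
twelve = sample_rows(rows, 12)         # count: 12 rows
everything = sample_rows(rows, 12000)  # more than available: all 100 rows
```

In this sketch, a request larger than the dataset keeps the original row order, mirroring the statement that the original dataset is returned.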

by

string (ds_in.column)

Sample independently in these groups. If a column is specified here, the sampling will be applied separately within each group defined by the unique values in this column. Combining this with a count of rows to pick (rather than a proportion) allows this step to balance the dataset, leading to an (approximately) equal number of rows within each group.
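The balancing effect of grouped sampling with a fixed count can be illustrated with a short Python sketch. It is an approximation of the described behaviour under stated assumptions, not the step's own code: sample_by_group is a hypothetical helper, rows are modelled as dictionaries, and only the count form of n_samples is handled.

```python
import random
from collections import Counter, defaultdict


def sample_by_group(rows, by, n, seed=None):
    """Sample up to n rows from each group of `by` (hypothetical helper)."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[row[by]].append(row)

    sampled = []
    for members in groups.values():
        # Groups with fewer than n rows contribute all of their rows.
        k = min(n, len(members))
        sampled.extend(rng.sample(members, k))
    return sampled


# An unbalanced toy dataset: 800 sales rows, 120 legal rows.
rows = [{"id": i, "department": "sales"} for i in range(800)]
rows += [{"id": 800 + i, "department": "legal"} for i in range(120)]

balanced = sample_by_group(rows, "department", 500, seed=0)
counts = Counter(row["department"] for row in balanced)
print(dict(counts))  # {'sales': 500, 'legal': 120}
```

The large group is cut down to the requested count while the small group is kept whole, which is why the result is only approximately balanced.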

seed

[number, null]

A value used to initialize the random number generator, making it deterministic (reproducible).
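What a fixed seed buys you can be shown with Python's standard random module. The step's actual random number generator is not specified here, so this is only a stand-in sketch, and sample_with_seed is a hypothetical helper.

```python
import random


def sample_with_seed(rows, k, seed=None):
    """Draw k rows without replacement from a generator initialised with seed."""
    rng = random.Random(seed)  # seed=None gives a different sample on each run
    return rng.sample(rows, k)


rows = list(range(1000))

# The same seed yields the same sample on every call.
first = sample_with_seed(rows, 5, seed=42)
second = sample_with_seed(rows, 5, seed=42)
print(first == second)  # True
```

Leaving the seed unset (null in the step's configuration) corresponds to seed=None in this sketch, where repeated runs are free to return different rows.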
