Replace missing

fast step • missing values • NaN

Replace missing values (NaNs) with either a specified constant value or the result of a given function.

Usage


The following are the step's expected inputs and outputs and their specific types.

Step signature
replace_missing(input: column, {
    "param": value
}) -> (output: column)

where the object {"param": value} is optional in most cases and, if present, may contain any of the parameters described in the Parameters section below.

Example

The following configuration fills all missing values with the string "unknown":

Example call (in recipe editor)
replace_missing(ds.occupation, {"value": "unknown"}) -> (ds.occupation_filled)

More examples

The following configuration fills all missing values with the maximum of the column:

Example call (in recipe editor)
replace_missing(ds.numbers, {"function": "max"}) -> (ds.numbers_filled)

Inputs


input: column

An arbitrary column, potentially containing missing values (NaN).

Outputs


output: column

A copy of the input column where missing values have been replaced by the specified constant or by the result of the chosen function.

Parameters


value: number | string | array

The constant used to fill in missing values (normally of the same type as the original column). Can be a scalar (number or string) or an array of numbers or strings. If an array is passed, it should contain at least one item.
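
As a rough illustration of constant filling, the following plain-Python sketch mimics the effect on a list of values. It is not the step's actual implementation: the helper name fill_with_constant is hypothetical, and treating both None and float NaN as "missing" is an assumption of this sketch. Only the scalar case is shown.

```python
import math


def fill_with_constant(column, value):
    """Return a copy of `column` with missing entries replaced by `value`.

    Assumption of this sketch: None and float NaN both count as missing.
    """
    def is_missing(x):
        return x is None or (isinstance(x, float) and math.isnan(x))

    return [value if is_missing(x) else x for x in column]


# Mirrors the "unknown" example above, using made-up data.
occupation = ["nurse", None, "teacher", float("nan")]
occupation_filled = fill_with_constant(occupation, "unknown")
```

The input list is left unchanged, which matches the step's description of its output as a copy of the input column.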


function: string

Fill missing values with the result of a given function. The following functions can be used:

  • max: substitutes the NaN values with the maximum value of a numerical column.
  • min: substitutes the NaN values with the minimum value of a numerical column.
  • mean: substitutes the NaN values with the mean of a numerical column.
  • median: substitutes the NaN values with the median of a numerical column.
  • least_freq: substitutes the NaN values with the least frequent value of a column.
  • most_freq: substitutes the NaN values with the most frequent value of a column.
  • alphabetical_first: substitutes the NaN values with the alphabetically first value of a categorical column.
  • alphabetical_last: substitutes the NaN values with the alphabetically last value of a categorical column.
  • bfill: for each NaN value, uses the next valid observation to fill it.
  • ffill: for each NaN value, propagates the last valid observation forward to fill it.

Must be one of: "max", "min", "mean", "median", "least_freq", "most_freq", "alphabetical_first", "alphabetical_last", "bfill", "ffill"
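
The semantics of the functions listed above can be sketched in plain Python. This is an illustrative approximation under stated assumptions, not the step's implementation: the helper fill_with_function is hypothetical, None and float NaN are both treated as missing, ties for most_freq and least_freq are broken by first occurrence, and leading (ffill) or trailing (bfill) gaps with no valid neighbour are left missing. The actual step may behave differently in these edge cases.

```python
import math
from collections import Counter
from statistics import mean, median


def is_missing(x):
    """Assumption of this sketch: None and float NaN count as missing."""
    return x is None or (isinstance(x, float) and math.isnan(x))


def fill_with_function(column, function):
    """Return a copy of `column` with missing entries filled per `function`."""
    column = list(column)

    if function == "ffill":
        out, last, have_last = [], None, False
        for x in column:
            if is_missing(x):
                # Leading gaps have no previous observation and stay missing.
                out.append(last if have_last else x)
            else:
                last, have_last = x, True
                out.append(x)
        return out

    if function == "bfill":
        # Backward fill is forward fill applied to the reversed column.
        return fill_with_function(column[::-1], "ffill")[::-1]

    present = [x for x in column if not is_missing(x)]
    if not present:
        return column  # nothing to derive a fill value from

    counts = Counter(present)
    aggregates = {
        "max": lambda: max(present),
        "min": lambda: min(present),
        "mean": lambda: mean(present),
        "median": lambda: median(present),
        "most_freq": lambda: counts.most_common(1)[0][0],
        "least_freq": lambda: min(counts, key=counts.get),
        "alphabetical_first": lambda: min(present),
        "alphabetical_last": lambda: max(present),
    }
    fill = aggregates[function]()
    return [fill if is_missing(x) else x for x in column]


numbers = [3.0, float("nan"), 1.0, None, 5.0]
numbers_filled = fill_with_function(numbers, "max")

series = [None, 1, None, None, 4, None]
forward = fill_with_function(series, "ffill")
backward = fill_with_function(series, "bfill")
```

Note the difference between the two groups: the aggregate functions compute a single fill value from the non-missing entries and use it everywhere, while bfill and ffill pick a different value for each gap depending on its neighbours.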