Skip to content

Join

Join two datasets on their row indexes or on values in specified columns.

I.e., the equivalent of a database join of two tables. Adds the columns from the second dataset (ds_right) to the first (ds_left). If the two datasets contain columns with identical names (other than those used to perform the join), configurable suffixes will be appended to their names in the resulting dataset (see suffixes parameter below).

The rows included in the result depend on the kind of join (see the how parameter below). Depending on whether it's a left, right, inner, or outer-join, may include rows from either dataset or both.

The join performed is always an equi-join, meaning that rows from the left are matched with rows from the right where their respective values in the join column (or indexes) are identical (e.g. where the value of column id on the left is equal to the value of column id on the right).

Also see Wikipedia's article on table joins to learn more about them.

Example

For example, to enrich a dataset containing employees with information about their department:

join(employees, departments, {
  "how": "left",
  "left": "department_id",
  "right": "id"
}) -> (employees_dep)

Usage

The following are the step's expected inputs and outputs and their specific types.

join(
    ds_left: dataset,
    ds_right: dataset, 
    {
        "param": value
    }
) -> (result: dataset)

where the object {"param": value} is optional in most cases and if present may contain any of the parameters described in the corresponding section below.

Inputs


ds_left: dataset

A dataset to join, referred to as "left" by the how parameter.


ds_right: dataset

A second dataset to be joined with the first, referred to as "right" by the how parameter.

Outputs


result: dataset

The result of the join. Contains the columns of both input datasets. Columns with identical names in both datasets (if not used as the column joined on) will have their names concatenated with a suffix.

Parameters


how: string = "left"

Type of join. Analogous to SQL joins of the same name:

  • inner: "inner join". Only keeps rows where values in the join column exist (match) in both dataset (effectively an intersection). Since it is guaranteed that each row has values from the left and right dataset, no new missing values (NaN) will be introduced.
  • left: "left (outer) join". Keep all rows from the left dataset, adding values from columns in the right dataset where rows match (in the join column). If no row on the right matches, the values of columns from the right will contain NaN values only.
  • right: "right (outer) join". Keep all rows from the right dataset, adding values from columns in the left dataset where rows match (in the join column). If no row on the left matches, the values of columns from the left will contain NaN values only.
  • outer: "full (outer) join". Keep all rows from both datasets. If a value in the join column doesn't match on either side, the values of columns from the non-matching side will be NaN.

The most typical scenario is probably: "Keep my primary dataset left and add whatever information you can from dataset right". If this is the case, you'll want a left-join.

Must be one of: "left", "right", "outer", "inner"


left: string | null

Column or index in the left dataset whose values will be matched against the right. To use the row index instead of a column use "_index_", null, or simply omit this parameter.

Must be one of: "_index_", None


right: string | null

Column or index in the right dataset whose values will be matched against the left. To use the row index instead of a column use "_index_", null, or simply omit this parameter.

Must be one of: "_index_", None


suffixes: array[string] = ['_x', '_y']

Column name suffixes. Will be appended to any original column (name) that occurs in both datasets (other than the join columns themselves).