> ## Documentation Index
> Fetch the complete documentation index at: https://docs.graphext.com/llms.txt
> Use this file to discover all available pages before exploring further.

# join

> Join two datasets on their row indexes or on values in specified columns. 

I.e., the equivalent of a [database join](https://en.wikipedia.org/wiki/Join_\(SQL\)) of two tables.
Adds the columns from the second dataset (`ds_right`) to the first (`ds_left`). If the two datasets contain columns
with identical names (other than those used to perform the join), configurable suffixes will be appended to their names
in the resulting dataset (see `suffixes` parameter below).

The rows included in the result depend on the kind of join (see the `how` parameter below). Depending on whether
it's a left, right, inner, or outer-join, may include rows from either dataset or both.

The join performed is always an [equi-join](https://en.wikipedia.org/wiki/Join_\(SQL\)#Equi-join), meaning that rows
from the left are matched with rows from the right where their respective values in the join column (or indexes)
are *identical* (e.g. where the value of column `id` on the left is equal to the value of column `id` on the right).

Also see [Wikipedia's article on table joins](https://en.wikipedia.org/wiki/Join_\(SQL\)) to learn more about them.

## Usage

The following example shows how the step can be used in a recipe.

<Accordion title="Examples" icon="code" defaultOpen="true">
  <Tabs>
    <Tab title="Example 1">
      For example, to enrich a dataset containing employees with information about their department:

      ```stan theme={null}
      join(employees, departments, {
        "how": "left",
        "left": "department_id",
        "right": "id"
      }) -> (employees_dep)
      ```
    </Tab>

    <Tab title="Signature">
      General syntax for using the step in a recipe. Shows the inputs and outputs the step is expected to receive and will produce respectively. For futher details see sections below.

      ```stan theme={null}
      join(ds_left: dataset, ds_right: dataset, {
          "param": value,
          ...
      }) -> (result: dataset)
      ```
    </Tab>
  </Tabs>
</Accordion>

## Inputs & Outputs

The following are the inputs expected by the step and the outputs it produces. These are generally
columns (`ds.first_name`), datasets (`ds` or `ds[["first_name", "last_name"]]`) or models (referenced
by name e.g. `"churn-clf"`).

<Accordion title="Inputs" icon="right-to-bracket">
  <ParamField path="ds_left" type="dataset" required>
    A dataset to join, referred to as "left" by the `how` parameter.
  </ParamField>

  <ParamField path="ds_right" type="dataset" required>
    A second dataset to be joined with the first, referred to as "right" by the `how` parameter.
  </ParamField>
</Accordion>

<Accordion title="Outputs" icon="right-from-bracket">
  <ParamField path="result" type="dataset" required>
    The result of the join. Contains the columns of both input datasets. Columns with identical names in both datasets
    (if not used as the column joined on) will have their names concatenated with a suffix.
  </ParamField>
</Accordion>

## Configuration

The following parameters can be used to configure the behaviour of the step by including them in
a json object as the last "input" to the step, i.e. `step(..., {"param": "value", ...}) -> (output)`.

<Accordion title="Parameters" defaultOpen="true" icon="sliders">
  <ParamField path="how" type="string" default="left">
    Type of join.
    Analogous to SQL joins of the same name:

    * inner: ["inner join"](https://en.wikipedia.org/wiki/Join_\(SQL\)#Inner_join). Only keeps rows where values in the
      join column exist (match) in both dataset (effectively an intersection). Since it is guaranteed that each row
      has values from the left and right dataset, no new missing values (NaN) will be introduced.
    * left: ["left (outer) join"](https://en.wikipedia.org/wiki/Join_\(SQL\)#Left_outer_join). Keep all rows from the
      *left* dataset, adding values from columns in the *right* dataset where rows match (in the join column). If no
      row on the right matches, the values of columns from the right will contain NaN values only.
    * right: ["right (outer) join"](https://en.wikipedia.org/wiki/Join_\(SQL\)#Right_outer_join). Keep all rows from the
      *right* dataset, adding values from columns in the *left* dataset where rows match (in the join column). If no
      row on the left matches, the values of columns from the left will contain NaN values only.
    * outer: ["full (outer) join"](https://en.wikipedia.org/wiki/Join_\(SQL\)#Full_outer_join). Keep all rows from both
      datasets. If a value in the join column doesn't match on either side, the values of columns from the non-matching
      side will be NaN.

    The most typical scenario is probably: "Keep my primary dataset `left` and add whatever information you can from
    dataset `right`". If this is the case, you'll want a `left`-join.

    Values must be one of the following:

    * `left`
    * `right`
    * `outer`
    * `inner`
  </ParamField>

  <ParamField path="left" type="[string, null]">
    Column or index in the left dataset whose values will be matched against the right.
    To use the row index instead of a column use `"_index_"`, `null`, or simply omit this parameter.

    <Accordion title="Options" />
  </ParamField>

  <ParamField path="right" type="[string, null]">
    Column or index in the right dataset whose values will be matched against the left.
    To use the row index instead of a column use `"_index_"`, `null`, or simply omit this parameter.

    <Accordion title="Options" />
  </ParamField>

  <ParamField path="suffixes" type="array[string]" default="['_x', '_y']">
    Column name suffixes.
    Will be appended to any original column (name) that occurs in both datasets (other than the join columns themselves).

    <Accordion title="Array items">
      <ParamField path="Item" type="string">
        Each item in array.
      </ParamField>
    </Accordion>
  </ParamField>
</Accordion>
