
# aggregate_list_items

> Group a dataset by elements in a column of lists and aggregate remaining columns using one or more predefined functions. 

This is essentially `aggregate` after "exploding" a column of lists such that each list item has its
own row. By default the step produces one row per unique list item, and two columns: the `count` of how many
times each list item was encountered, and a column `rows` recording the row numbers of the lists in which the
element was found (\[1,3,7] would mean an item was present in the lists of rows 1, 3 and 7). In addition,
predefined functions can be used to add further aggregations of the grouped input dataset.

For example, if a dataset contains texts already separated into lists of individual words, this step will create a
new dataset containing one row per word, a column containing each word's frequency (count) across all texts, and
another column of lists indicating in which rows the word was found.
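
To make the default output concrete, here is a minimal Python sketch of the explode-and-count logic described above. It is an illustration only, not the step's implementation: the function name `count_list_items` and the sample data are hypothetical, and rows are numbered from 0, which may differ from the step's own row numbering.

```python
from collections import defaultdict

def count_list_items(lists):
    """Return {item: {"count": int, "rows": [row numbers]}} for a column of lists."""
    result = defaultdict(lambda: {"count": 0, "rows": []})
    for row_number, items in enumerate(lists):
        for item in items:  # "explode": treat each list item as if it had its own row
            result[item]["count"] += 1
            result[item]["rows"].append(row_number)
    return dict(result)

# Hypothetical column of texts already split into words
words = [["the", "cat", "sat"], ["the", "dog"], ["cat", "and", "dog"]]
counts = count_list_items(words)
print(counts["the"])  # {'count': 2, 'rows': [0, 1]}
print(counts["cat"])  # {'count': 2, 'rows': [0, 2]}
```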

Optionally, if a grouping column is specified using the `"by"` parameter, otherwise identical items belonging to
different groups will be counted separately. If the dataset contains texts in different languages, for example, one
may not want to group all occurrences of the same word together irrespective of language: "angel"
in German means "fishing rod", "any" in Catalan means "year", and "burro" means
"butter" in Italian but "donkey" in Spanish. Using language as the grouping column keeps the word in each
language as a separate group.
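
The effect of a grouping column can be sketched in plain Python as counting `(group, item)` pairs instead of items alone. The function `count_items_by_group` and the sample data are hypothetical and only illustrate the idea; they are not the step's implementation.

```python
from collections import Counter

def count_items_by_group(lists, groups):
    """Count list items separately per group; keys are (group, item) pairs."""
    counts = Counter()
    for items, group in zip(lists, groups):
        for item in items:
            counts[(group, item)] += 1
    return counts

# Hypothetical word lists and the language of each text
words = [["burro", "e", "pane"], ["el", "burro"], ["burro", "fuso"]]
language = ["it", "es", "it"]

by_language = count_items_by_group(words, language)
print(by_language[("it", "burro")])  # 2
print(by_language[("es", "burro")])  # 1
```

Without the grouping key, "burro" would be counted three times as a single item.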

## Usage

The following example shows how the step can be used in a recipe.

<Accordion title="Examples" icon="code" defaultOpen="true">
  <Tabs>
    <Tab title="Example 1">
      The following example performs a simple word-count, returning a new dataset with one row per word and
      each word's frequency in the "count" column. The aggregation will be performed separately for each language.
      Also, for each word, a custom aggregation collects the dates of the texts in which the word was mentioned:

      ```stan theme={null}
      aggregate_list_items(ds_in, {
        "split_column": "words",
        "by": "language",
        "unique_rows": true,
        "aggregations": {
          "text_publication_date": {
            "mention_dates": {"func": "list"}
          }
        }
      }) -> (ds_out)
      ```
    </Tab>

    <Tab title="Signature">
      General syntax for using the step in a recipe. Shows the inputs the step expects to receive and the outputs it will produce. For further details see the sections below.

      ```stan theme={null}
      aggregate_list_items(ds_in: dataset, {
          "param": value,
          ...
      }) -> (ds_out: dataset)
      ```
    </Tab>
  </Tabs>
</Accordion>
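
As a rough illustration of what the recipe in Example 1 computes, the following Python sketch groups words by language and collects the publication dates per word. It is not the step's implementation: the sample rows and the function `word_mentions` are hypothetical, rows are numbered from 0 as an assumption, and because no word repeats within a row here, the `unique_rows` option makes no difference and is not modelled.

```python
from collections import defaultdict

# Hypothetical input rows with the three columns referenced in Example 1
texts = [
    {"words": ["angel", "see"], "language": "de", "text_publication_date": "2021-01-05"},
    {"words": ["angel", "wings"], "language": "en", "text_publication_date": "2021-02-10"},
    {"words": ["see", "angel"], "language": "de", "text_publication_date": "2021-03-15"},
]

def word_mentions(texts):
    """Group by (language, word); collect count, rows and a list of dates per group."""
    groups = defaultdict(lambda: {"count": 0, "rows": [], "mention_dates": []})
    for row_number, text in enumerate(texts):
        for word in text["words"]:
            group = groups[(text["language"], word)]
            group["count"] += 1
            group["rows"].append(row_number)
            # Counterpart of {"func": "list"} applied to text_publication_date
            group["mention_dates"].append(text["text_publication_date"])
    return dict(groups)

ds_out = word_mentions(texts)
print(ds_out[("de", "angel")])
# {'count': 2, 'rows': [0, 2], 'mention_dates': ['2021-01-05', '2021-03-15']}
print(ds_out[("en", "angel")])
# {'count': 1, 'rows': [1], 'mention_dates': ['2021-02-10']}
```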

## Inputs & Outputs

The following are the inputs expected by the step and the outputs it produces. These are generally
columns (`ds.first_name`), datasets (`ds` or `ds[["first_name", "last_name"]]`) or models (referenced
by name e.g. `"churn-clf"`).

<Accordion title="Inputs" icon="right-to-bracket">
  <ParamField path="ds_in" type="dataset" required>
    An input dataset containing at least one column with lists of elements to group.
  </ParamField>
</Accordion>

<Accordion title="Outputs" icon="right-from-bracket">
  <ParamField path="ds_out" type="dataset" required>
    The result of the aggregation. Contains one row per unique element in the original column of lists.
  </ParamField>
</Accordion>

## Configuration

The following parameters can be used to configure the behaviour of the step by including them in
a JSON object as the last "input" to the step, i.e. `step(..., {"param": "value", ...}) -> (output)`.

<Accordion title="Parameters" defaultOpen="true" icon="sliders">
  <ParamField path="split_column" type="string (ds_in.column:list)" required>
    Name of column containing the lists to be split and grouped.
  </ParamField>

  <ParamField path="by" type="[string, null]">
    Optional grouping column to use for item counting and aggregation.
  </ParamField>

  <ParamField path="unique_rows" type="boolean" default="false">
    Count unique occurrences only.
    Whether to collect in the output column `rows` only the unique rows each item appeared in,
    or all rows (with duplicate row IDs if an item appeared more than once in a single row).
  </ParamField>

  <ParamField path="rows_as_str" type="boolean" default="false">
    Row IDs as strings.
    Output occurrence of items in rows as lists of strings (categorical) rather than lists of row numbers.
  </ParamField>

  <ParamField path="aggregations" type="object">
    Definition of additional aggregations.
    A dictionary mapping original columns to new aggregated columns, specifying an aggregation function for each.
    *Aggregations* are functions that reduce all the values in a particular column of a single group to a single summary value of that group.
    E.g. a `sum` aggregation of column A calculates a single total by adding up all the values in A belonging to each group.

    Possible aggregation functions accepted as `func` parameters are:

    * `n`, `size` or `count`: calculate number of rows in group
    * `sum`: sum total of values
    * `mean`: take mean of values
    * `max`: take max of values
    * `min`: take min of values
    * `mode`: find most frequent value (returns first mode if multiple exist)
    * `first`: take first item found
    * `last`: take last item found
    * `unique`: collect a list of unique values
    * `n_unique`: count the number of unique values
    * `list`: collect a list of all values
    * `concatenate`: convert all values to text and concatenate them into one long text
    * `concat_lists`: concatenate lists in all rows into a single larger list
    * `count_where`: number of rows in which the column matches a given value
    * `percent_where`: percentage of rows in which the column matches a given value

    Note that `count_where` and `percent_where` require an additional `value` parameter specifying the value to match.

    <Accordion title="Item properties">
      <ParamField path="input_aggregations" type="object">
        One item per input column.
        Each key should be the name of an input column, and each value an object defining one or more aggregations for that column.
        An individual aggregation consists of the name of a desired output column, mapped to a specific aggregation function.
        For example:

        ```json theme={null}
        {
          "input_col": {
            "output_col": {"func": "sum"}
          }
        }
        ```

        <Accordion title="Item properties">
          <ParamField path="aggregation_func" type="object">
            Object defining how to aggregate a single output column.
            Needs at least the `"func"` parameter. If the aggregation function accepts further arguments,
            like the `"value"` parameter in case of `count_where` and `percent_where`, these need to be provided also.
            For example:

            ```json theme={null}
            {
              "output_col": {"func": "count_where", "value": 2}
            }
            ```

            <Accordion title="Properties">
              <ParamField path="func" type="string">
                Aggregation function.

                Values must be one of the following:

                `n` `size` `count` `sum` `mean` `n_unique` `count_where` `percent_where` `concatenate` `max` `min` `first` `last` `mode` `concat_lists` `unique` `list`
              </ParamField>
            </Accordion>
          </ParamField>
        </Accordion>
      </ParamField>
    </Accordion>

    <Accordion title="Examples">
      * Including an aggregation function with additional parameters:

      ```json theme={null}
      {
        "product_id": {
          "products": {"func": "list"},
          "size": {"func": "count"}
        },
        "item_total": {
          "total": {"func": "sum"}
        },
        "item_category": {
          "num_food_items": {"func": "count_where", "value": "food"}
        }
      }
      ```
    </Accordion>
  </ParamField>
</Accordion>
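
To clarify how an `aggregations` object maps input columns to output columns, here is a minimal Python sketch that applies such a configuration to the rows of a single group. It is an illustration under stated assumptions, not the step's implementation: `aggregate_group`, `AGG_FUNCS` and the sample rows are hypothetical, only a subset of the functions is covered, and `percent_where` is assumed to return a value on a 0 to 100 scale.

```python
def count_where(values, value):
    """Number of entries equal to `value`."""
    return sum(1 for v in values if v == value)

def percent_where(values, value):
    # Assumption: result on a 0-100 scale; the step may use a different scale.
    return 100.0 * count_where(values, value) / len(values) if values else 0.0

# Hypothetical Python counterparts for a subset of the documented functions
AGG_FUNCS = {
    "count": lambda vals: len(vals),
    "sum": sum,
    "list": list,
    "unique": lambda vals: list(dict.fromkeys(vals)),
    "n_unique": lambda vals: len(set(vals)),
    "count_where": count_where,
    "percent_where": percent_where,
}

def aggregate_group(group_rows, aggregations):
    """Apply an `aggregations`-style config to the rows of one group."""
    out = {}
    for input_col, outputs in aggregations.items():
        values = [row[input_col] for row in group_rows]
        for output_col, spec in outputs.items():
            # Everything except "func" is passed on as an extra argument, e.g. "value"
            extra = {k: v for k, v in spec.items() if k != "func"}
            out[output_col] = AGG_FUNCS[spec["func"]](values, **extra)
    return out

# Hypothetical rows belonging to one group
group = [
    {"product_id": "p1", "item_total": 2.5, "item_category": "food"},
    {"product_id": "p2", "item_total": 4.0, "item_category": "toys"},
    {"product_id": "p3", "item_total": 1.5, "item_category": "food"},
]
config = {
    "product_id": {"products": {"func": "list"}, "size": {"func": "count"}},
    "item_total": {"total": {"func": "sum"}},
    "item_category": {"num_food_items": {"func": "count_where", "value": "food"}},
}
result = aggregate_group(group, config)
print(result)
# {'products': ['p1', 'p2', 'p3'], 'size': 3, 'total': 8.0, 'num_food_items': 2}
```

Each key of the outer object selects an input column, and each inner key names one output column, so a single input column can feed several aggregations.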
