This step is essentially an aggregation performed after “exploding” a column of lists so that each list item has its own row. By default the step produces one row per unique list item, and two columns: a count of how many times each list item was encountered, and a column “rows” recording the row numbers of the lists in which the item was found ([1,3,7] would mean an item was present in the lists of rows 1, 3 and 7). In addition, predefined functions can be used to add further aggregations of the grouped input dataset.

For example, if a dataset contains texts already separated into lists of individual words, this step will create a new dataset containing one row per word, a column containing each word’s frequency (count) across all texts, and another column of lists indicating in which rows the word was found.
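
The default behaviour described above can be approximated in pandas. This is only a sketch of the idea, not the step's own implementation: the column name `words`, the output names `item`, `count` and `rows`, and the use of 0-based row numbers are all assumptions made for illustration.

```python
import pandas as pd

# Hypothetical input: one row per text, already split into lists of words.
df = pd.DataFrame({"words": [["red", "fish"], ["blue", "fish", "fish"], ["red"]]})

# "Explode" the lists: one row per list item, keeping the original row number.
exploded = (
    df["words"]
    .explode()
    .rename("item")
    .rename_axis("row")
    .reset_index()
)

# Aggregate per unique item: its frequency and the rows it was found in.
result = (
    exploded.groupby("item")
    .agg(
        count=("row", "count"),
        rows=("row", lambda s: sorted(set(s.tolist()))),
    )
    .reset_index()
)
print(result)
```

Here "fish" ends up with a count of 3 but only two row numbers, because it occurs twice in the second text.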

Optionally, a grouping column can be specified using the "by" parameter, in which case otherwise identical items belonging to different groups are counted separately. If the dataset contains texts in different languages, for example, one may not want to group all occurrences of the same word together, irrespective of language. The word “angel” in German signifies a fishing rod, for example, “any” in Catalan means “year”, and the Italian word “burro” means “butter” while in Spanish it refers to “donkey”. Using language as the grouping column would preserve the word in each language as a separate group.
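
The effect of a grouping column can be illustrated with a small pandas sketch. The column names `lang` and `words` and the sample data are hypothetical, and the code only mimics the counting logic rather than reproducing the step itself.

```python
import pandas as pd

# Hypothetical input: tokenized texts with a language column.
df = pd.DataFrame({
    "lang": ["es", "it", "es"],
    "words": [["burro", "gris"], ["burro", "pane"], ["burro"]],
})

# One row per word, with the language of its source text carried along.
exploded = (
    df.explode("words")
    .rename(columns={"words": "item"})
    .rename_axis("row")
    .reset_index()
)

# Without a grouping column, every "burro" falls into the same group.
pooled = exploded.groupby("item").size()

# Grouping by language keeps Spanish and Italian "burro" apart.
by_lang = exploded.groupby(["lang", "item"]).size()

print(pooled["burro"])           # all occurrences pooled together
print(by_lang[("es", "burro")])  # Spanish occurrences only
print(by_lang[("it", "burro")])  # Italian occurrences only
```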

split_column
string
required

Name of column containing the lists to be split and grouped.

by
[string, null]

Optional grouping column to use for item counting and aggregation.

unique_rows
boolean

Count unique occurrences only. Whether to collect in the output column “rows” only the unique rows each item appeared in, or all rows (with duplicate row IDs if an item appeared more than once in a single row).
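
The difference between the two settings can be sketched in pandas as follows. The data and column names are made up for illustration, and row numbers are assumed to be 0-based here.

```python
import pandas as pd

# Hypothetical input: "fish" appears twice in the first list.
df = pd.DataFrame({"words": [["fish", "fish", "red"], ["fish"]]})

exploded = (
    df["words"].explode().rename("item").rename_axis("row").reset_index()
)
grouped = exploded.groupby("item")["row"]

# All rows: a row ID is repeated for every occurrence within that row.
all_rows = grouped.agg(lambda s: s.tolist())

# Unique rows only: each row ID is listed once per item.
unique_rows = grouped.agg(lambda s: sorted(set(s.tolist())))

print(all_rows["fish"])
print(unique_rows["fish"])
```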

rows_as_str
boolean

Row IDs as strings. Output occurrence of items in rows as lists of strings (categorical) rather than lists of row numbers.

aggregations
object

Definition of additional aggregations. A dictionary mapping original columns to new aggregated columns, specifying an aggregation function for each. Aggregations are functions that reduce all the values in a particular column of a single group to a single summary value of that group. E.g. a sum aggregation of column A calculates a single total by adding up all the values in A belonging to each group.

Possible aggregation functions accepted as func parameters are:

  • n, size or count: calculate number of rows in group
  • sum: sum total of values
  • mean: take mean of values
  • max: take max of values
  • min: take min of values
  • first: take first item found
  • last: take last item found
  • unique: collect a list of unique values
  • n_unique: count the number of unique values
  • list: collect a list of all values
  • concatenate: convert all values to text and concatenate them into one long text
  • concat_lists: concatenate lists in all rows into a single larger list
  • count_where: number of rows in which the column matches a value; requires a value parameter specifying the value to count
  • percent_where: percentage of rows in which the column matches a value; requires a value parameter specifying the value to count

Note that in the case of count_where and percent_where an additional value parameter is required.
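
As a rough illustration of how such a definition might be read, the sketch below applies a small aggregations dictionary with pandas. Several things are assumptions rather than documented behaviour: the nesting `{original column: {new column: {"func": ..., "value": ...}}}`, the idea that aggregations are computed over the exploded rows, and the choice to express `percent_where` on a 0 to 100 scale. The `FUNCS` lookup is a hypothetical stand-in covering only a few of the functions listed above.

```python
import pandas as pd

# Assumed shape: {original column: {new column: {"func": name, ...params}}}
aggregations = {
    "likes": {"likes_total": {"func": "sum"}},
    "lang": {
        "n_langs": {"func": "n_unique"},
        "pct_spanish": {"func": "percent_where", "value": "es"},
    },
}

# pandas stand-ins for a few of the listed functions (not the step's own code).
FUNCS = {
    "sum": lambda s, **_: s.sum(),
    "mean": lambda s, **_: s.mean(),
    "n_unique": lambda s, **_: s.nunique(),
    "list": lambda s, **_: s.tolist(),
    "count_where": lambda s, value, **_: int((s == value).sum()),
    "percent_where": lambda s, value, **_: 100 * (s == value).mean(),
}

# Hypothetical input dataset.
df = pd.DataFrame({
    "words": [["fish", "red"], ["fish"], ["blue", "fish"]],
    "lang": ["en", "es", "es"],
    "likes": [10, 5, 1],
})

exploded = df.explode("words").rename(columns={"words": "item"})

rows = []
for item, group in exploded.groupby("item"):
    row = {"item": item, "count": len(group)}
    for column, outputs in aggregations.items():
        for new_name, spec in outputs.items():
            # Everything except "func" is passed on as a parameter (e.g. value).
            params = {k: v for k, v in spec.items() if k != "func"}
            row[new_name] = FUNCS[spec["func"]](group[column], **params)
    rows.append(row)

result = pd.DataFrame(rows)
print(result)
```

Each group is reduced to one summary value per requested output column, so "fish" (present in all three rows) gets the sum of all three `likes` values, while "red" only sees the first row.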