This is a form of market basket analysis. It identifies items (products) that occur together unusually frequently in a set of transactions (baskets).

The step creates an association rule, such as A->B, between items A and B, if the presence of A makes the presence of B in the same session N times more likely.

For further details about the algorithm see e.g. association rule learning.
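
As a rough illustration of the quantities involved, the following sketch computes support, confidence and lift for a single rule on a handful of made-up sessions. It uses the standard textbook definitions of these metrics; the function names and the toy data are hypothetical and are not part of the step itself.

```python
# Toy sessions (baskets); the item names are made up for illustration.
sessions = [
    {"bread", "butter"},
    {"bread", "butter", "jam"},
    {"bread"},
    {"milk"},
    {"butter", "milk"},
]


def support(items, sessions):
    """Fraction of sessions that contain every item in `items`."""
    items = set(items)
    return sum(items <= s for s in sessions) / len(sessions)


def rule_metrics(antecedent, consequent, sessions):
    """Support, confidence and lift of the rule antecedent -> consequent.

    Assumes both sides occur in at least one session (no zero division).
    """
    s_a = support(antecedent, sessions)
    s_b = support(consequent, sessions)
    s_ab = support(set(antecedent) | set(consequent), sessions)
    confidence = s_ab / s_a      # share of sessions with A that also contain B
    lift = confidence / s_b      # how many times more likely B is, given A
    return {"support": s_ab, "confidence": confidence, "lift": lift}


metrics = rule_metrics({"bread"}, {"butter"}, sessions)
print(metrics)
```

A lift above 1 means the consequent appears more often in sessions containing the antecedent than in sessions overall.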

item_id
string
required

Name of column uniquely identifying all items/products.

session_id
[string, array[string]]
required

Name(s) of column(s) uniquely identifying all sessions/baskets/orders.

item_label
string
required

Column used to label items in a user-friendly manner.

itemset_min
integer
default:
"2"

Minimum size of itemsets to identify. E.g. an itemset size of 3 means association rules will have 2 antecedents (e.g. A, B) and 1 consequent (C), resulting in rules of the form (A, B) -> C. The step currently generates only single-item consequents.

Values must be in the following range:

2 ≤ itemset_min ≤ 5

itemset_max
integer
default:
"3"

Maximum size of itemsets to identify. E.g. an itemset size of 3 means association rules will have 2 antecedents (e.g. A, B) and 1 consequent (C), resulting in rules of the form (A, B) -> C. The step currently generates only single-item consequents.

Values must be in the following range:

2 ≤ itemset_max ≤ 5

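
To make the relationship between itemset size and rule shape concrete, here is a minimal sketch that enumerates candidate rules with a single-item consequent for every itemset size between itemset_min and itemset_max. The function candidate_rules is hypothetical; it only lists candidates and does no frequency filtering, so it is not a description of how the step itself searches for rules.

```python
from itertools import combinations


def candidate_rules(items, itemset_min=2, itemset_max=3):
    """Yield (antecedent, consequent) pairs with a single-item consequent.

    An itemset of size k produces k rules, each with k - 1 antecedents.
    """
    items = sorted(items)
    for size in range(itemset_min, itemset_max + 1):
        for itemset in combinations(items, size):
            for consequent in itemset:
                antecedent = tuple(i for i in itemset if i != consequent)
                yield antecedent, consequent


rules = list(candidate_rules({"A", "B", "C"}))
for antecedent, consequent in rules:
    print(antecedent, "->", consequent)
```

With three items and the default sizes 2 and 3, this lists rules such as (A,) -> B from itemsets of size 2 and (A, B) -> C from the itemset of size 3.
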
min_support
[number, integer]
default:
"10"

Minimum Support. Minimum support of a rule antecedent. A value below 1 is interpreted as a proportion of sessions; any other value must be a positive integer giving an absolute number of sessions. Rule A->B is created only if A occurred in at least this many sessions.
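
The two readings of min_support can be sketched as follows. The helper meets_min_support is hypothetical, and treating the proportion as a share of all sessions is an assumption; the sketch compares proportions directly so that no rounding rule has to be guessed.

```python
def meets_min_support(antecedent_count, n_sessions, min_support=10):
    """Check an antecedent against min_support.

    antecedent_count: number of sessions containing the antecedent.
    n_sessions: total number of sessions (assumed denominator for proportions).
    """
    if min_support < 1:
        # Interpreted as a proportion of sessions.
        return antecedent_count / n_sessions >= min_support
    # Interpreted as an absolute session count.
    return antecedent_count >= min_support


print(meets_min_support(12, 1000))        # count threshold of 10
print(meets_min_support(12, 1000, 0.05))  # proportion threshold of 5%
```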

min_confidence
number
default:
"20"

Minimum Confidence. Expressed as a percentage. Include link A->B only if B occurred in at least this percentage of the sessions containing A.

Values must be in the following range:

0 ≤ min_confidence ≤ 100

min_lift
[number, null]

Minimum Lift. Expressed as a multiplier/ratio. Include link A->B only if A makes the presence of B in the same session at least this many times more likely.
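
A minimal sketch of how the confidence and lift thresholds might be applied to a rule whose metrics are already known. The function keep_rule is hypothetical, and treating a null (None) min_lift as "no lift filter" is an assumption not stated above.

```python
def keep_rule(confidence_pct, lift, min_confidence=20, min_lift=None):
    """Return True if a rule passes the confidence and lift thresholds.

    confidence_pct: rule confidence as a percentage (0 to 100).
    lift: rule lift as a ratio (1.0 means A and B are independent).
    """
    if confidence_pct < min_confidence:
        return False
    # Assumption: None disables the lift filter.
    if min_lift is not None and lift < min_lift:
        return False
    return True


print(keep_rule(confidence_pct=35.0, lift=1.1))
print(keep_rule(confidence_pct=35.0, lift=1.1, min_lift=1.5))
```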

weight_metric
string
default:
"rule_lift_abs"

Metric for link weight. Which association rule metric to use as the weight of links in the network generated by this step.

Values must be one of the following:

  • itemset_support_abs
  • itemset_support_pct
  • antecedent_support_abs
  • antecedent_support_pct
  • consequent_support_abs
  • consequent_support_pct
  • rule_confidence_pct
  • rule_lift_abs
  • rule_lift_pct

Whether to link items to rules. Otherwise, a product (antecedent) will be linked only to other products (consequent).

Only keep the N links with the largest weight. This applies individually to each node in the network, filtering its outgoing links to keep only the first N by weight. The weight itself is selected using the weight_metric parameter, i.e. it corresponds to one of the association rule metrics (support, confidence etc.). If null, all links will be kept.
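
The per-node filtering described above can be sketched like this. The function top_n_links and the (source, target, weight) tuple layout are hypothetical and serve only to illustrate the idea of keeping the N heaviest outgoing links of each node.

```python
from collections import defaultdict


def top_n_links(links, n=None):
    """Keep the n heaviest outgoing links per source node.

    links: list of (source, target, weight) tuples.
    n: number of links to keep per source; None keeps all links.
    """
    if n is None:
        return list(links)
    by_source = defaultdict(list)
    for link in links:
        by_source[link[0]].append(link)
    kept = []
    for source_links in by_source.values():
        # Sort each node's outgoing links by weight, heaviest first.
        source_links.sort(key=lambda link: link[2], reverse=True)
        kept.extend(source_links[:n])
    return kept


links = [("A", "B", 3.0), ("A", "C", 1.5), ("A", "D", 2.0), ("B", "A", 1.2)]
print(top_n_links(links, n=2))
```

Here node A keeps only its links to B and D, while node B keeps its single link.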

item_aggregations
[object, null]

Definition of desired aggregations for (consequent) items. A dictionary mapping original columns to new aggregated columns, specifying an aggregation function for each. Aggregations are functions that reduce all the values in a particular column of a single item/product to a single summary value for that item/product. E.g. a sum aggregation of column A calculates a single total by adding up all the values in A belonging to each item.

Possible aggregation functions accepted as the func parameter are:

  • n, size or count: calculate number of rows in group
  • sum: sum total of values
  • mean: take mean of values
  • max: take max of values
  • min: take min of values
  • first: take first item found
  • last: take last item found
  • unique: collect a list of unique values
  • n_unique: count the number of unique values
  • list: collect a list of all values
  • concatenate: convert all values to text and concatenate them into one long text
  • concat_lists: concatenate lists in all rows into a single larger list
  • count_where: number of rows in which the column matches a given value; requires the parameter value, set to the value to count
  • percent_where: percentage of rows in which the column matches a given value; requires the parameter value, set to the value to count

Note that in the case of count_where and percent_where an additional value parameter is required.
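
To illustrate what a few of these aggregations compute, the sketch below groups some made-up rows by item and reduces each group to summary values. It is plain Python and does not reproduce the step's configuration syntax; the column names, the output names and the helpers count_where and percent_where are hypothetical, and returning percent_where on a 0 to 100 scale is an assumption.

```python
from collections import defaultdict

# Made-up rows: one row per purchase of an item.
rows = [
    {"item_id": "A", "price": 2.0, "channel": "web"},
    {"item_id": "A", "price": 3.0, "channel": "store"},
    {"item_id": "B", "price": 5.0, "channel": "web"},
]


def count_where(values, value):
    """Number of entries equal to `value`."""
    return sum(v == value for v in values)


def percent_where(values, value):
    """Percentage (0 to 100) of entries equal to `value`."""
    return 100 * count_where(values, value) / len(values)


# Group rows by item, then reduce each group to one summary per item.
groups = defaultdict(list)
for row in rows:
    groups[row["item_id"]].append(row)

summary = {
    item: {
        "total_price": sum(r["price"] for r in group),            # sum
        "n_channels": len({r["channel"] for r in group}),         # n_unique
        "web_rows": count_where([r["channel"] for r in group], "web"),
        "web_pct": percent_where([r["channel"] for r in group], "web"),
    }
    for item, group in groups.items()
}
print(summary)
```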

rule_aggregations
[object, null]

Definition of desired aggregations for rules (all items in rule). A dictionary mapping original columns to new aggregated columns, specifying an aggregation function for each. Aggregations are functions that reduce all the values in a particular column of a single item/product to a single summary value for that item/product. E.g. a sum aggregation of column A calculates a single total by adding up all the values in A belonging to each item.

Possible aggregation functions accepted as the func parameter are:

  • n, size or count: calculate number of rows in group
  • sum: sum total of values
  • mean: take mean of values
  • max: take max of values
  • min: take min of values
  • first: take first item found
  • last: take last item found
  • unique: collect a list of unique values
  • n_unique: count the number of unique values
  • list: collect a list of all values
  • concatenate: convert all values to text and concatenate them into one long text
  • concat_lists: concatenate lists in all rows into a single larger list
  • count_where: number of rows in which the column matches a given value; requires the parameter value, set to the value to count
  • percent_where: percentage of rows in which the column matches a given value; requires the parameter value, set to the value to count

Note that in the case of count_where and percent_where an additional value parameter is required.