This step performs a form of market basket analysis. It analyses items (products) that occur together unusually frequently in a set of transactions (baskets). The step creates an association rule, such as A->B, between items A and B if the presence of A makes the presence of B in the same session N times more likely. For further details about the algorithm, see e.g. association rule learning.

Usage

The following example shows how the step can be used in a recipe.

Examples

The following call creates rules between pairs of items A and B, if:
  • A occurs in at least 7 sessions
  • B occurs in at least 25% of sessions containing A
  • The presence of A in a session makes the presence of B in the same session at least twice as likely.
Note that lift is the confidence divided by the overall frequency of B. For a rule sitting exactly at the minimum confidence of 25%, the last condition therefore means that the overall frequency of B in all sessions must be at most 12.5% (half of 25%). In other words, a minimum lift of 2 means that the frequency of B in sessions already containing A must be at least twice the background frequency of B in general.
As an example, the percentage of shopping baskets containing milk (item B) may be 10%. However, amongst those baskets already containing cereals, the percentage containing milk is likely to be higher. If milk occurred in, say, 30% of baskets also containing cereals, then the lift of the rule cereal->milk would be 3: buying cereal makes buying milk 3 times more likely.
association_rules(transactions, {
  "item_id": "product_id",
  "session_id": "order_id",
  "min_support": 7,
  "min_confidence": 25,
  "min_lift": 2
}) -> (rules)
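
To make the three thresholds concrete, the following Python sketch computes support, confidence and lift for pairs of items by brute force and applies the same limits as the call above. It is only an illustration of the definitions, not the step's implementation; the function name pair_rules and the toy baskets are hypothetical.

```python
from collections import defaultdict
from itertools import permutations


def pair_rules(rows, min_support=7, min_confidence=25, min_lift=2):
    """Brute-force A->B rules from (session_id, item_id) rows."""
    sessions = defaultdict(set)
    for session_id, item_id in rows:
        sessions[session_id].add(item_id)
    n_sessions = len(sessions)

    # Count the sessions containing each item and each ordered pair of items.
    item_count = defaultdict(int)
    pair_count = defaultdict(int)
    for items in sessions.values():
        for item in items:
            item_count[item] += 1
        for a, b in permutations(items, 2):
            pair_count[(a, b)] += 1

    rules = []
    for (a, b), n_ab in pair_count.items():
        confidence = 100 * n_ab / item_count[a]        # % of A-sessions with B
        background = 100 * item_count[b] / n_sessions  # % of all sessions with B
        lift = confidence / background
        if (item_count[a] >= min_support
                and confidence >= min_confidence
                and lift >= min_lift):
            rules.append((a, b, item_count[a], confidence, lift))
    return rules


# Toy data (made up): 100 baskets, all with bread, 10 with cereal,
# 10 with milk, and 3 of the cereal baskets also containing milk.
rows = []
for s in range(100):
    rows.append((s, "bread"))
    if s < 10:
        rows.append((s, "cereal"))
    if s < 3 or 10 <= s < 17:
        rows.append((s, "milk"))

rules = pair_rules(rows)
for a, b, support, confidence, lift in rules:
    print(f"{a}->{b}: support={support}, confidence={confidence}%, lift={lift}")
```

In this toy data, bread appears in every basket, so no rule pointing to bread can reach a lift of 2 even though its confidence is 100%.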

Inputs & Outputs

The following are the inputs expected by the step and the outputs it produces. These are generally columns (ds.first_name), datasets (ds or ds[["first_name", "last_name"]]) or models (referenced by name e.g. "churn-clf").
transactions
dataset
required
A long input dataset with one row per item (product) and session (basket). In other words, sessions or baskets should be disaggregated, but each row should uniquely identify the item/product and session/basket by id or name.
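
As an illustration of the expected shape, this Python sketch expands aggregated baskets (one row per basket) into the long form described here (one row per session and item). The column names order_id and product_id follow the example call above; the data is made up, and dropping repeated items within a basket is an assumption rather than a documented requirement.

```python
# Aggregated form: one row per basket, items packed into a list.
baskets = [
    {"order_id": "o1", "products": ["milk", "cereal"]},
    {"order_id": "o2", "products": ["milk", "bread", "milk"]},
]

# Long form: one row per (session, item) combination.
transactions = [
    {"order_id": basket["order_id"], "product_id": product}
    for basket in baskets
    # dict.fromkeys drops repeated items while keeping first-seen order.
    for product in dict.fromkeys(basket["products"])
]

for row in transactions:
    print(row)
```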
rules
dataset
required
A new output dataset containing products and rules, connected into a network such that products are linked to the association rules in which they occur.

Configuration

The following parameters can be used to configure the behaviour of the step by including them in a json object as the last “input” to the step, i.e. step(..., {"param": "value", ...}) -> (output).

Parameters

item_id
string (transactions.column)
required
Name of column uniquely identifying all items/products.
session_id
[string, array[string]]
required
Name(s) of column(s) uniquely identifying all sessions/baskets/orders.
Item
string (transactions.column)
Each item in array.
item_label
string (transactions.column)
required
Column used to label items in a user-friendly manner.
itemset_min
integer
default:"2"
Minimum size of itemsets to identify. E.g. an itemset size of 3 means association rules will have 2 antecedents (e.g. A, B) and 1 consequent (C), resulting in rules of the form (A, B) -> C. The step will currently generate only single items as consequents. Values must be in the following range:
2 ≤ itemset_min ≤ 5
itemset_max
integer
default:"3"
Maximum size of itemsets to identify. E.g. an itemset size of 3 means association rules will have 2 antecedents (e.g. A, B) and 1 consequent (C), resulting in rules of the form (A, B) -> C. The step will currently generate only single items as consequents. Values must be in the following range:
2 ≤ itemset_max ≤ 5
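
The relationship between itemset size and rule shape can be sketched in Python as follows. The helper rules_from_itemset is hypothetical and only enumerates the candidate rules with a single consequent that one itemset could yield; which of them the step actually keeps depends on the support, confidence and lift thresholds.

```python
def rules_from_itemset(itemset):
    """List every (antecedents, consequent) split with a single consequent."""
    items = sorted(itemset)
    return [
        (tuple(i for i in items if i != consequent), consequent)
        for consequent in items
    ]


# An itemset of size 3 gives rules with 2 antecedents and 1 consequent.
candidates = rules_from_itemset({"A", "B", "C"})
for antecedents, consequent in candidates:
    print(f"{antecedents} -> {consequent}")
```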
min_support
[number, integer]
default:"10"
Minimum Support. Minimum support of a rule antecedent. If it is < 1 it will be taken as a proportion. In any other case it will be expected as a positive integer representing the count. Create rule A->B only if A occurred in at least this many sessions.
  • number: a proportion. Values must be in the following range: 0 < min_support < 1
  • integer: a count of sessions.
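
The two readings of min_support can be sketched in Python as below. The helper support_threshold is hypothetical, and it assumes that a proportion is taken relative to the total number of sessions; how the step itself rounds a fractional threshold is not documented here.

```python
def support_threshold(min_support, n_sessions):
    """Translate min_support into a minimum number of sessions."""
    if 0 < min_support < 1:
        # Treated as a proportion of all sessions (assumption).
        return min_support * n_sessions
    if min_support >= 1 and float(min_support).is_integer():
        # Treated as an absolute count of sessions.
        return int(min_support)
    raise ValueError("min_support must be in (0, 1) or a positive integer")


print(support_threshold(7, 1000))    # count: at least 7 sessions
print(support_threshold(0.25, 200))  # proportion: a quarter of 200 sessions
```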
min_confidence
number
default:"20"
Minimum Confidence. Expressed as a percentage. Include link A->B only if B occurred in at least this percentage of sessions also containing A. Values must be in the following range:
0 ≤ min_confidence ≤ 100
min_lift
[number, null]
Minimum Lift. Expressed as a multiplier/ratio. Include link A->B only if A makes the presence of B in the same session at least this many times more likely.
weight_metric
string
default:"rule_lift_abs"
Metric for link weight. Which association rule metric to use as the weight of links in the network generated by this step. Values must be one of the following: itemset_support_abs, itemset_support_pct, antecedent_support_abs, antecedent_support_pct, consequent_support_abs, consequent_support_pct, rule_confidence_pct, rule_lift_abs, rule_lift_pct
Whether to link items to rules. Otherwise, a product (antecedent) will be linked only to other products (consequent).
Only keep N links with largest weight. This applies individually to each node in the network, filtering its outgoing links to keep only the first N by weight. The value of weights itself is selected using the weight_metric parameter, i.e. corresponds to one of the association rule metrics (support, confidence etc.). If null, all links will be kept.
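
The per-node filtering described here can be sketched in Python as follows. The function keep_top_links and the (source, target, weight) tuples are hypothetical; the sketch only illustrates keeping the N heaviest outgoing links of each node and is not the step's own code.

```python
from collections import defaultdict


def keep_top_links(links, n=None):
    """Keep, for each source node, its n outgoing links with largest weight."""
    if n is None:
        return list(links)  # null/None keeps all links
    by_source = defaultdict(list)
    for link in links:
        by_source[link[0]].append(link)
    kept = []
    for outgoing in by_source.values():
        kept.extend(sorted(outgoing, key=lambda l: l[2], reverse=True)[:n])
    return kept


# Weights here stand in for whichever metric weight_metric selects.
links = [
    ("cereal", "milk", 3.0),
    ("cereal", "bread", 1.2),
    ("cereal", "jam", 2.1),
    ("milk", "cereal", 3.0),
]
top = keep_top_links(links, n=2)
for link in top:
    print(link)
```

Because the filter is applied per node, the single link leaving milk survives even though other nodes have heavier links that are dropped elsewhere.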
item_aggregations
[object, null]
Definition of desired aggregations for (consequent) items. A dictionary mapping original columns to new aggregated columns, specifying an aggregation function for each. Aggregations are functions that reduce all the values in a particular column of a single item/product to a single summary value for that item/product. E.g. a sum aggregation of column A calculates a single total by adding up all the values in A belonging to each item.
Possible aggregation functions accepted as func parameters are:
  • n, size or count: calculate number of rows in group
  • sum: sum total of values
  • mean: take mean of values
  • max: take max of values
  • min: take min of values
  • first: take first item found
  • last: take last item found
  • unique: collect a list of unique values
  • n_unique: count the number of unique values
  • list: collect a list of all values
  • concatenate: convert all values to text and concatenate them into one long text
  • concat_lists: concatenate lists in all rows into a single larger list
  • count_where: number of rows in which the column matches a value, needs parameter value with the value that you want to count
  • percent_where: percentage of the column where the column matches a value, needs parameter value with the value that you want to count
Note that in the case of count_where and percent_where an additional value parameter is required.
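
What a few of these functions compute can be sketched in Python as below. The helper aggregate is hypothetical and does not reproduce the step's configuration syntax; it only shows how the values of one column, belonging to a single item, are reduced to one summary value. Returning percent_where on a 0 to 100 scale is an assumption.

```python
def aggregate(values, func, value=None):
    """Reduce the values of one column for a single item to one summary."""
    if func in ("n", "size", "count"):
        return len(values)
    if func == "sum":
        return sum(values)
    if func == "n_unique":
        return len(set(values))
    if func == "count_where":
        return sum(1 for v in values if v == value)
    if func == "percent_where":
        # Assumed to be on a 0-100 scale.
        return 100 * sum(1 for v in values if v == value) / len(values)
    raise ValueError(f"unsupported aggregation: {func}")


# Made-up rows for one product: the sales channel of each purchase.
channels = ["web", "store", "web", "web"]
print(aggregate(channels, "count"))
print(aggregate(channels, "n_unique"))
print(aggregate(channels, "count_where", value="web"))
print(aggregate(channels, "percent_where", value="web"))
```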
rule_aggregations
[object, null]
Definition of desired aggregations for rules (all items in rule). A dictionary mapping original columns to new aggregated columns, specifying an aggregation function for each. Aggregations are functions that reduce all the values in a particular column of a single item/product to a single summary value for that item/product. E.g. a sum aggregation of column A calculates a single total by adding up all the values in A belonging to each item.
Possible aggregation functions accepted as func parameters are:
  • n, size or count: calculate number of rows in group
  • sum: sum total of values
  • mean: take mean of values
  • max: take max of values
  • min: take min of values
  • first: take first item found
  • last: take last item found
  • unique: collect a list of unique values
  • n_unique: count the number of unique values
  • list: collect a list of all values
  • concatenate: convert all values to text and concatenate them into one long text
  • concat_lists: concatenate lists in all rows into a single larger list
  • count_where: number of rows in which the column matches a value, needs parameter value with the value that you want to count
  • percent_where: percentage of the column where the column matches a value, needs parameter value with the value that you want to count
Note that in the case of count_where and percent_where an additional value parameter is required.