Association rules¶
network • basket analysis • association rules • market basket • itemset
Calculate association rules for items/products in a dataset of transactions.
This is a form of market basket analysis. It analyses items (products) that occur together unusually frequently in a set of transactions (baskets).
The step creates an association rule, such as A->B, between items A and B, if the presence of A makes the presence of B in the same session N times more likely.
For further details about the algorithm see e.g. association rule learning.
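As a rough illustration of what "N times more likely" means, the following Python sketch computes the antecedent support, confidence and lift of a rule A->B from a handful of baskets. It is not the step's implementation; the function name `rule_metrics` and the toy baskets are made up for this example.

```python
def rule_metrics(baskets, antecedent, consequent):
    """Return antecedent support, confidence and lift for antecedent -> consequent."""
    n_sessions = len(baskets)
    n_a = sum(1 for b in baskets if antecedent in b)
    n_b = sum(1 for b in baskets if consequent in b)
    n_ab = sum(1 for b in baskets if antecedent in b and consequent in b)
    if n_a == 0 or n_b == 0:
        return None  # the rule is undefined if either item never occurs
    confidence = n_ab / n_a        # share of A-sessions that also contain B
    background = n_b / n_sessions  # share of all sessions that contain B
    return {
        "antecedent_support": n_a,
        "confidence": confidence,
        "lift": confidence / background,
    }


# Toy data: 10 baskets, invented for illustration only.
baskets = [
    {"cereal", "milk"}, {"cereal", "milk"}, {"cereal", "bread"},
    {"milk"}, {"bread"}, {"bread", "eggs"}, {"eggs"},
    {"eggs", "bread"}, {"bread"}, {"eggs"},
]
metrics = rule_metrics(baskets, "cereal", "milk")
```

Here cereal appears in 3 baskets, 2 of which also contain milk, while milk appears in 3 of all 10 baskets, so the lift is (2/3) / 0.3.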
Usage¶
The following are the step's expected inputs and outputs and their specific types.
association_rules(transactions: dataset, {"param": value}) -> (rules: dataset)
where the object {"param": value} is optional in most cases and, if present, may contain any of the parameters described in the corresponding section below.
Example¶
The following call creates rules between pairs of items A and B, if:
- A occurs in at least 7 sessions
- B occurs in at least 25% of sessions containing A
- The presence of A in a session makes the presence of B in the same session at least twice as likely.
Note that, for a rule sitting exactly at the 25% confidence threshold, the last condition is equivalent to saying that the overall frequency of B across all sessions must be at most 12.5% (half of 25%). In other words, a minimum lift of 2 means that the frequency of B in sessions already containing A must be at least twice the background frequency of B in general.
As an example, the percentage of shopping baskets containing milk (item B) may be 10%. However, amongst those baskets already containing cereal, the percentage containing milk is likely to be higher. If milk occurred in, say, 30% of baskets that also contain cereal, then the lift of the rule cereal->milk would be 3: buying cereal makes buying milk 3 times more likely.
association_rules(transactions, {
"item_id": "product_id",
"session_id": "order_id",
"min_support": 7,
"min_confidence": 25,
"min_lift": 2
}) -> (rules)
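The three thresholds in the call above can be checked by hand. The Python sketch below is only an illustration of the arithmetic, not the step's code; the helper `passes_thresholds` and the figure of 40 cereal baskets are assumptions made for the example.

```python
def passes_thresholds(antecedent_sessions, confidence_pct, consequent_pct,
                      min_support=7, min_confidence=25, min_lift=2):
    """Check one rule A -> B against the three thresholds of the example call."""
    # Lift compares B's frequency among A-sessions with its overall frequency.
    lift = confidence_pct / consequent_pct
    return (antecedent_sessions >= min_support
            and confidence_pct >= min_confidence
            and lift >= min_lift)


# cereal -> milk: milk is in 10% of all baskets and in 30% of cereal baskets (lift 3).
# The count of 40 cereal baskets is an invented figure.
keeps_rule = passes_thresholds(40, 30, 10)

# At exactly 25% confidence, a consequent found in 20% of all baskets only
# reaches a lift of 1.25, so the rule is dropped.
too_common = passes_thresholds(40, 25, 20)
```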
Inputs¶
transactions: dataset
A long input dataset with one row per item (product) and session (basket). In other words, sessions or baskets should be disaggregated, but each row should uniquely identify the item/product and session/basket by id or name.
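To picture the expected long format, here is a small Python sketch that groups such rows back into baskets. The column names `order_id` and `product_id` are borrowed from the example call; the rows and the helper `group_into_baskets` are hypothetical.

```python
from collections import defaultdict

# One row per (session, item) pair, i.e. the disaggregated "long" layout.
rows = [
    {"order_id": 1, "product_id": "cereal"},
    {"order_id": 1, "product_id": "milk"},
    {"order_id": 2, "product_id": "bread"},
    {"order_id": 2, "product_id": "milk"},
    {"order_id": 3, "product_id": "cereal"},
]


def group_into_baskets(rows, session_id="order_id", item_id="product_id"):
    """Collect the set of items seen in each session."""
    baskets = defaultdict(set)
    for row in rows:
        baskets[row[session_id]].add(row[item_id])
    return dict(baskets)


baskets = group_into_baskets(rows)
```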
Outputs¶
rules: dataset
A new output dataset containing products and rules, connected into a network such that products are linked to the association rules in which they occur.
Parameters¶
item_id: string
Name of column uniquely identifying all items/products.
session_id: string | array
Name(s) of column(s) uniquely identifying all sessions/baskets/orders.
item_label: string
Column used to label items in a user-friendly manner.
itemset_min: integer = 2
Minimum size of itemsets to identify. E.g. an itemset size of 3 means association rules will have 2 antecedents (e.g. A, B) and 1 consequent (C), resulting in rules of the form (A, B) -> C. The step currently generates only single items as consequents.
Range: 2 ≤ itemset_min ≤ 5
itemset_max: integer = 3
Maximum size of itemsets to identify. E.g. an itemset size of 3 means association rules will have 2 antecedents (e.g. A, B) and 1 consequent (C), resulting in rules of the form (A, B) -> C. The step currently generates only single items as consequents.
Range: 2 ≤ itemset_max ≤ 5
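To make the link between itemset size and rule shape concrete, this Python sketch lists the single-consequent rules that can be read off one itemset. It only illustrates the rule form described above; the helper `single_consequent_rules` is hypothetical and says nothing about how the step enumerates or filters candidates.

```python
def single_consequent_rules(itemset):
    """Return (antecedents, consequent) pairs with exactly one consequent each."""
    items = sorted(itemset)
    # Each item takes a turn as the consequent; the rest form the antecedents.
    return [(tuple(i for i in items if i != c), c) for c in items]


# An itemset of size 3 gives rules with 2 antecedents and 1 consequent.
rules = single_consequent_rules({"A", "B", "C"})
```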
min_support: number | integer = 10
Minimum Support. Minimum support of a rule antecedent. A value < 1 is interpreted as a proportion; any other value is expected to be a positive integer representing a count. Create rule A->B only if A occurred in at least this many sessions.
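The two readings of min_support can be sketched as follows. This is an assumption-laden illustration: the helper `min_support_as_count` is hypothetical, it takes the proportion to be relative to the total number of sessions, and it does not round, since the description does not say how the step handles fractional counts.

```python
def min_support_as_count(min_support, n_sessions):
    """Translate min_support into a minimum number of sessions."""
    if min_support < 1:
        # Values below 1 are read as a proportion (assumed: of all sessions).
        return min_support * n_sessions
    # Anything else is read as an absolute count.
    return min_support


as_proportion = min_support_as_count(0.1, 500)  # 10% of 500 sessions
as_count = min_support_as_count(7, 500)         # plain count, as in the example call
```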
min_confidence: number = 20
Minimum Confidence. Expressed as a percentage. Include link A->B only if B occurred in at least this percentage of sessions also containing A.
Range: 0 ≤ min_confidence ≤ 100
min_lift: number | null
Minimum Lift. Expressed as a multiplier/ratio. Include link A->B only if A makes the presence of B in the same session at least this many times more likely.
weight_metric: string = "rule_lift_abs"
Metric for link weight. Which association rule metric to use as the weight of links in the network generated by this step.
Must be one of: "itemset_support_abs", "itemset_support_pct", "antecedent_support_abs", "antecedent_support_pct", "consequent_support_abs", "consequent_support_pct", "rule_confidence_pct", "rule_lift_abs", "rule_lift_pct"
link_rules: boolean = True
Whether to link items to rules. Otherwise, a product (antecedent) will be linked only to other products (consequent).
link_top_n: integer | null
Only keep N links with largest weight. This applies individually to each node in the network, filtering its outgoing links to keep only the first N by weight. The weight values themselves are selected using the weight_metric parameter, i.e. they correspond to one of the association rule metrics (support, confidence etc.). If null, all links will be kept.
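The per-node filtering described here can be sketched in Python as below. The tuple layout `(source, target, weight)`, the helper `keep_top_n_links` and the weights are assumptions for illustration; how the step breaks ties between equal weights is not stated and is not modelled.

```python
from collections import defaultdict


def keep_top_n_links(links, n):
    """Keep, for each source node, only its n heaviest outgoing links."""
    if n is None:
        return list(links)  # null means no filtering
    by_source = defaultdict(list)
    for link in links:
        by_source[link[0]].append(link)
    kept = []
    for source_links in by_source.values():
        # Sort each node's outgoing links by weight, heaviest first.
        source_links.sort(key=lambda link: link[2], reverse=True)
        kept.extend(source_links[:n])
    return kept


# (source, target, weight) triples with invented lift values as weights.
links = [
    ("cereal", "milk", 3.0),
    ("cereal", "bread", 1.2),
    ("cereal", "eggs", 2.1),
    ("bread", "milk", 1.5),
]
top_links = keep_top_n_links(links, 2)
```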
item_aggregations: object | null
Definition of desired aggregations for (consequent) items. A dictionary mapping original columns to new aggregated columns, specifying an aggregation function for each.
Aggregations are functions that reduce all the values in a particular column of a single item/product to a single summary value for that item/product. E.g. a sum aggregation of column A calculates a single total by adding up all the values in A belonging to each item.
Possible aggregation functions accepted as func parameters are:
- n, size or count: calculate number of rows in group
- sum: sum total of values
- mean: take mean of values
- max: take max of values
- min: take min of values
- first: take first item found
- last: take last item found
- unique: collect a list of unique values
- n_unique: count the number of unique values
- list: collect a list of all values
- concatenate: convert all values to text and concatenate them into one long text
- concat_lists: concatenate lists in all rows into a single larger list
- count_where: number of rows in which the column matches a value, needs parameter value with the value that you want to count
- percent_where: percentage of the column where the column matches a value, needs parameter value with the value that you want to count
Note that in the case of count_where and percent_where an additional value parameter is required.
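To illustrate what a few of these functions compute, here is a Python sketch applied to the values of one column for a single item. It mirrors the descriptions in the list only; the helper functions, the `channels` values and the result keys are hypothetical, and the sketch does not show the step's configuration syntax.

```python
def count_where(values, value):
    """Number of rows whose value matches the given value."""
    return sum(1 for v in values if v == value)


def percent_where(values, value):
    """Percentage of rows whose value matches the given value."""
    return 100 * count_where(values, value) / len(values)


# Invented column values from all rows belonging to one item.
channels = ["web", "store", "web", "web"]
summary = {
    "n": len(channels),                              # number of rows
    "n_unique": len(set(channels)),                  # number of distinct values
    "count_where_web": count_where(channels, "web"),
    "percent_where_web": percent_where(channels, "web"),
}
```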
rule_aggregations: object | null
Definition of desired aggregations for rules (all items in rule). A dictionary mapping original columns to new aggregated columns, specifying an aggregation function for each.
Aggregations are functions that reduce all the values in a particular column of a single item/product to a single summary value for that item/product. E.g. a sum aggregation of column A calculates a single total by adding up all the values in A belonging to each item.
Possible aggregation functions accepted as func parameters are:
- n, size or count: calculate number of rows in group
- sum: sum total of values
- mean: take mean of values
- max: take max of values
- min: take min of values
- first: take first item found
- last: take last item found
- unique: collect a list of unique values
- n_unique: count the number of unique values
- list: collect a list of all values
- concatenate: convert all values to text and concatenate them into one long text
- concat_lists: concatenate lists in all rows into a single larger list
- count_where: number of rows in which the column matches a value, needs parameter value with the value that you want to count
- percent_where: percentage of the column where the column matches a value, needs parameter value with the value that you want to count
Note that in the case of count_where and percent_where an additional value parameter is required.