# Association rules

network • basket analysis • association rules • market basket • itemset

Calculate association rules for items/products in a dataset of transactions.

This is a form of market basket analysis. It analyses items (products) that occur together unusually frequently in a set of transactions (baskets).

The step creates an association rule, such as A->B, between items A and B, if the presence of A makes the presence of B in the same session N times more likely.

For further details about the algorithm see e.g. association rule learning.

## Usage

The following are the step's expected inputs and outputs and their specific types.

```
association_rules(transactions: dataset, {
"param": value
}) -> (rules: dataset)
```

where the object `{"param": value}` is optional in most cases and, if present, may contain any of the parameters described in the corresponding section below.

#### Example

The following call creates rules between pairs of items A and B, if:

- A occurs in at least 7 sessions
- B occurs in at least 25% of sessions containing A
- The presence of A in a session makes the presence of B in the same session at least twice as likely.

Note that, for a rule whose confidence is exactly 25%, the last condition is equivalent to saying that the overall frequency of B in all sessions must be at most 12.5% (half of 25%). In other words, a minimum lift of 2 means that the frequency of B, in sessions already containing A, must be at least twice the background frequency of B in general.

As an example, the percentage of shopping baskets containing milk (item B) may be 10%. However, amongst those baskets already containing cereal, the percentage containing milk is likely to be higher. If milk occurred e.g. in 30% of baskets also containing cereal, then the lift of the rule cereal->milk would be 3. Buying cereal makes buying milk 3 times more likely.
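The arithmetic of the cereal->milk example can be reproduced with a few lines of Python. This is only an illustrative sketch on made-up baskets, not the step's implementation; the variable names are assumptions chosen for the example.

```python
# Toy data: 30 baskets, each a set of items (illustrative, not a real dataset).
baskets = (
    [{"cereal", "milk"}] * 3   # cereal and milk together
    + [{"cereal"}] * 7         # cereal without milk
    + [{"bread"}] * 20         # neither cereal nor milk
)

n_total = len(baskets)
n_cereal = sum("cereal" in b for b in baskets)
n_milk = sum("milk" in b for b in baskets)
n_both = sum({"cereal", "milk"} <= b for b in baskets)

# Confidence of cereal->milk: share of cereal baskets that also contain milk.
confidence_pct = 100 * n_both / n_cereal
# Background frequency of milk across all baskets.
milk_pct = 100 * n_milk / n_total
# Lift: confidence divided by background frequency, computed from counts
# to avoid floating-point rounding.
lift = (n_both * n_total) / (n_cereal * n_milk)

print(confidence_pct, milk_pct, lift)  # 30.0 10.0 3.0
```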

```
association_rules(transactions, {
"item_id": "product_id",
"session_id": "order_id",
"min_support": 7,
"min_confidence": 25,
"min_lift": 2
}) -> (rules)
```
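To make the three conditions concrete, here is a minimal Python sketch of how such thresholds could be applied to pairs of items. It is not the step's actual algorithm: the function `pair_rules` and its output fields are hypothetical, it handles only single-item antecedents, and it treats `min_support` as an absolute count.

```python
from collections import defaultdict
from itertools import permutations


def pair_rules(rows, min_support=7, min_confidence=25, min_lift=2):
    """rows: iterable of (session_id, item_id) pairs in long format."""
    sessions = defaultdict(set)
    for session_id, item_id in rows:
        sessions[session_id].add(item_id)

    n_sessions = len(sessions)
    item_count = defaultdict(int)  # number of sessions containing each item
    pair_count = defaultdict(int)  # number of sessions containing both A and B
    for items in sessions.values():
        for item in items:
            item_count[item] += 1
        for a, b in permutations(items, 2):
            pair_count[(a, b)] += 1

    rules = []
    for (a, b), n_ab in pair_count.items():
        confidence = 100 * n_ab / item_count[a]
        lift = (n_ab * n_sessions) / (item_count[a] * item_count[b])
        if (
            item_count[a] >= min_support
            and confidence >= min_confidence
            and lift >= min_lift
        ):
            rules.append(
                {"antecedent": a, "consequent": b,
                 "confidence": confidence, "lift": lift}
            )
    return rules


# Toy long-format data: 3 orders with cereal and milk, 7 with cereal only,
# 20 with bread only.
rows = (
    [(i, item) for i in range(3) for item in ("cereal", "milk")]
    + [(i, "cereal") for i in range(3, 10)]
    + [(i, "bread") for i in range(10, 30)]
)
rules = pair_rules(rows)
print(rules)
```

With this toy data only cereal->milk passes all three thresholds. The reverse rule milk->cereal is dropped because milk occurs in just 3 sessions, below the minimum support of 7.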

## Inputs

transactions: dataset

A long input dataset with one row per item (product) and session (basket). In other words, sessions
or baskets should be *dis*aggregated, but each row should uniquely identify the item/product *and*
session/basket by ID or name.
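As an illustration of the expected long format, the following Python sketch disaggregates a few made-up orders into one row per session and item. The column names `order_id` and `product_id` are borrowed from the usage example; the data itself is hypothetical.

```python
# Hypothetical orders in aggregated form: one entry per order listing its products.
orders = {
    "order-1": ["cereal", "milk"],
    "order-2": ["bread"],
}

# Disaggregate into long format: one row per (session, item) combination.
transactions = [
    {"order_id": order_id, "product_id": product_id}
    for order_id, products in orders.items()
    for product_id in products
]

for row in transactions:
    print(row)
```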

## Outputs

rules: dataset

A new output dataset containing products and rules, connected into a network such that products are linked to the association rules in which they occur.

## Parameters

item_id: string

Name of column uniquely identifying all items/products.

session_id: string | array

Name(s) of column(s) uniquely identifying all sessions/baskets/orders.

item_label: string

Column used to label items in a user-friendly manner.

itemset_min: integer = 2

Minimum size of itemsets to identify. E.g. an itemset size of 3 means association rules will have 2 antecedents (e.g. A, B) and 1 consequent (C), resulting in rules of the form (A, B) -> C. The step will currently generate only single items as consequents.

Range: `2 ≤ itemset_min ≤ 5`

itemset_max: integer = 3

Maximum size of itemsets to identify. E.g. an itemset size of 3 means association rules will have 2 antecedents (e.g. A, B) and 1 consequent (C), resulting in rules of the form (A, B) -> C. The step will currently generate only single items as consequents.

Range: `2 ≤ itemset_max ≤ 5`
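The relation between itemset size and rule shape can be sketched in Python as follows. The helper `single_consequent_rules` is hypothetical and only enumerates candidate rules with one consequent for a given itemset; which of these the step actually keeps depends on the thresholds described below.

```python
def single_consequent_rules(itemset):
    """Return (antecedents, consequent) pairs with exactly one consequent item."""
    items = sorted(itemset)
    return [
        (tuple(x for x in items if x != consequent), consequent)
        for consequent in items
    ]


# An itemset of size 3 yields candidate rules with 2 antecedents and 1 consequent.
candidates = single_consequent_rules({"A", "B", "C"})
for antecedents, consequent in candidates:
    print(antecedents, "->", consequent)
```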

min_support: number | integer = 10

Minimum Support. Minimum support of a rule antecedent. If the value is less than 1 it is interpreted as a proportion. Otherwise it is expected to be a positive integer representing a count of sessions. Create rule A->B only if A occurred in at least this many sessions.
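The two interpretations of `min_support` can be illustrated with a small Python sketch. The helper `resolve_min_support` is hypothetical, and it assumes that a proportion refers to the total number of sessions; how the step rounds fractional results is not specified here.

```python
def resolve_min_support(min_support, n_sessions):
    """Convert min_support into an absolute session count (hypothetical helper)."""
    if min_support < 1:
        # Values below 1 are read as a proportion (assumed: of all sessions).
        return min_support * n_sessions
    # Any other value is taken as an absolute count of sessions.
    return min_support


print(resolve_min_support(0.25, 200))  # 50.0
print(resolve_min_support(7, 200))     # 7
```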

min_confidence: number = 20

Minimum Confidence. Expressed as a percentage. Include link A->B only if B occurred in at least this percentage of sessions also containing A.

Range: `0 ≤ min_confidence ≤ 100`

min_lift: number | null

Minimum Lift. Expressed as a multiplier/ratio. Include link A->B only if A makes the presence of B in the same session at least this many times more likely.

weight_metric: string = "rule_lift_abs"

Metric for link weight. Which association rule metric to use as the weight of links in the network generated by this step.

Must be one of: `"itemset_support_abs"`, `"itemset_support_pct"`, `"antecedent_support_abs"`, `"antecedent_support_pct"`, `"consequent_support_abs"`, `"consequent_support_pct"`, `"rule_confidence_pct"`, `"rule_lift_abs"`, `"rule_lift_pct"`

link_rules: boolean = True

Whether to link items to rules. Otherwise, a product (antecedent) will be linked only to other products (consequent).

link_top_n: integer | null

Only keep N links with largest weight. This applies individually to each node in the network, filtering its outgoing links to keep only the first N by weight. The weight itself is selected using the `weight_metric` parameter, i.e. it corresponds to one of the association rule metrics (support, confidence etc.). If `null`, all links will be kept.
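The per-node filtering can be sketched in Python as below. This is an illustration under assumptions, not the step's code: the function `keep_top_n_links` and the `(source, target, weight)` tuple layout are hypothetical, and Python's `None` stands in for `null`.

```python
from collections import defaultdict
import heapq


def keep_top_n_links(links, n=None):
    """links: list of (source, target, weight) tuples; n=None keeps all links."""
    if n is None:
        return list(links)
    by_source = defaultdict(list)
    for link in links:
        by_source[link[0]].append(link)
    kept = []
    for outgoing in by_source.values():
        # Keep the n heaviest outgoing links of this node.
        kept.extend(heapq.nlargest(n, outgoing, key=lambda link: link[2]))
    return kept


links = [("A", "B", 3.0), ("A", "C", 1.5), ("A", "D", 2.0), ("B", "C", 4.0)]
top = keep_top_n_links(links, n=2)
print(top)  # [('A', 'B', 3.0), ('A', 'D', 2.0), ('B', 'C', 4.0)]
```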

item_aggregations: object | null

Definition of desired aggregations for (consequent) items. A dictionary mapping original columns to new aggregated columns, specifying an aggregation function for each.
*Aggregations* are functions that reduce all the values in a particular column of a single item/product
to a single summary value for that item/product. E.g. a `sum` aggregation of column A calculates a single
total by adding up all the values in A belonging to each item.

Possible aggregation functions accepted as `func` parameters are:

- `n`, `size` or `count`: calculate number of rows in group
- `sum`: sum total of values
- `mean`: take mean of values
- `max`: take max of values
- `min`: take min of values
- `first`: take first item found
- `last`: take last item found
- `unique`: collect a list of unique values
- `n_unique`: count the number of unique values
- `list`: collect a list of all values
- `concatenate`: convert all values to text and concatenate them into one long text
- `concat_lists`: concatenate lists in all rows into a single larger list
- `count_where`: number of rows in which the column matches a value, needs parameter `value` with the value that you want to count
- `percent_where`: percentage of the column where the column matches a value, needs parameter `value` with the value that you want to count

Note that in the case of `count_where` and `percent_where` an additional `value` parameter is required.
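To illustrate what some of these aggregations compute, the following Python sketch applies a few of them to the values of one column for a single item. The column name, the data and the helper functions are hypothetical; the sketch mirrors the descriptions above rather than the step's implementation.

```python
def count_where(values, value):
    """Number of rows whose entry equals `value`."""
    return sum(v == value for v in values)


def percent_where(values, value):
    """Percentage of rows whose entry equals `value`."""
    return 100 * count_where(values, value) / len(values)


# Hypothetical "channel" column for all rows belonging to one item.
channel = ["web", "store", "web", "web"]

summary = {
    "n": len(channel),                                # number of rows in group
    "n_unique": len(set(channel)),                    # number of unique values
    "first": channel[0],                              # first value found
    "count_where": count_where(channel, "web"),       # rows equal to "web"
    "percent_where": percent_where(channel, "web"),   # share of rows equal to "web"
}
print(summary)
```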


rule_aggregations: object | null

Definition of desired aggregations for rules (all items in rule). A dictionary mapping original columns to new aggregated columns, specifying an aggregation function for each.
*Aggregations* are functions that reduce all the values in a particular column of a single item/product
to a single summary value for that item/product. E.g. a `sum` aggregation of column A calculates a single
total by adding up all the values in A belonging to each item.

Possible aggregation functions accepted as `func` parameters are:

- `n`, `size` or `count`: calculate number of rows in group
- `sum`: sum total of values
- `mean`: take mean of values
- `max`: take max of values
- `min`: take min of values
- `first`: take first item found
- `last`: take last item found
- `unique`: collect a list of unique values
- `n_unique`: count the number of unique values
- `list`: collect a list of all values
- `concatenate`: convert all values to text and concatenate them into one long text
- `concat_lists`: concatenate lists in all rows into a single larger list
- `count_where`: number of rows in which the column matches a value, needs parameter `value` with the value that you want to count
- `percent_where`: percentage of the column where the column matches a value, needs parameter `value` with the value that you want to count

Note that in the case of `count_where` and `percent_where` an additional `value` parameter is required.