association_rules

This is a form of market basket analysis. It analyses items (products) that occur together unusually frequently in a set of transactions (baskets). The step creates an association rule, such as A->B, between items A and B if the presence of A makes the presence of B in the same session N times more likely. For further details about the algorithm, see e.g. association rule learning.

Usage

The following example shows how the step can be used in a recipe.

Examples

The following call creates rules between pairs of items A and B, if:

A occurs in at least 7 sessions
B occurs in at least 25% of sessions containing A
The presence of A in a session makes the presence of B in the same session at least twice as likely.

Note that the last condition ties the confidence of a rule to the background frequency of B: a minimum lift of 2 means that the frequency of B in sessions already containing A must be at least twice the frequency of B across all sessions. For a rule that only just meets the 25% confidence threshold, the overall frequency of B must therefore be at most 12.5% (half of 25%).

As an example, the percentage of shopping baskets containing milk (item B) may be 10%. However, amongst those baskets already containing cereals, the percentage containing milk is likely to be higher. If milk occurred in e.g. 30% of baskets also containing cereals, then the lift of the rule cereal->milk would be 3: buying cereal makes buying milk 3 times more likely.

association_rules(transactions, {
  "item_id": "product_id",
  "session_id": "order_id",
  "min_support": 7,
  "min_confidence": 25,
  "min_lift": 2
}) -> (rules)
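
To make the three thresholds concrete, the following Python sketch recomputes support, confidence and lift for the cereal->milk example on made-up data. It is an illustration only, not the step's implementation; the function name rule_metrics and the toy baskets are assumptions introduced here.

```python
def rule_metrics(baskets, antecedent, consequent):
    """Return (antecedent support, confidence in %, lift) for antecedent -> consequent.

    Hypothetical helper for illustration; assumes both items occur at least once.
    """
    n_sessions = len(baskets)
    n_a = sum(1 for b in baskets if antecedent in b)
    n_b = sum(1 for b in baskets if consequent in b)
    n_both = sum(1 for b in baskets if antecedent in b and consequent in b)
    confidence_pct = 100 * n_both / n_a           # share of A-sessions that also contain B
    lift = (n_both * n_sessions) / (n_a * n_b)    # P(B | A) / P(B)
    return n_a, confidence_pct, lift


# Toy data (invented): 100 baskets, milk in 10% overall, in 30% of cereal baskets.
baskets = (
    [{"cereal", "milk"}] * 3
    + [{"cereal"}] * 7
    + [{"milk"}] * 7
    + [{"bread"}] * 83
)

support, confidence_pct, lift = rule_metrics(baskets, "cereal", "milk")

# Same thresholds as in the call above: min_support 7, min_confidence 25, min_lift 2.
passes = support >= 7 and confidence_pct >= 25 and lift >= 2
```

With this data cereal appears in 10 sessions, milk appears in 30% of those, and the lift is 3, so the rule cereal->milk would pass all three conditions.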

Inputs & Outputs

The following are the inputs expected by the step and the outputs it produces. These are generally columns (ds.first_name), datasets (ds or ds[["first_name", "last_name"]]) or models (referenced by name e.g. "churn-clf").

Inputs

Outputs

Configuration

The following parameters can be used to configure the behaviour of the step by including them in a json object as the last “input” to the step, i.e. step(..., {"param": "value", ...}) -> (output).

Parameters

item_id

string (transactions.column)

required

Name of column uniquely identifying all items/products.

session_id

[string, array[string]]

required

Name(s) of column(s) uniquely identifying all sessions/baskets/orders.

item_label

string (transactions.column)

required

Column used to label items in a user-friendly manner.

itemset_min

integer

default:"2"

Minimum size of itemsets to identify. E.g. an itemset size of 3 means association rules will have 2 antecedents (e.g. A, B) and 1 consequent (C), resulting in rules of the form (A, B) -> C. The step currently generates only single items as consequents. Values must be in the following range:

2 ≤ itemset_min ≤ 5

itemset_max

integer

default:"3"

Maximum size of itemsets to identify. E.g. an itemset size of 3 means association rules will have 2 antecedents (e.g. A, B) and 1 consequent (C), resulting in rules of the form (A, B) -> C. The step currently generates only single items as consequents. Values must be in the following range:

2 ≤ itemset_max ≤ 5
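
The relationship between itemset size and rule shape can be sketched in Python as follows. This is only an illustration of the "single consequent" form described above; the helper name single_consequent_rules is hypothetical and the step may order or filter its rules differently.

```python
def single_consequent_rules(itemset):
    """List every rule (antecedents) -> consequent with exactly one consequent.

    Hypothetical helper: an itemset of size k yields k candidate rules,
    each with k - 1 antecedents.
    """
    items = sorted(itemset)
    return [(tuple(i for i in items if i != c), c) for c in items]


rules = single_consequent_rules({"A", "B", "C"})
# An itemset of size 3 gives rules with 2 antecedents and 1 consequent each.
```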

min_support

[number, integer]

default:"10"

Minimum Support. Minimum support of a rule antecedent. If the value is < 1, it is taken as a proportion; otherwise it is expected to be a positive integer representing a count. Rule A->B is created only if A occurred in at least this many sessions.
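
The two interpretations of min_support can be sketched in Python like this. The function meets_min_support is a hypothetical helper, and it assumes that a proportion is measured against the total number of sessions, which the description above does not state explicitly.

```python
def meets_min_support(antecedent_count, n_sessions, min_support):
    """Check an antecedent's session count against min_support.

    Assumption: values below 1 are a proportion of all sessions;
    any other value is an absolute session count.
    """
    if min_support < 1:
        return antecedent_count / n_sessions >= min_support
    return antecedent_count >= min_support


as_proportion = meets_min_support(10, 40, 0.25)   # 10 of 40 sessions is 25%
as_count = meets_min_support(6, 40, 7)            # 6 sessions is fewer than 7
```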

min_confidence

number

default:"20"

Minimum Confidence. Expressed as a percentage. Include link A->B only if B occurred in at least this percentage of sessions also containing A. Values must be in the following range:

0 ≤ min_confidence ≤ 100

min_lift

[number, null]

Minimum Lift. Expressed as a multiplier/ratio. Include link A->B only if A makes the presence of B in the same session at least this many times more likely.

weight_metric

string

default:"rule_lift_abs"

Metric for link weight. Which association rule metric to use as the weight of links in the network generated by this step. Values must be one of the following:

itemset_support_abs
itemset_support_pct
antecedent_support_abs
antecedent_support_pct
consequent_support_abs
consequent_support_pct
rule_confidence_pct
rule_lift_abs
rule_lift_pct

link_rules

boolean

default:"true"

Whether to link items to rules. Otherwise, a product (antecedent) will be linked only to other products (consequent).

link_top_n

[integer, null]

Only keep the N links with the largest weight. This applies individually to each node in the network, filtering its outgoing links to keep only the first N by weight. The weight itself is selected using the weight_metric parameter, i.e. it corresponds to one of the association rule metrics (support, confidence etc.). If null, all links will be kept.
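
The per-node filtering can be sketched in Python as follows. This is a minimal illustration under assumptions, not the step's code: links are represented here as hypothetical (source, target, weight) tuples, and ties are broken by input order.

```python
from collections import defaultdict


def keep_top_n_links(links, n):
    """Keep the n heaviest outgoing links of every source node.

    links: iterable of (source, target, weight) tuples (assumed layout).
    n: number of links to keep per source, or None to keep everything.
    """
    if n is None:
        return list(links)
    by_source = defaultdict(list)
    for link in links:
        by_source[link[0]].append(link)
    kept = []
    for source_links in by_source.values():
        # Sort each node's outgoing links by weight, largest first.
        source_links.sort(key=lambda link: link[2], reverse=True)
        kept.extend(source_links[:n])
    return kept


links = [
    ("cereal", "milk", 3.0),
    ("cereal", "bananas", 2.1),
    ("cereal", "coffee", 2.5),
    ("bread", "butter", 4.0),
]
top_two = keep_top_n_links(links, 2)
```

Here the weakest of the three cereal links is dropped, while the single bread link is kept, because the limit applies to each node separately rather than to the network as a whole.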

item_aggregations

[object, null]

Definition of desired aggregations for (consequent) items. A dictionary mapping original columns to new aggregated columns, specifying an aggregation function for each. Aggregations are functions that reduce all the values in a particular column of a single item/product to a single summary value for that item/product. E.g. a sum aggregation of column A calculates a single total by adding up all the values in A belonging to each item. Possible aggregation functions accepted as func parameters are:

n, size or count: calculate number of rows in group
sum: sum total of values
mean: take mean of values
max: take max of values
min: take min of values
first: take first item found
last: take last item found
unique: collect a list of unique values
n_unique: count the number of unique values
list: collect a list of all values
concatenate: convert all values to text and concatenate them into one long text
concat_lists: concatenate lists in all rows into a single larger list
count_where: number of rows in which the column matches a value; needs a value parameter with the value that you want to count
percent_where: percentage of rows in which the column matches a value; needs a value parameter with the value that you want to count

Note that in the case of count_where and percent_where an additional value parameter is required.
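
As a rough illustration of what the two value-based aggregations compute for one item's rows, consider the Python sketch below. The function bodies and the sample column are assumptions made for this example; in particular, it assumes percent_where reports on a 0 to 100 scale, which the description does not specify.

```python
def count_where(values, value):
    """Number of rows whose entry equals `value`."""
    return sum(1 for v in values if v == value)


def percent_where(values, value):
    """Percentage (assumed 0-100 scale) of rows whose entry equals `value`."""
    return 100 * count_where(values, value) / len(values)


# Hypothetical "channel" column for the rows belonging to a single item.
channel = ["web", "store", "web", "web"]

n_web = count_where(channel, "web")
pct_web = percent_where(channel, "web")
n_channels = len(set(channel))   # what an n_unique aggregation would report
```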

rule_aggregations

[object, null]

Definition of desired aggregations for rules (all items in rule). A dictionary mapping original columns to new aggregated columns, specifying an aggregation function for each. Aggregations are functions that reduce all the values in a particular column of a single item/product to a single summary value for that item/product. E.g. a sum aggregation of column A calculates a single total by adding up all the values in A belonging to each item. Possible aggregation functions accepted as func parameters are:

n, size or count: calculate number of rows in group
sum: sum total of values
mean: take mean of values
max: take max of values
min: take min of values
first: take first item found
last: take last item found
unique: collect a list of unique values
n_unique: count the number of unique values
list: collect a list of all values
concatenate: convert all values to text and concatenate them into one long text
concat_lists: concatenate lists in all rows into a single larger list
count_where: number of rows in which the column matches a value; needs a value parameter with the value that you want to count
percent_where: percentage of rows in which the column matches a value; needs a value parameter with the value that you want to count

Note that in the case of count_where and percent_where an additional value parameter is required.
