This is essentially the word2vec algorithm applied to arbitrary lists of items. Word2vec computes vectors representing words such that nearby (similar) vectors represent words that often occur in a similar context. Item2vec refers to using the exact same algorithm but applying it to arbitrary lists of items in which the order of items has a comparable interpretation to words in a sentence (the items may be categories, tags, IDs, etc.). Note that if the order of items in a list (session/basket etc.) is not important, and you simply want item vectors to be similar if the corresponding items usually occur together in the same list, use the window parameter (see below) with a value of “all” (see Example 2 below). We use gensim to train the item2vec model, so for further details also see its word2vec page.
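For intuition, here is a minimal sketch of roughly equivalent training code using gensim directly. The basket data is made up for illustration, and the keyword names follow gensim ≥ 4, where vector_size and epochs correspond to this step's size and iter parameters.

from gensim.models import Word2Vec

# Each "sentence" is one session/basket: a list of item IDs.
baskets = [
    ["prod_1", "prod_7", "prod_3"],
    ["prod_7", "prod_2"],
    ["prod_3", "prod_1", "prod_9", "prod_7"],
]

model = Word2Vec(
    sentences=baskets,
    vector_size=48,  # this step's "size"
    sg=1,            # 1 = skip-gram, 0 = CBOW
    negative=20,     # negative samples drawn per positive example
    alpha=0.025,     # initial learning rate
    window=5,        # context window size
    min_count=1,     # the step defaults to 3; 1 here so the toy data survives
    epochs=10,       # this step's "iter"
    sample=0,        # no downsampling of frequent items
)

item_vector = model.wv["prod_7"]  # one 48-dimensional vector per item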

Usage

The following example shows how the step can be used in a recipe.

Examples

Example 1
The following uses default parameter values only, and thus would be equivalent to using the step without specifying any parameters.
embed_items(products.id, baskets.product_ids, {
  "size": 48,
  "sg": 1,
  "negative": 20,
  "alpha": 0.025,
  "window": 5,
  "min_count": 3,
  "iter": 10,
  "sample": 0
}) -> (products.embedding)
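
Example 2

As noted above, when the order of items within a session carries no meaning (e.g. shopping baskets), window can be set to “all”, so that every item in a session is treated as context for every other item. A hypothetical call using the same columns (unspecified parameters keep their defaults):

embed_items(products.id, baskets.product_ids, {
  "window": "all"
}) -> (products.embedding)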

Inputs & Outputs

The following are the inputs expected by the step and the outputs it produces. These are generally columns (ds.first_name), datasets (ds or ds[["first_name", "last_name"]]) or models (referenced by name e.g. "churn-clf").
items
column[category|number]
required
A column containing item identifiers (IDs).
sessions
column[list[category]|list[number]]
required
A column containing lists, where each row is a session, and each session a list of item identifiers (IDs) compatible with the values of the items column.
embeddings
column[list[number]]
required
A list column containing item embeddings in the same order as the items input column. Embeddings are lists of numbers (vectors).
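
For illustration, hypothetical inputs and the resulting output might look as follows (embedding values abbreviated):

products.id:          ["A", "B", "C"]
baskets.product_ids:  [["A", "B"], ["B", "C", "A"], ["C", "A"]]
products.embedding:   [[0.12, -0.03, ...], [0.08, 0.41, ...], [-0.22, 0.10, ...]]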

Configuration

The following parameters can be used to configure the behaviour of the step by including them in a json object as the last “input” to the step, i.e. step(..., {"param": "value", ...}) -> (output).

Parameters

size
integer
default:"48"
Length of resulting embedding vectors. Values must be in the following range:
1 ≤ size < inf
sg
integer
default:"1"
Whether to use the skip-gram or CBOW algorithm. Set this to 1 for skip-gram, and 0 for CBOW. Values must be in the following range:
0 ≤ sg ≤ 1
negative
integer
default:"20"
Update maximum for negative sampling. Only this many item vectors will be updated per positive example.
alpha
number
default:"0.025"
Initial learning rate. Values must be in the following range:
0 ≤ alpha ≤ 1
window
[integer, string]
default:"5"
Size of the context window. Must be either an integer (the number of neighbouring items to consider), or any of “auto”, “max” or “all”, in which case the window is equal to the whole list/session/basket.
  • integer: values must be in the range 1 ≤ window < inf
  • string: one of “auto”, “max” or “all”
min_count
integer
default:"3"
Minimum count of an item in the dataset. If an item occurs fewer than this many times, it will be ignored. Values must be in the following range:
1 ≤ min_count < inf
iter
integer
default:"10"
Iterations. How many epochs to run the algorithm for. Values must be in the following range:
1 ≤ iter < inf
sample
number
default:"0"
Proportion of most-common items to filter out (equivalent to “stop words”). Values must be in the following range:
0 ≤ sample ≤ 1
normalize
boolean
default:"true"
Whether to return normalized item vectors.
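
The exact norm is not spelled out here; the following is a minimal sketch assuming “normalized” means unit (L2) length, which is what gensim's KeyedVectors.get_vector(key, norm=True) returns.

import numpy as np

def normalize(vec: np.ndarray) -> np.ndarray:
    # Assumption: "normalized" means scaled to unit L2 length.
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

v = normalize(np.array([3.0, 4.0]))  # -> [0.6, 0.8]
assert np.isclose(np.linalg.norm(v), 1.0)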