Embed items

vectorize · word2vec · item2vec · model

Trains an item2vec model on provided lists of items (or sentences of words, etc.).

This is essentially the word2vec algorithm applied to arbitrary lists of items. Word2vec computes vectors representing words such that nearby (similar) vectors represent words that are often found in a similar context. Item2vec refers to using the exact same algorithm but applying it to arbitrary lists of items in which the order of items has a comparable interpretation to words in a sentence (the items may be categories, tags, IDs etc.).
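What "nearby" means for two vectors is not spelled out above. One common choice is cosine similarity, sketched below in plain Python. The three-dimensional vectors and item names are made up for illustration, and `cosine_similarity` and `most_similar` are hypothetical helpers, not part of this step.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


def most_similar(item, embeddings):
    """Return the other item whose vector is closest to `item`'s vector."""
    others = (name for name in embeddings if name != item)
    return max(others, key=lambda name: cosine_similarity(embeddings[item], embeddings[name]))


# Made-up embeddings; real ones would have `size` dimensions.
embeddings = {
    "espresso": [0.9, 0.1, 0.0],
    "cappuccino": [0.8, 0.2, 0.1],
    "shampoo": [0.0, 0.1, 0.9],
}

print(most_similar("espresso", embeddings))
```

Items that tend to appear in similar contexts should end up with a higher similarity than unrelated ones.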

Note that if the order of items in a list (session, basket, etc.) is not important, and you simply want item vectors to be similar when the corresponding items usually occur together in the same list, set the window parameter (see below) to "all".

We use gensim to train the item2vec model, so for further details also see its word2vec page.
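When cross-referencing gensim's documentation, note that gensim 4.0 and later renamed two of the parameters used here: `size` became `vector_size` and `iter` became `epochs`. The sketch below translates this step's parameter names into those keyword names. `to_gensim_kwargs` is a hypothetical helper, and resolving a string window to the length of the longest session is an assumption about how "all" could be implemented, not a description of what the step does internally.

```python
# Names used by this step -> keyword names in gensim >= 4.0.
RENAMED = {"size": "vector_size", "iter": "epochs"}


def to_gensim_kwargs(params, sessions):
    """Translate this step's parameters into gensim 4 style keyword arguments."""
    kwargs = {}
    for name, value in params.items():
        if name == "window" and isinstance(value, str):
            # Assumption: "auto", "max" and "all" are resolved to the length
            # of the longest session, so every item sees the whole list.
            value = max((len(session) for session in sessions), default=1)
        kwargs[RENAMED.get(name, name)] = value
    return kwargs


sessions = [["a", "b", "c"], ["a", "b"]]
params = {"size": 48, "sg": 1, "window": "all", "iter": 10}
print(to_gensim_kwargs(params, sessions))
```

With gensim 4 installed, the resulting dictionary could be passed on as `Word2Vec(sentences=sessions, **kwargs)`.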

Example

The following uses default parameter values only, and thus would be equivalent to using the step without specifying any parameters.

embed_items(products.id, baskets.product_ids, {
  "size": 48,
  "sg": 1,
  "negative": 20,
  "alpha": 0.025,
  "window": 5,
  "min_count": 3,
  "iter": 10,
  "sample": 0
}) -> (products.embedding)

Usage

The following are the step's expected inputs and outputs and their specific types.

embed_items(
    items: category|number,
    sessions: list[category]|list[number], 
    {
        "param": value
    }
) -> (embeddings: list[number])

where the object {"param": value} is optional in most cases and, if present, may contain any of the parameters described in the Parameters section below.

Inputs


items: column:category|number

A column containing item identifiers (IDs).


sessions: column:list[category]|list[number]

A column containing lists, where each row is a session, and each session a list of item identifiers (IDs) compatible with the values of the items column.

Outputs


embeddings: column:list[number]

A list column containing item embeddings in the same order as the items input column. Embeddings are lists of numbers (vectors).
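The alignment between the items column and the output can be pictured as a lookup per row. This is a minimal sketch: `align_embeddings` is a hypothetical helper, and returning `None` for items without a trained vector (for example those dropped by min_count) is an assumption, since the behaviour for such items is not stated here.

```python
def align_embeddings(items, vectors):
    """Return one embedding per row of `items`, in the same order.

    `vectors` maps item IDs to trained embeddings. Items without a vector
    yield None (an assumption made for this sketch).
    """
    return [vectors.get(item) for item in items]


# Made-up 2-dimensional vectors for illustration.
vectors = {"p1": [0.1, 0.2], "p3": [0.5, 0.6]}
items = ["p3", "p2", "p1"]

print(align_embeddings(items, vectors))
```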

Parameters


Also see gensim's word2vec reference for further detail about the underlying algorithm's parameters.


size: integer = 48

Length of resulting embedding vectors.

Range: 1 ≤ size < inf


sg: integer = 1

Whether to use the skip-gram or CBOW algorithm. Set this to 1 for skip-gram, and 0 for CBOW.

Range: 0 ≤ sg ≤ 1
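The difference between the two modes can be illustrated by the training examples each one derives from a session. In this simplified sketch, skip-gram predicts every context item from the centre item, while CBOW predicts the centre item from its whole context at once. `training_examples` is a hypothetical helper, and a real implementation adds details omitted here, such as negative sampling.

```python
def training_examples(session, window, sg):
    """Build (input, output) training examples from one session."""
    examples = []
    for i, target in enumerate(session):
        # Up to `window` neighbours on each side of position i.
        context = session[max(0, i - window):i] + session[i + 1:i + 1 + window]
        if not context:
            continue
        if sg == 1:
            # Skip-gram: one example per (centre item, context item) pair.
            examples.extend((target, neighbour) for neighbour in context)
        else:
            # CBOW: one example per position, all context items -> centre item.
            examples.append((tuple(context), target))
    return examples


session = ["a", "b", "c"]
print(training_examples(session, window=1, sg=1))
print(training_examples(session, window=1, sg=0))
```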


negative: integer = 20

Number of negative samples. With negative sampling, each training update adjusts only this many randomly drawn "noise" item vectors instead of all of them.


alpha: number = 0.025

Initial learning rate.

Range: 0 ≤ alpha ≤ 1


window: integer | string = 5

Size of the context window. Must be either an integer (the number of neighbouring items to consider on each side), or one of the strings "auto", "max" or "all", in which case the window is equal to the whole list/session/basket.

If a string, must be one of: "auto", "max", "all"
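To make the effect of the window concrete, the sketch below lists which items count as context for each other under an integer window and under "all". `context_pairs` is a hypothetical helper written for illustration only. It treats the three string values identically, as described above, by widening the window to the length of the session.

```python
def context_pairs(session, window):
    """List (item, context item) pairs that fall within the window."""
    if window in ("auto", "max", "all"):
        window = len(session)  # the whole session becomes the context
    pairs = []
    for i, item in enumerate(session):
        lo = max(0, i - window)
        hi = min(len(session), i + window + 1)
        pairs.extend((item, session[j]) for j in range(lo, hi) if j != i)
    return pairs


basket = ["bread", "milk", "eggs", "soap"]

narrow = context_pairs(basket, 1)     # direct neighbours only
whole = context_pairs(basket, "all")  # every other item in the basket

print(("bread", "soap") in narrow, ("bread", "soap") in whole)
```

With a window of 1, "bread" and "soap" never share a context. With "all" they do, which is the behaviour you want when the order within a basket carries no meaning.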


min_count: integer = 3

Minimum number of times an item must occur in the dataset. Items that occur fewer times than this are ignored.

Range: 1 ≤ min_count < inf
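The filtering can be sketched as a frequency count followed by a pass that drops rare items. `filter_rare_items` is a hypothetical helper. It assumes occurrences are counted across all sessions together, and that rare items are removed from sessions before training rather than replaced by a placeholder.

```python
from collections import Counter


def filter_rare_items(sessions, min_count):
    """Drop items that occur fewer than `min_count` times across all sessions."""
    counts = Counter(item for session in sessions for item in session)
    return [
        [item for item in session if counts[item] >= min_count]
        for session in sessions
    ]


sessions = [["a", "b"], ["a", "c"], ["a", "b"]]

# "c" occurs only once, so it is removed when min_count is 2.
print(filter_rare_items(sessions, min_count=2))
```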


iter: integer = 10

Number of iterations, i.e. how many epochs (passes over all sessions) to train for.

Range: 1 ≤ iter < inf


sample: number = 0

Controls how strongly the most common items are filtered out (comparable to removing "stop words"). With the default of 0, no items are filtered out.

Range: 0 ≤ sample ≤ 1