Lists of items may represent pages visited in a browsing session, shopping baskets and the products they contain, sentences of words, and so on. This step calculates an embedding vector for each list of items, such that two vectors are similar if their corresponding lists are similar. Similarity here is measured as an average over the individual items: we first calculate embedding vectors representing individual items (using word2vec), and then average over all items belonging to the same list/session.

As an example, consider a dataset containing shopping baskets. The step will first calculate embeddings for individual products. The resulting vectors will be similar if they represent products that are often bought together. For example, the vectors for sausages and hot dog buns may be more similar to each other than those representing shampoo and toys. Then, to arrive at an embedding vector for each basket, we simply average over all of its individual products. The result captures the similarity between baskets in terms of the mix of products they contain: the vectors representing baskets of people buying a significant amount of baby products will be more similar to each other than to vectors representing baskets of people buying products for a BBQ party.

To calculate individual item embeddings only, see the complementary embed_items step. We use gensim to train the item2vec model, so for further details also see its word2vec page.
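The following sketch illustrates the two-stage idea (item2vec followed by per-session averaging) using gensim's Word2Vec directly, outside of any recipe. The data, parameter values and helper function are illustrative assumptions only, not the step's actual implementation; note also that recent gensim versions (4.x) name the relevant parameters vector_size and epochs, corresponding to the step's "size" and "iter" below.

import numpy as np
from gensim.models import Word2Vec

# Each session is a list of items, e.g. the products in a shopping basket.
baskets = [
    ["sausages", "hot dog buns", "ketchup", "charcoal"],
    ["sausages", "hot dog buns", "beer"],
    ["diapers", "baby food", "shampoo"],
    ["diapers", "baby wipes", "toys"],
]

# Stage 1: item2vec. Items play the role of words and sessions the role of
# sentences, so items that often co-occur in sessions get similar vectors.
model = Word2Vec(
    sentences=baskets,
    vector_size=48,  # "size" in the step's configuration
    sg=1,            # Skip-Gram
    negative=20,
    window=5,
    min_count=1,     # keep every item in this tiny example
    epochs=10,       # "iter" in the step's configuration
)

# Stage 2: a session embedding is the average of its items' vectors.
def embed_session(items):
    vectors = [model.wv[item] for item in items if item in model.wv]
    return np.mean(vectors, axis=0)

basket_embeddings = [embed_session(b) for b in baskets]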

Usage

The following example shows how the step can be used in a recipe.

Examples

embed_sessions(baskets.products, {
  "size": 48,
  "sg": 1,
  "negative": 20,
  "alpha": 0.025,
  "window": 5,
  "min_count": 3,
  "iter": 10,
  "sample": 0,
  "workers": 3
}) -> (baskets.embedding)
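Since the resulting embedding column holds plain lists of numbers, it can be used like any other vector column once the step has run. As a generic illustration (not part of the step itself, and with the column access purely hypothetical), cosine similarity between two rows measures how similar the corresponding baskets are:

import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity between two embedding vectors (lists of numbers).
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# e.g. compare the embeddings of the first two baskets:
# cosine_similarity(embedding_column[0], embedding_column[1])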

Inputs & Outputs

The following are the inputs expected by the step and the outputs it produces. These are generally columns (ds.first_name), datasets (ds or ds[["first_name", "last_name"]]) or models (referenced by name e.g. "churn-clf").
sessions
column[list[category]|list[number]]
required
A column containing lists, where each row is a session, and each session a list of items.
embeddings
column[list[number]]
required
A column containing one embedding per session, in the same order as the sessions input column. Embeddings are lists of numbers.

Configuration

The following parameters can be used to configure the behaviour of the step by including them in a json object as the last “input” to the step, i.e. step(..., {"param": "value", ...}) -> (output).

Parameters

size
integer
default:"48"
Length of embedding vectors. Values must be in the following range:
1 ≤ size < inf
sg
integer
default:"1"
Use Skip-Gram or CBOW. Set this to 1 to use Skip-Gram, 0 for CBOW. Values must be in the following range:
0 ≤ sg ≤ 1
negative
integer
default:"20"
Update maximum for negative sampling: only this many "noise" item vectors will be updated per training example.
alpha
number
default:"0.025"
Initial learning rate. Values must be in the following range:
0 ≤ alpha ≤ 1
window
[integer, string]
default:"5"
Word context window. Must be either an integer or “auto”, “max” or “all”. If an integer, values must be in the following range:
1 ≤ window < inf
min_count
integer
default:"3"
Minimum number of occurrences of an item in the dataset; items occurring less often are filtered out. Values must be in the following range:
1 ≤ min_count < inf
iter
integer
default:"10"
Iterations. How many epochs to run the algorithm for. Values must be in the following range:
1 ≤ iter < inf
sample
number
default:"0"
Sample. Fraction of the most common items to filter out. Values must be in the following range:
0 ≤ sample ≤ 1
normalize
boolean
default:"true"
Whether to return normalized item vectors.
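For reference, "normalized" here typically means scaling each vector to unit (L2) length, so that dot products between vectors correspond to cosine similarities. The following is a minimal sketch of that operation under this assumption, independent of the step's internals:

import numpy as np

def l2_normalize(vector):
    # Scale a vector to unit length; leave all-zero vectors unchanged.
    v = np.asarray(vector, dtype=float)
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v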