An embedding vector is a numerical representation of a text, in which different components of the vector capture different dimensions of the text's meaning. Embeddings can be used, for example, to calculate the semantic similarity between pairs of texts (see link_embeddings, which uses such similarities to create a network of texts connected by similarity).

In this step, the embedding of each text is calculated as a (weighted) average of the embeddings of its individual words. The word embeddings are GloVe vectors, as provided by spaCy's language models.
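As a rough sketch, if a text consists of words w_1, …, w_n with GloVe vectors v(w_1), …, v(w_n), its embedding is

$$
e(\text{text}) = \frac{\sum_{i=1}^{n} \alpha_i \, v(w_i)}{\sum_{i=1}^{n} \alpha_i},
$$

where the weights α_i are all equal in the plain-average case. The particular weighting scheme is a configuration detail of the step and is not specified here.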

Use either the language parameter or a second input column to specify the language of the input texts. If neither is provided, the language will be inferred automatically from the texts themselves (which is equivalent to first creating a language column using the infer_language step).
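For example, assuming the step is invoked as embed_text and the dataset has columns named text and language (all names here are illustrative only), per-text languages can be passed via a second input column:

```
embed_text(ds.text, ds.language) -> (ds.embedding)
```

The fixed-language alternative, using the language parameter in the step's configuration object, is shown under Configuration below.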

Usage

The following example shows how the step can be used in a recipe.
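A minimal sketch, assuming the step is invoked as embed_text and the input texts live in a column called text (both names are illustrative):

```
embed_text(ds.text) -> (ds.embedding)
```

Since no language is specified here, it would be inferred automatically from the texts, as described above.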

Inputs & Outputs

The following are the inputs expected by the step and the outputs it produces. These are generally columns (ds.first_name), datasets (ds or ds[["first_name", "last_name"]]) or models (referenced by name, e.g. "churn-clf").

Configuration

The following parameters can be used to configure the behaviour of the step by including them in a JSON object as the last "input" to the step, i.e. step(..., {"param": "value", ...}) -> (output).
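For instance, the language parameter mentioned above could be set like this (step and column names are again illustrative, and a two-letter code such as "en" is assumed as the language format):

```
embed_text(ds.text, {"language": "en"}) -> (ds.embedding)
```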