An embedding vector is a numerical representation of a text, such that different components of the vector capture different dimensions of the text’s meaning. Embeddings can be used, for example, to calculate the semantic similarity between pairs of texts; see link_embeddings for a step that uses such similarities to create a network of texts connected by similarity.
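
A common way to quantify this similarity is the cosine of the angle between two embedding vectors u and v: cos(u, v) = (u · v) / (‖u‖ ‖v‖). A value close to 1 means the two texts are semantically similar, while a value near 0 means they are largely unrelated.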

In this step, text embeddings are calculated using pre-trained neural language models, in particular those based on the popular transformer architecture (e.g. BERT-based models).

Things to keep in mind

  • Unlike embed_text, which uses a different, appropriate spaCy model for each language in the text column, this step always uses a single model to calculate embeddings. This means the model should be multilingual if your texts mix languages, and that otherwise you need to choose the correct model for your (single) language (see the example after this list).
  • Each model is downloaded on the fly before the text is processed. This adds some lag to the step’s execution time (the bigger the model, the longer the download), though for a sufficiently large number of texts the time spent downloading should be insignificant in comparison. Note also, however, that the download, and therefore this step, may fail if the model publisher’s servers are unresponsive.
  • Since this step potentially supports tens, if not hundreds, of different models, we cannot provide support or advice regarding specific models.
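
For instance, with a column of mixed-language texts you would select a multilingual model, while a monolingual column can use a model specific to that language. As a minimal sketch, assuming a hypothetical step name embed_text_with_model and a hypothetical "model" parameter (paraphrase-multilingual-MiniLM-L12-v2 is a real multilingual sentence-transformers model):

    embed_text_with_model(ds.text, {"model": "paraphrase-multilingual-MiniLM-L12-v2"}) -> (ds.embedding)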

Usage

The following example shows how the step can be used in a recipe.
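
As a minimal sketch only (the step name embed_text_with_model, the "model" parameter and the output column are illustrative assumptions, not the confirmed API):

    embed_text_with_model(ds.text, {"model": "all-MiniLM-L6-v2"}) -> (ds.embedding)

This would calculate one embedding vector per text in the text column and store it in a new embedding column.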

Inputs & Outputs

The following are the inputs expected by the step and the outputs it produces. These are generally columns (ds.first_name), datasets (ds or ds[["first_name", "last_name"]]), or models (referenced by name, e.g. "churn-clf").
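
For illustration only (the step and column names here are hypothetical), the three kinds of references may appear in a recipe like this:

    step(ds.first_name) -> (ds.result)
    step(ds[["first_name", "last_name"]]) -> (ds.result)
    step(ds.features, "churn-clf") -> (ds.prediction)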

Configuration

The following parameters can be used to configure the behaviour of the step by including them in a JSON object as the last “input” to the step, i.e. step(..., {"param": "value", ...}) -> (output).
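
For example, reusing the hypothetical step name from above (neither "model" nor "batch_size" is a confirmed parameter; they only illustrate passing several parameters at once):

    embed_text_with_model(ds.text, {"model": "all-MiniLM-L6-v2", "batch_size": 32}) -> (ds.embedding)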