embed_text

An embedding vector is a numerical representation of a text in which the components of the vector capture different dimensions of the text’s meaning. Embeddings can be used, for example, to calculate the semantic similarity between pairs of texts (see link_embeddings, which creates a network of texts connected by similarity). In this step, the embedding of a text is calculated as the (optionally weighted) average of the embeddings of its individual words; the word embeddings are GloVe vectors, as provided by spaCy’s language models.

Use either the language parameter or a second input column to specify the language of the input texts. If neither is provided, the language is inferred automatically from the texts themselves, which is equivalent to first creating a language column using the infer_language step.
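To make the averaging idea concrete, here is a minimal Python sketch. It is not the step's actual implementation: the three-dimensional TOY_VECTORS table, the embed and cosine helpers, and the per-word weights are all hypothetical stand-ins for the GloVe vectors and the weighting scheme the step uses internally.

```python
import math

# Hypothetical toy word vectors; real GloVe vectors have hundreds of dimensions.
TOY_VECTORS = {
    "apple": [1.0, 0.0, 0.0],
    "releases": [0.0, 1.0, 0.0],
    "phone": [0.0, 0.0, 1.0],
}


def embed(tokens, weights=None):
    """Return the weighted average of the vectors of all known tokens.

    Tokens without a vector are skipped; tokens without an explicit
    weight default to a weight of 1.0 (a plain average).
    """
    weights = weights or {}
    dim = len(next(iter(TOY_VECTORS.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for tok in tokens:
        vec = TOY_VECTORS.get(tok)
        if vec is None:
            continue
        w = weights.get(tok, 1.0)
        total = [t + w * v for t, v in zip(total, vec)]
        weight_sum += w
    if weight_sum == 0:
        return total  # no known tokens: all-zero vector
    return [t / weight_sum for t in total]


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


tokens = ["apple", "releases", "phone"]
plain = embed(tokens)                             # every word counts equally
weighted = embed(tokens, weights={"apple": 2.0})  # assumed: entity counts double
similarity = cosine(plain, weighted)
```

In this sketch, giving "apple" a larger weight pulls the text's embedding towards that word's vector, which is the general effect of emphasizing entities over regular words.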

Usage

The following example shows how the step can be used in a recipe.

Examples

To calculate embeddings in a way that emphasizes entities (recognized products, people etc.) over regular words:

embed_text(ds.text, ds.lang, {"weighted": true}) -> (ds.embedding)

Inputs & Outputs

The following are the inputs expected by the step and the outputs it produces. These are generally columns (ds.first_name), datasets (ds or ds[["first_name", "last_name"]]) or models (referenced by name e.g. "churn-clf").

Inputs

Outputs

Configuration

The following parameters can be used to configure the behaviour of the step by including them in a json object as the last “input” to the step, i.e. step(..., {"param": "value", ...}) -> (output).

Parameters

embedding

object

Configure how embeddings are calculated. Toggle word vector weighting and normalization.

Properties

extended_language_support

boolean

default:"false"

Whether to enable support for additional languages. By default, Arabic (“ar”), Catalan (“ca”), Basque (“eu”), and Turkish (“tr”) are not enabled, since they are supported only by a different class of language models (Stanford NLP’s Stanza) that is much slower than the spaCy models used for the other languages. This parameter can be used to enable them.

min_language_freq

[number, integer]

default:"0.02"

Minimum number (or proportion) of texts a language must have to be included in processing. Texts in a language with fewer documents than this will be ignored. This can speed up processing when the input languages are noisy and when ignoring languages with only a small number of documents is acceptable. Values smaller than 1 are interpreted as a proportion of all texts, and values greater than or equal to 1 as an absolute number of documents.
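The threshold rule described above can be sketched in Python as follows. This only illustrates how the value might be interpreted, not the step's own code: the languages_to_keep function is hypothetical, and treating a language that sits exactly on the threshold as kept is an assumption.

```python
from collections import Counter


def languages_to_keep(langs, min_language_freq=0.02):
    """Return the set of languages frequent enough to be processed.

    langs is a list with one language code per text. Values below 1 are
    read as a proportion of all texts, values of 1 or more as an
    absolute number of documents.
    """
    counts = Counter(langs)
    if min_language_freq < 1:
        threshold = min_language_freq * len(langs)
    else:
        threshold = min_language_freq
    # Assumption: a language exactly at the threshold is kept.
    return {lang for lang, n in counts.items() if n >= threshold}


# 100 texts: 96 English, 3 Spanish, 1 German.
langs = ["en"] * 96 + ["es"] * 3 + ["de"] * 1

by_proportion = languages_to_keep(langs, 0.02)  # needs at least 2% of texts
by_count = languages_to_keep(langs, 5)          # needs at least 5 texts
```

With the default proportion of 0.02, the single German text falls below the threshold and would be ignored, while an absolute threshold of 5 would also drop the three Spanish texts.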

Options

language

[string, null]

The language of the input texts. If all texts are in the same language, it can be specified here instead of passing it as an input column. The language will be used to identify the correct spaCy model to parse and analyze the texts. For allowed values, see the comment regarding the lang column above.
