Embed text

NLP · vectorize · text · word2vec · GloVe · model

Parse and calculate a (word-averaged) embedding vector for each text.

An embedding vector is a numerical representation of a text, such that different numerical components of the vector capture different dimensions of the text's meaning. Embeddings can be used, for example, to calculate the semantic similarity between pairs of texts (see, for instance, link_embeddings, which can be used to create a network of texts connected by similarity).
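As a rough illustration of how embeddings support similarity calculations, the following Python sketch computes cosine similarity between toy vectors. Cosine similarity is a common choice for this purpose; whether link_embeddings uses exactly this measure is an assumption, and the function name and the three-dimensional vectors are made up for the example.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)

# Toy embeddings, invented for illustration only
emb_a = [1.0, 0.0, 0.0]
emb_b = [0.0, 1.0, 0.0]
emb_c = [2.0, 0.0, 0.0]

same_direction = cosine_similarity(emb_a, emb_c)  # vectors point the same way
orthogonal = cosine_similarity(emb_a, emb_b)      # vectors share no component
```

Vectors pointing in the same direction score close to 1, while vectors with no shared components score 0.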

In this step, embeddings of texts are calculated as (weighted) averages of the embeddings of each text's individual words (the individual word embeddings are GloVe vectors, as provided by spaCy's language models).
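The averaging described above can be sketched in plain Python. This is only a minimal illustration: the step's actual weighting scheme is not documented here, so the function name, the toy two-dimensional vectors and the weight of 3.0 for an entity are all assumptions.

```python
def weighted_average_embedding(word_vectors, weights=None):
    """Average word vectors, optionally giving some words more weight."""
    if not word_vectors:
        raise ValueError("need at least one word vector")
    if weights is None:
        weights = [1.0] * len(word_vectors)  # plain, unweighted average
    if len(weights) != len(word_vectors):
        raise ValueError("need exactly one weight per word vector")
    total = sum(weights)
    if total == 0:
        raise ValueError("weights must not sum to zero")
    dim = len(word_vectors[0])
    return [
        sum(w * vec[i] for w, vec in zip(weights, word_vectors)) / total
        for i in range(dim)
    ]

# Two toy word vectors; pretend the first word is a recognized entity
vectors = [[1.0, 0.0], [0.0, 1.0]]

plain = weighted_average_embedding(vectors)
entity_boosted = weighted_average_embedding(vectors, weights=[3.0, 1.0])
```

With equal weights the result sits halfway between the two word vectors; with the hypothetical entity weight, it moves toward the entity's vector.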


To calculate embeddings in a way that emphasizes entities (recognized products, people etc.) over regular words:

embed_text(ds.text, ds.lang, {"weighted": true}) -> (ds.embedding)


The following are the step's expected inputs and outputs and their specific types.

embed_text(
    text: text,
    lang: category,
    {
        "param": value
    }
) -> (embedding: list[number])

where the object {"param": value} is optional in most cases and if present may contain any of the parameters described in the corresponding section below.


text: column:text

A text column to calculate embeddings for.

lang: column:category

A column identifying the languages of the corresponding texts. If the dataset doesn't contain such a column yet, it can be created using the infer_language step. Ideally, languages should be expressed as two-letter ISO 639-1 language codes, such as "en", "es" or "de" for English, Spanish or German respectively. We also detect fully spelled-out names such as "english", "German", "allemande" etc., but there is no guarantee that every possible spelling will be recognized, so ISO codes should be preferred.
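If a dataset contains spelled-out language names, one option is to convert them to ISO codes before running the step. The sketch below uses a small hypothetical lookup table; the step's own list of recognized names is not documented here, so both the table and the function name are assumptions.

```python
# Hypothetical lookup table, for illustration only
NAME_TO_ISO = {
    "english": "en",
    "spanish": "es",
    "german": "de",
    "allemande": "de",
}
ISO_CODES = {"en", "es", "de"}

def to_iso_code(lang):
    """Return a two-letter code for a language label, or None if unknown."""
    key = lang.strip().lower()
    if key in ISO_CODES:
        return key  # already an ISO 639-1 code
    return NAME_TO_ISO.get(key)

codes = [to_iso_code(label) for label in ["German", "en", " english ", "klingon"]]
```

Labels that cannot be mapped come back as None, which makes them easy to inspect or filter out before embedding.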


embedding: column:list[number]

A column of embedding vectors capturing the meaning of each input text.


embedding: object

Configure how embeddings are calculated. Toggle word vector weighting and normalization.

Items in embedding

weighted: boolean = True

Whether entities have more influence on the embedding than regular words.

normalized: boolean = True

Whether to normalize embeddings. If enabled, each embedding will have a length (norm) of 1.0.
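To illustrate what normalization to unit length means, here is a minimal Python sketch. It assumes the norm in question is the Euclidean (L2) norm, which is not stated explicitly above; the function name is hypothetical.

```python
import math

def l2_normalize(vec):
    """Scale a vector so that its Euclidean length is 1.0."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)  # an all-zero vector cannot be rescaled
    return [x / norm for x in vec]

unit = l2_normalize([3.0, 4.0])
unit_length = math.sqrt(sum(x * x for x in unit))
```

The direction of the vector is preserved and only its length changes, so comparisons based on the angle between embeddings are unaffected.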

extended_language_support: boolean = False

Whether to enable support for additional languages. By default, Catalan and Basque are not enabled, since they're supported only by a different class of language models that is much slower than the rest. This parameter can be used to enable them.

min_language_freq: number | integer = 0.02

Minimum number (or proportion) of texts a language must have to be included in processing. Texts in any language with fewer documents than this will be ignored. This can speed up processing when the input languages are noisy and when it is acceptable to ignore languages that have only a small number of documents. Values smaller than 1 are interpreted as a proportion of all texts, and values greater than or equal to 1 as an absolute number of documents.
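The threshold rule can be sketched in Python as follows. This is an illustration of the rule as described here, not the step's actual implementation; the function name and the sample language counts are made up.

```python
from collections import Counter

def languages_to_keep(langs, min_language_freq=0.02):
    """Return the set of languages that meet the minimum frequency."""
    counts = Counter(langs)
    if min_language_freq < 1:
        # Values below 1 are a proportion of all texts
        min_count = min_language_freq * len(langs)
    else:
        # Values of 1 or more are an absolute number of documents
        min_count = min_language_freq
    return {lang for lang, n in counts.items() if n >= min_count}

# 100 texts: 95 English, 4 Spanish, 1 German
langs = ["en"] * 95 + ["es"] * 4 + ["de"] * 1

kept_by_proportion = languages_to_keep(langs, 0.02)  # needs at least 2 texts
kept_by_count = languages_to_keep(langs, 5)          # needs at least 5 texts
```

With the default of 0.02, the single German text falls below the threshold of 2 texts and is ignored; with an absolute threshold of 5, only English remains.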