Embed text with model

NLP · vectorize · text · word2vec · GloVe · model · transformer · hugging face

Use language models to calculate an embedding for each text in the provided column.

An embedding vector is a numerical representation of a text, such that different numerical components of the vector capture different dimensions of the text's meaning. Embeddings can be used, for example, to calculate the semantic similarity between pairs of texts. See link_embeddings, for example, which creates a network of texts connected by similarity.
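As a toy illustration (using NumPy, not this step itself), semantic similarity between two embedding vectors is typically measured by cosine similarity:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 4-dimensional "embeddings" (real ones have hundreds of dimensions)
v1 = [1.0, 0.0, 1.0, 0.0]
v2 = [1.0, 0.0, 1.0, 0.0]
v3 = [0.0, 1.0, 0.0, 1.0]

print(cosine_similarity(v1, v2))  # identical vectors: close to 1.0
print(cosine_similarity(v1, v3))  # orthogonal vectors: 0.0
```

Texts with similar meanings should have embedding vectors with high cosine similarity.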

In this step, embeddings of texts are calculated using pre-trained neural language models, particularly those using the popular transformer architecture (e.g. BERT-based models).

Experimental

This function is still in the experimental stage, and we do not guarantee it won't fail for some combinations of model and parameters. Feel free to get in touch if you have problems using it.

Things to keep in mind

  • Unlike embed_text, which uses a different, appropriate spaCy model for each language in the text column, this step always uses a single model to calculate embeddings. This means the model should be multilingual if your texts mix languages, and that otherwise you need to choose the correct model for your (single) language.
  • Each model is downloaded on the fly before the text is processed. This adds some lag to the step's execution time (the bigger the model, the longer the download), though for a sufficiently large number of texts the time spent downloading should not be significant. Note also, however, that the download, and therefore this step, may fail if the publisher's servers are not responsive.
  • Since this step potentially supports tens if not hundreds of different models, we cannot provide support or advice on specific models.
  • We currently support execution of models on CPUs only, so performance is not yet what it will be. We're working to add GPU support in the near future.

Example

To calculate embeddings using a multilingual Sentence-BERT model (from sentence-transformers):

embed_text_with_model(ds.text, {"collection": "SBERT", "name": "distiluse-base-multilingual-cased-v2"}) -> (ds.embedding)

More examples

To calculate embeddings using an English Universal Sentence Encoder:

embed_text_with_model(ds.text, {"collection": "USE", "name": "universal-sentence-encoder/4"}) -> (ds.embedding)

Or to use word-averaged GloVe vectors from spaCy:

embed_text_with_model(ds.text, {"collection": "SPACY", "name": "en_core_web_md"}) -> (ds.embedding)

Usage

The following are the step's expected inputs and outputs and their specific types.

embed_text_with_model(text: text, {"param": value}) -> (embedding: list[number])

Inputs


text: column:text

A text column to calculate embeddings for.

Outputs


embedding: column:list[number]

A column of embedding vectors capturing the meaning of each input text.

Parameters


Model parameters depend on the collection/type a model belongs to. Pick a collection below for further details.

collection: string = "SBERT"

Embed texts using a Sentence-BERT model. Models in this collection (also known as sentence-transformers) have been trained specifically for semantic similarity, i.e. for the purpose of comparing the meaning of texts. Individual models in this collection can be found here: https://www.sbert.net/docs/pretrained_models.html. They differ in the languages they have been trained on; their size (bigger is usually better, but also slower); as well as their purpose or intended area of application (e.g. there is a specific model for embedding scientific publications).


name: string

A specific Sentence-BERT model name. To find a model appropriate for your data or task, check the website of the Sentence-BERT model collection.

Example parameter values:

  • "paraphrase-MiniLM-L6-v2"
  • "distiluse-base-multilingual-cased-v2"

normalize: boolean = True

Whether text embedding vectors should be normalized (to lengths of 1.0). This may make similarity calculations easier. E.g. we can then use the dot product as a similarity "metric", instead of the usual cosine angle (which not all downstream functions may support).
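A minimal NumPy sketch of why normalization helps (illustration only, not this step's implementation): once vectors have length 1.0, the dot product and cosine similarity coincide.

```python
import numpy as np

def normalize(v):
    """Scale a vector to unit length (1.0)."""
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v)

a = normalize([3.0, 4.0])  # unit length after normalization
b = normalize([4.0, 3.0])

dot = float(a @ b)
cos = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
# For normalized vectors, dot and cos are the same value (0.96 here)
```

So with normalized embeddings, a cheap dot product suffices for similarity calculations downstream.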


batch_size: integer = 32

How many texts to push through the model at the same time. Greater values usually mean faster processing (if supported by the model), but also greater use of memory.

Range: 1 ≤ batch_size < inf
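The effect of batch_size can be sketched as simple chunking (a hypothetical helper for illustration, not the step's actual implementation):

```python
def batches(texts, batch_size=32):
    """Yield consecutive chunks of at most batch_size texts."""
    for i in range(0, len(texts), batch_size):
        yield texts[i:i + batch_size]

texts = [f"text {i}" for i in range(70)]
sizes = [len(b) for b in batches(texts, batch_size=32)]
# sizes -> [32, 32, 6]: the model makes 3 forward passes instead of 70
```

Larger batches mean fewer (but bigger) forward passes through the model, trading memory for speed.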

collection: string = "USE"

Embed texts using a model from the Universal Sentence Encoder collection. These are general-purpose contextualized text embeddings from Google's https://tfhub.dev/google/collections/universal-sentence-encoder/. The family contains multilingual as well as single-language (English) models in various sizes.


name: string

The model name of a specific Universal Sentence Encoder. The names of models in this family can be found in its corresponding tensorflow-hub collection. Once you select a specific model, check the URL of its page. The part after https://tfhub.dev/google/ is the name expected in this parameter (unfortunately tensorflow-hub doesn't offer a nicer way to view a model's name).

Example parameter values:

  • "universal-sentence-encoder-multilingual/3"
  • "universal-sentence-encoder/4"
  • "universal-sentence-encoder-xling-many/1"

normalize: boolean = True

Whether text embedding vectors should be normalized (to lengths of 1.0). This may make similarity calculations easier. E.g. we can then use the dot product as a similarity "metric", instead of the usual cosine angle (which not all downstream functions may support).


batch_size: integer = 32

How many texts to push through the model at the same time. Greater values usually mean faster processing (if supported by the model), but also greater use of memory.

Range: 1 ≤ batch_size < inf

collection: string = "HF"

Embed texts using a model from the Hugging Face hub. Any PyTorch or TensorFlow model in HF's hub can be used, as long as its output contains a last hidden state. Note, however, that the output embedding of an arbitrary transformer is not always useful, and specifically may not be appropriate for sentence similarity. Such embeddings usually represent the input for downstream classification tasks instead. A Sentence-BERT or Universal Sentence Encoder model may be more appropriate in most cases.


name: string

A specific Hugging Face model name. To find a model appropriate for your data or task, browse the Hugging Face model hub. Note that the name of a model should include the name of the organization if applicable (e.g. "cardiffnlp/" in the example below).

Example parameter values:

  • "cardiffnlp/twitter-xlm-roberta-base"
  • "sentence-transformers/paraphrase-xlm-r-multilingual-v1"

normalize: boolean = True

Whether text embedding vectors should be normalized (to lengths of 1.0). This may make similarity calculations easier. E.g. we can then use the dot product as a similarity "metric", instead of the usual cosine angle (which not all downstream functions may support).


batch_size: integer = 32

How many texts to push through the model at the same time. Greater values usually mean faster processing (if supported by the model), but also greater use of memory.

Range: 1 ≤ batch_size < inf


pooling: string = "max"

How individual "word" embeddings should be combined. The output of a transformer contains embeddings for individual words (or sentence pieces, sub-word character sequences, etc.). This parameter determines how these are combined into a single vector representing the whole text: either the mean of the individual vectors or their component-wise maximum. (Currently, pooling does not take the attention mask into account.)

Must be one of: "mean", "max"
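The two pooling options can be sketched with NumPy (toy numbers, purely illustrative):

```python
import numpy as np

# Toy "token embeddings": 3 tokens, 4 dimensions each
tokens = np.array([
    [0.1, 0.9, 0.0, 0.5],
    [0.4, 0.2, 0.8, 0.1],
    [0.7, 0.3, 0.2, 0.9],
])

mean_pooled = tokens.mean(axis=0)  # average over tokens
max_pooled = tokens.max(axis=0)    # component-wise maximum
# max_pooled -> [0.7, 0.9, 0.8, 0.9]
```

Either way, a variable number of token vectors collapses into one fixed-size vector per text.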

collection: string = "SPACY"

Embed texts using a spaCy language model. See https://spacy.io/models/ for a list of supported model names. Note that models of size "sm", "md", or "lg" will generate embeddings that are weighted averages of GloVe word vectors. spaCy transformer models (names ending in "trf") will generate contextualized embeddings. Note, though, that these are tuned for spaCy's "internal" tasks (predicting part-of-speech tags etc.), not e.g. sentence similarity.


name: string

A specific spaCy model name. To find a model appropriate for your language, check spaCy's model documentation.

Example parameter values:

  • "en_core_web_md"
  • "es_dep_news_trf"