Embeddings calculated by this step can be used together with link_embeddings, for example, to create a network of texts connected by similarity.

In this step, embeddings of texts are calculated using pre-trained neural language models, especially those using the popular transformer architecture (e.g. BERT-based models).
Things to keep in mind
- Unlike embed_text, which uses a different, appropriate spaCy model for each language in the text column, this step always uses a single model to calculate embeddings. This means the model should be multilingual if you have mixed languages, and that otherwise you need to choose the correct model for your (single) language.
- Each model will be downloaded on the fly before processing the text. This adds a little lag to the execution time (the bigger the model, the longer the download), though for a sufficient number of texts the time spent downloading should not be significant. Note also, however, that the download, and therefore this step, may fail if the model publisher's servers are not responsive.
- Since this step potentially supports tens if not hundreds of different models, we cannot provide support or advice on specific models.
Usage
The following example shows how the step can be used in a recipe.

Examples
To calculate embeddings using a multilingual Sentence-BERT model (from sentence-transformers):
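As a sketch, such a recipe call might look like the following. Note that the step name embed_text_with_model, the output column ds.embedding and the parameter name "model" are assumptions here; check the step reference for the exact names used in your version:

```
embed_text_with_model(ds.text, {"model": "distiluse-base-multilingual-cased-v2"}) -> (ds.embedding)
```

The model distiluse-base-multilingual-cased-v2 is one of the multilingual models listed in the Sentence-BERT collection.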
Inputs & Outputs
The following are the inputs expected by the step and the outputs it produces. These are generally columns (ds.first_name), datasets (ds or ds[["first_name", "last_name"]]) or models (referenced by name, e.g. "churn-clf").
Inputs
A text column to calculate embeddings for.
Outputs
A column of embedding vectors capturing the meaning of each input text.
Configuration
The following parameters can be used to configure the behaviour of the step by including them in a json object as the last “input” to the step, i.e. step(..., {"param": "value", ...}) -> (output).
Parameters
Embed texts using a Sentence-BERT model.

Models in this collection (also known as sentence-transformers) have been trained specifically for semantic similarity, i.e. for the purpose of comparing the meaning of texts. Individual models in this collection can be found here: https://www.sbert.net/docs/pretrained_models.html. They differ in the language(s) they have been trained on; their size (bigger usually means better quality, but also slower processing); and their purpose or intended area of application (e.g. there is a specific model for embedding scientific publications).
A specific Sentence-BERT model name.
To find a model appropriate for your data or task, check the website of the
Sentence-BERT model collection.
Examples
- paraphrase-MiniLM-L6-v2
- distiluse-base-multilingual-cased-v2
Whether text embedding vectors should be normalized (to lengths of 1.0).

This may make similarity calculations easier: for unit-length vectors the dot product is equal to the cosine similarity, so it can be used directly as a similarity “metric” instead of the usual cosine angle (which not all downstream functions may support).
How many texts to push through the model at the same time.

Greater values usually mean faster processing (if supported by the model), but also greater use of memory.

Values must be in the following range:
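As a further sketch, a call configuring several of the parameters above might look roughly as follows. The step name embed_text_with_model and the parameter names model, normalize and batch_size are assumptions for illustration only; the exact identifiers may differ, so check the parameter reference in the application itself:

```
embed_text_with_model(ds.text, {
    "model": "paraphrase-MiniLM-L6-v2",
    "normalize": true,
    "batch_size": 32
}) -> (ds.embedding)
```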