# embed_text_with_model

> Use language models to calculate an embedding for each text in the provided column.

An embedding vector is a numerical representation of a text, such that different numerical components of the vector
capture different dimensions of the text's meaning. Embeddings can be used, for example, to calculate the *semantic similarity*
between pairs of texts. See [`link_embeddings`](https://docs.graphext.com/api-docs/analyse/graph_and_map/create_graph/link_embeddings/),
for example, to create a network of texts connected by similarity.
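
As a rough illustration of what semantic similarity means numerically, the following Python sketch computes the cosine similarity between embedding vectors. The three-dimensional vectors and the `cosine_similarity` helper are made up for this example. Real models produce vectors with hundreds of dimensions, and this is not the code the step itself runs.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy embeddings (illustrative values, not model output).
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.1]
invoice = [0.0, 0.1, 0.9]

# Vectors pointing in similar directions score close to 1,
# vectors pointing in unrelated directions score close to 0.
sim_related = cosine_similarity(cat, kitten)
sim_unrelated = cosine_similarity(cat, invoice)
print(sim_related > sim_unrelated)
```

Texts whose embeddings point in similar directions are treated as similar in meaning, which is the property that steps such as `link_embeddings` rely on.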

In this step, embeddings of texts are calculated using pre-trained
[neural language models](https://en.wikipedia.org/wiki/Language_model#Neural_network), especially those using the
popular [transformer architecture](https://huggingface.co/course/chapter1/4) (e.g.
[BERT-based models](https://huggingface.co/transformers/model_doc/bert.html)).

## Things to keep in mind

* Unlike [`embed_text`](https://docs.graphext.com/api-docs/prepare/embed/embed_text/), which uses a different, appropriate spaCy
  model for each language in the text column, this step always uses a single model to calculate embeddings. This
  means the model should be multilingual if your texts are in mixed languages; otherwise you need to choose the
  correct model for your (single) language.
* Each model is downloaded on the fly before processing the text. This adds a little lag to the step's execution time (the
  bigger the model, the longer the download), though for a sufficient number of texts the time spent downloading should not
  be significant. Note also, however, that the download, and therefore this step, may fail if the servers of the model's
  publisher are not responsive.
* Since this step potentially supports tens if not hundreds of different models, we cannot provide support or advice on
  specific models.

## Usage

The following example shows how the step can be used in a recipe.

<Accordion title="Examples" icon="code" defaultOpen="true">
  <Tabs>
    <Tab title="Example 1">
      To calculate embeddings using a multilingual sentence-bert model (from sentence-transformers):

      ```stan theme={null}
      embed_text_with_model(ds.text, {"collection": "SBERT", "name": "distiluse-base-multilingual-cased-v2"}) -> (ds.embedding)
      ```
    </Tab>

    <Tab title="Signature">
      General syntax for using the step in a recipe. Shows the inputs the step expects to receive and the outputs it will produce. For further details see the sections below.

      ```stan theme={null}
      embed_text_with_model(text: text, {
          "param": value,
          ...
      }) -> (embedding: list[number])
      ```
    </Tab>
  </Tabs>
</Accordion>

## Inputs & Outputs

The following are the inputs expected by the step and the outputs it produces. These are generally
columns (`ds.first_name`), datasets (`ds` or `ds[["first_name", "last_name"]]`) or models (referenced
by name e.g. `"churn-clf"`).

<Accordion title="Inputs" icon="right-to-bracket">
  <ParamField path="text" type="column[text]" required>
    A text column to calculate embeddings for.
  </ParamField>
</Accordion>

<Accordion title="Outputs" icon="right-from-bracket">
  <ParamField path="embedding" type="column[list[number]]" required>
    A column of embedding vectors capturing the meaning of each input text.
  </ParamField>
</Accordion>

## Configuration

The following parameters can be used to configure the behaviour of the step by including them in
a json object as the last "input" to the step, i.e. `step(..., {"param": "value", ...}) -> (output)`.

<Accordion title="Parameters" defaultOpen="true" icon="sliders">
  <Tabs>
    <Tab title="Sentence-BERT">
      <ParamField path="collection" type="string" default="SBERT" required>
        Embed texts using a *Sentence-BERT* model.
        Models in this collection (also known as *sentence-transformers*) have been trained specifically for semantic
        similarity, i.e. for the purpose of comparing the meaning of texts. Individual models in this collection
        can be found here: [https://www.sbert.net/docs/pretrained\_models.html](https://www.sbert.net/docs/pretrained_models.html).
        They differ in terms of the languages they have been trained on; their size (bigger is usually better,
        but also slower); as well as their purpose or intended area of application (e.g. the collection includes a
        model specifically for embedding scientific publications).
      </ParamField>

      <ParamField path="name" type="string" required>
        A specific *Sentence-BERT* model name.
        To find a model appropriate for your data or task, check the website of the
        [Sentence-BERT model collection](https://www.sbert.net/docs/pretrained_models.html).

        <Accordion title="Examples">
          * paraphrase-MiniLM-L6-v2
          * distiluse-base-multilingual-cased-v2
        </Accordion>
      </ParamField>

      <ParamField path="normalize" type="boolean" default="true">
        Whether text embedding vectors should be normalized (to a length of 1.0).
        This may make similarity calculations easier. E.g. we can then use the dot product as a similarity "metric",
        instead of the usual cosine similarity (which not all downstream functions may support).
      </ParamField>

      <ParamField path="batch_size" type="integer" default="32">
        How many texts to push through the model at the same time.
        Greater values usually mean faster processing (if supported by the model), but also greater use of memory.

        Values must be in the following range:

        ```javascript theme={null}
        1 ≤ batch_size < inf
        ```
      </ParamField>
    </Tab>

    <Tab title="Hugging Face">
      <ParamField path="collection" type="string" default="HF" required>
        Embed texts using a model from the *Hugging Face* hub.
        Any pytorch or tensorflow model in [HF's hub](https://huggingface.co/models)
        can be used as long as its output contains a [last hidden state](https://huggingface.co/transformers/main_classes/output.html#).
        Note, however, that using the output embedding of an arbitrary transformer is not always useful, and
        specifically may not be appropriate for sentence similarity. Rather, these embeddings usually represent the
        input for downstream classification tasks. A Sentence-BERT model or Universal Sentence Encoder may be more
        appropriate in most cases.
      </ParamField>

      <ParamField path="name" type="string" required>
        A specific *Hugging Face* model name.
        To find a model appropriate for your data or task, browse the [Hugging Face model hub](https://huggingface.co/models).
        Note that the `name` of a model should include the name of the organization if applicable (e.g.
        `"cardiffnlp/"` in the example below).

        <Accordion title="Examples">
          * cardiffnlp/twitter-xlm-roberta-base
          * sentence-transformers/paraphrase-xlm-r-multilingual-v1
        </Accordion>
      </ParamField>

      <ParamField path="normalize" type="boolean" default="true">
        Whether text embedding vectors should be normalized (to a length of 1.0).
        This may make similarity calculations easier. E.g. we can then use the dot product as a similarity "metric",
        instead of the usual cosine similarity (which not all downstream functions may support).
      </ParamField>

      <ParamField path="batch_size" type="integer" default="32">
        How many texts to push through the model at the same time.
        Greater values usually mean faster processing (if supported by the model), but also greater use of memory.

        Values must be in the following range:

        ```javascript theme={null}
        1 ≤ batch_size < inf
        ```
      </ParamField>

      <ParamField path="pooling" type="string" default="max">
        How individual "word" embeddings should be combined.
        The output of a transformer contains embeddings for individual words (or sentence pieces, sub-word character
        sequences etc.). This parameter determines how these are combined to create a single vector representing the
        whole text. This can be the *mean* of individual vectors or the (component-wise) *maximum* (currently pooling
        doesn't take the attention mask into account).

        Values must be one of the following:

        * `mean`
        * `max`
      </ParamField>
    </Tab>
  </Tabs>
</Accordion>
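
To make the `pooling` and `normalize` parameters more concrete, here is a minimal Python sketch of what they describe: per-token vectors are combined component by component into one vector per text, which is then optionally scaled to length 1.0 so that the dot product can serve as a similarity measure. The helper names (`pool`, `normalize`, `dot`) and the token vectors are hypothetical, and the sketch ignores the attention mask, as the parameter description notes. It is an illustration under these assumptions, not the step's actual implementation.

```python
import math


def pool(token_vectors, how="mean"):
    """Combine per-token vectors into a single vector, component by component."""
    if how not in ("mean", "max"):
        raise ValueError(f"unsupported pooling: {how!r}")
    columns = list(zip(*token_vectors))  # one tuple per vector component
    if how == "mean":
        return [sum(col) / len(col) for col in columns]
    return [max(col) for col in columns]


def normalize(vector):
    """Scale a vector to length 1.0 (a zero vector is returned unchanged)."""
    length = math.sqrt(sum(x * x for x in vector))
    if length == 0.0:
        return list(vector)
    return [x / length for x in vector]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


# Hypothetical token embeddings: text A has 3 tokens, text B has 2, each with 2 dimensions.
text_a_tokens = [[1.0, 0.0], [3.0, 4.0], [2.0, 2.0]]
text_b_tokens = [[0.0, 2.0], [4.0, 2.0]]

mean_a = pool(text_a_tokens, "mean")  # component-wise average
max_a = pool(text_a_tokens, "max")    # component-wise maximum
max_b = pool(text_b_tokens, "max")

# After normalization, the dot product equals the cosine similarity.
unit_a = normalize(max_a)
unit_b = normalize(max_b)
similarity = dot(unit_a, unit_b)
print(mean_a, max_a, similarity)
```

Because both vectors have length 1.0 after normalization, `similarity` here is the same value a cosine similarity calculation on the unnormalized vectors would give.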
