Parse and calculate a (word-averaged) embedding vector for each text.
The resulting embeddings can then be used in downstream steps (with link_embeddings, for example, to create a network of texts connected by similarity).
In this step, the embedding of a text is calculated as a (weighted) average of the embeddings of its individual
words (the word embeddings themselves are GloVe vectors, as provided by
spaCy’s language models).
Use either the language
parameter or a second input column to specify the language of the input texts. If neither
is provided, the language will be inferred automatically from the texts themselves (which is equivalent to first creating
a language column using the infer_language
step).
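The following is a minimal Python sketch of the word-averaging idea using spaCy; it computes a plain, unweighted average and is an illustration only, not this step's actual implementation (the model name en_core_web_md and the text_embedding helper are assumptions of the example).

```python
import numpy as np
import spacy

# Illustration only: a plain, unweighted average of per-word vectors.
# "en_core_web_md" is one of spaCy's English models that ships with static
# (GloVe-style) word vectors; it must be downloaded beforehand.
nlp = spacy.load("en_core_web_md")

def text_embedding(text: str) -> np.ndarray:
    doc = nlp(text)
    vectors = [tok.vector for tok in doc if tok.has_vector and not tok.is_punct]
    if not vectors:
        # No known words: fall back to a zero vector of the right dimension.
        return np.zeros(nlp.vocab.vectors_length)
    return np.mean(vectors, axis=0)

# Each text gets one fixed-length vector, regardless of its length.
print(text_embedding("A short example sentence.").shape)
```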
Examples
Example step calls may reference columns (e.g. ds.first_name), datasets (ds
or ds[["first_name", "last_name"]]) or models (referenced
by name, e.g. "churn-clf").
Inputs
The step takes a column of texts to embed and, optionally, a second column identifying the language of each text, e.g. as created beforehand with the infer_language step.
Ideally, languages should be expressed as two-letter
ISO 639-1 language codes, such as “en”, “es” or “de” for
English, Spanish or German respectively. We also detect fully spelled out names such as “english”, “German”, “allemande”
etc., but it is not guaranteed that we will always recognize all possible spellings correctly, so ISO codes should be
preferred. Alternatively, if all texts are in the same language, it can be identified with the language
parameter instead.
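As a small illustration of why ISO codes are the more reliable choice, the sketch below normalizes a few spelled-out names to ISO 639-1 codes with a hand-maintained lookup; it is not the step's actual name-detection logic, and the NAME_TO_ISO mapping is a made-up, deliberately tiny example.

```python
# Illustration only: spelled-out language names vary by spelling and by the
# language they are written in, whereas ISO 639-1 codes are unambiguous.
NAME_TO_ISO = {
    "english": "en",
    "spanish": "es",
    "german": "de",
    "allemande": "de",  # French name for German
}

def normalize_language(value: str) -> str:
    value = value.strip().lower()
    if len(value) == 2:
        return value                   # assume it is already an ISO 639-1 code
    return NAME_TO_ISO.get(value, "")  # empty string when the name is unknown

assert normalize_language("de") == "de"
assert normalize_language("German") == "de"
assert normalize_language("Deutsch") == ""  # a spelling this tiny lookup misses
```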
Outputs
Step signatures have the general form step(..., {"param": "value", ...}) -> (output).
Parameters
Options
lang: The language of the input texts, used if the language has not already been identified via a language
column above.
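When neither a language column nor the lang parameter is given, the language is inferred from the texts themselves (see above). For intuition only, the sketch below does something similar with the third-party langdetect package; the package is a stand-in chosen for this example and is not the step's actual implementation.

```python
# Illustration only: detecting each text's language before embedding, roughly
# what "inferring the language automatically" amounts to.
from langdetect import detect

texts = [
    "The weather has been unusually warm this week.",
    "El tiempo ha sido inusualmente cálido esta semana.",
]
languages = [detect(t) for t in texts]
print(languages)  # typically ['en', 'es']; detection can be unreliable on very short texts
```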