
Infer sentiment


Parse text and calculate the overall positive or negative sentiment polarity.

Polarity is measured on the normalized scale [-1, 1]. The method used here is rather naïve: it looks up each word of the text in a "polarity lexicon", which assigns each emotionally charged word a numeric score, and then averages the individual scores across the whole text. It therefore does not account for irony, sarcasm, or even simple negation.
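
The lexicon lookup and averaging described above can be sketched roughly as follows. This is only an illustration, not the step's actual implementation: the toy lexicon, the tokenizer, the function name `naive_polarity`, the choice to average over scored words only, and the neutral score for texts without any scored word are all assumptions.

```python
import re

# Hypothetical toy lexicon; a real polarity lexicon contains thousands of entries.
TOY_LEXICON = {"good": 0.5, "great": 1.0, "bad": -0.5, "terrible": -1.0}


def naive_polarity(text, lexicon=TOY_LEXICON):
    """Average the lexicon scores of all scored words found in `text`."""
    # Very rough tokenization: lowercase runs of letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    scores = [lexicon[word] for word in words if word in lexicon]
    # Assumption: a text without any scored word is treated as neutral.
    return sum(scores) / len(scores) if scores else 0.0


# "not" has no lexicon entry, so the negation is ignored entirely.
print(naive_polarity("not good"))
```

Because "not" carries no score of its own, "not good" comes out exactly as positive as "good", which is the kind of limitation mentioned above.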


The following are the step's expected inputs and outputs and their specific types.

Step signature
infer_sentiment(
    text: text,
    *lang: category, {
    "param": value
}) -> (sentiment: number)

where the object {"param": value} is optional in most cases and if present may contain any of the parameters described in the corresponding section below.


To detect the sentiment for languages supported by default, use:

Example call (in recipe editor)
infer_sentiment(ds.text, ds.lang) -> (ds.sentiment)
More examples

To only process those languages used in at least 1% of the input texts:

Example call (in recipe editor)
infer_sentiment(ds.text, ds.lang, {"min_language_freq": 0.01}) -> (ds.sentiment)


text: column:text

A text column to infer sentiment polarities for.

*lang: column:category

An (optional) column identifying the languages of the corresponding texts. It is used to select the correct spaCy model for each text. If the dataset doesn't contain such a column yet, it can be created using the infer_language step. Ideally, languages should be expressed as two-letter ISO 639-1 language codes, such as "en", "es" or "de" for English, Spanish or German respectively. We also detect fully spelled-out names such as "english", "German", "allemande" etc., but we cannot guarantee that every possible spelling will be recognized, so ISO codes should be preferred.

Alternatively, if all texts are in the same language, it can be identified with the language parameter instead.
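
To illustrate why ISO codes are the safer choice, here is a minimal sketch of how spelled-out names might be mapped to codes. The alias table, the set of codes and the function name `normalize_language` are hypothetical and not part of the step; any name missing from such a table simply cannot be resolved.

```python
# Hypothetical lookup tables, for illustration only.
ISO_CODES = {"en", "es", "de"}
ALIASES = {"english": "en", "spanish": "es", "german": "de", "allemande": "de"}


def normalize_language(value):
    """Return a two-letter code for `value`, or None if it is not recognized."""
    key = value.strip().lower()
    if key in ISO_CODES:
        return key
    # Spelled-out names only work if someone anticipated that exact spelling.
    return ALIASES.get(key)


print(normalize_language("German"))   # resolved via the alias table
print(normalize_language("klingon"))  # unknown spelling, not resolved
```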


sentiment: column:number

A column containing the overall sentiment polarity for each input text.


extended_language_support: boolean = False

Whether to enable support for additional languages. By default, Arabic ("ar"), Catalan ("ca"), Basque ("eu"), and Turkish ("tr") are not enabled, since they're supported only by a different class of language models (Stanford NLP's Stanza) that is much slower than the rest. This parameter can be used to enable them.

min_language_freq: number | integer = 0.02

Minimum number (or proportion) of texts a language must have to be included in processing. Texts in any language with fewer documents than this threshold will be ignored. This can speed up processing when the input languages are noisy and it is acceptable to ignore languages with only a few documents. Values smaller than 1 are interpreted as a proportion of all texts, and values greater than or equal to 1 as an absolute number of documents.
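
The way this threshold is interpreted can be sketched as below. This is a hedged illustration of the rule just described, not the step's own code; the function name `languages_to_process` and the sample data are assumptions.

```python
from collections import Counter


def languages_to_process(langs, min_language_freq=0.02):
    """Return the set of languages with enough texts to be processed."""
    counts = Counter(langs)
    if min_language_freq < 1:
        # Values below 1: a proportion of all texts.
        threshold = min_language_freq * len(langs)
    else:
        # Values of 1 or more: an absolute number of documents.
        threshold = min_language_freq
    # Languages with fewer documents than the threshold are ignored.
    return {lang for lang, n in counts.items() if n >= threshold}


# Hypothetical language column: 96 English, 3 Spanish and 1 German text.
sample = ["en"] * 96 + ["es"] * 3 + ["de"] * 1
print(languages_to_process(sample))                        # proportion of all texts
print(languages_to_process(sample, min_language_freq=5))   # absolute count
```

With the default of 0.02 the single German text falls below the threshold of 2 documents and is skipped, while an absolute threshold of 5 also drops Spanish.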

language: string | null

The language of the input texts. If all texts are in the same language, it can be specified here instead of passing it as an input column. The language will be used to identify the correct spaCy model to parse and analyze the texts. For allowed values, see the comment regarding the lang column above.