This step essentially combines all of the following steps into one:
  • embed_text
  • extract_emoji
  • extract_entities
  • extract_hashtags
  • extract_keywords
  • extract_mentions
  • infer_sentiment
  • tokenize
Note that this step does not currently allow detailed configuration of the individual extracted features. If you need that, use any or all of the individual steps listed above instead.
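As a rough illustration of the kind of values two of these features hold, the following Python sketch pulls hashtags and mentions out of a text with simple regular expressions. It is not how the step is implemented: the patterns, the function name sketch_text_features and the choice to drop the leading "#" and "@" are all assumptions made for this example.

```python
import re

# Simplified patterns, assumed for illustration only. The real steps
# probably handle more edge cases (punctuation, URLs, unusual scripts).
HASHTAG_RE = re.compile(r"#(\w+)")
MENTION_RE = re.compile(r"@(\w+)")


def sketch_text_features(text: str) -> dict[str, list[str]]:
    """Return one list per illustrated output column for a single text."""
    return {
        "Hashtags": HASHTAG_RE.findall(text),  # captured without the "#"
        "Mentions": MENTION_RE.findall(text),  # captured without the "@"
    }


features = sketch_text_features("Thanks @ana for the #python tips! #nlp")
print(features)
# {'Hashtags': ['python', 'nlp'], 'Mentions': ['ana']}
```

Each output is a list per row, which matches the list[category] type that the step declares for most of its outputs.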

Usage

The following shows how the step can be used in a recipe.

Examples

  • Signature
General syntax for using the step in a recipe. It shows the inputs the step expects and the outputs it produces. For further details see the sections below.
extract_text_features(text: text, *lang: category, {
    "param": value,
    ...
}) -> (
	Sentiment: number,
	Embedding: list[number],
	Hashtags: list[category],
	Mentions: list[category],
	Keywords: list[category],
	Tokens: list[category],
	Emoji: list[category],
	People: list[category],
	Groups: list[category],
	Organizatons: list[category],
	GPEs: list[category],
	Locations: list[category],
	Products: list[category],
	Events: list[category],
	Money: list[category]
)

Inputs & Outputs

The following are the inputs expected by the step and the outputs it produces. These are generally columns (ds.first_name), datasets (ds or ds[["first_name", "last_name"]]) or models (referenced by name e.g. "churn-clf").
text
column[text]
required
A text column to extract features from.
*lang
column[category]
An (optional) column identifying the language of each corresponding text. It is used to select the correct spaCy model for each text. If the dataset doesn’t contain such a column yet, it can be created using the infer_language step. Ideally, languages should be expressed as two-letter ISO 639-1 language codes, such as “en”, “es” or “de” for English, Spanish or German respectively. We also detect fully spelled-out names such as “english”, “German”, “allemande” etc., but we cannot guarantee that every possible spelling will be recognized correctly, so ISO codes should be preferred. Alternatively, if all texts are in the same language, it can be identified with the lang parameter instead.
Sentiment
column[number]
required
Embedding
column[list[number]]
required
Hashtags
column[list[category]]
required
Mentions
column[list[category]]
required
Keywords
column[list[category]]
required
Tokens
column[list[category]]
required
Emoji
column[list[category]]
required
People
column[list[category]]
required
Groups
column[list[category]]
required
Organizatons
column[list[category]]
required
GPEs
column[list[category]]
required
Locations
column[list[category]]
required
Products
column[list[category]]
required
Events
column[list[category]]
required
Money
column[list[category]]
required
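Since ISO 639-1 codes are preferred for the lang input, it can help to normalize a language column before passing it to the step. The Python sketch below shows one way to do that outside of a recipe. The lookup table NAME_TO_ISO and the function to_iso_639_1 are hypothetical and cover only the examples mentioned above; they do not reflect how the step itself recognizes language names.

```python
from typing import Optional

# Hypothetical lookup table covering only the spellings mentioned above.
NAME_TO_ISO = {
    "english": "en",
    "spanish": "es",
    "german": "de",
    "allemande": "de",
}
KNOWN_CODES = set(NAME_TO_ISO.values())


def to_iso_639_1(label: str) -> Optional[str]:
    """Map a language label to a two-letter code, or None if unknown."""
    cleaned = label.strip().lower()
    if cleaned in KNOWN_CODES:
        return cleaned  # already a known two-letter code
    return NAME_TO_ISO.get(cleaned)  # None for unrecognized labels


labels = ["en", "German", " ES ", "allemande", "klingon"]
codes = [to_iso_639_1(label) for label in labels]
print(codes)
# ['en', 'de', 'es', 'de', None]
```

Returning None for unrecognized labels makes them easy to find and fix, rather than silently passing an unexpected spelling on to the step.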

Configuration

The following parameters can be used to configure the behaviour of the step by including them in a JSON object as the last “input” to the step, i.e. step(..., {"param": "value", ...}) -> (output).

Parameters

This step doesn’t expect any specific parameters.