extract_text_features
Parse and process texts to extract multiple features at once.
extract_text_features(text: text, *lang: category, {
"param": value,
...
}) -> (
Sentiment: number,
Embedding: list[number],
Hashtags: list[category],
Mentions: list[category],
Keywords: list[category],
Tokens: list[category],
Emoji: list[category],
People: list[category],
Groups: list[category],
Organizatons: list[category],
GPEs: list[category],
Locations: list[category],
Products: list[category],
Events: list[category],
Money: list[category]
)
Essentially combines all of the following steps into one:
embed_text
extract_emoji
extract_entities
extract_hashtags
extract_keywords
extract_mentions
infer_sentiment
tokenize
Note that the step does not currently allow for detailed configuration of each of the extracted features. To do that, use any or all of the individual steps above.
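The idea of combining several extractors into one call can be sketched in Python. This is only an illustration of the pattern, not how the step is implemented: the regular expressions below are simplistic, hypothetical stand-ins for the individual steps (which rely on trained models), and `extract_text_features_sketch` is a name invented for this example.

```python
import re

# Hypothetical stand-ins for three of the individual steps. The real steps
# are far more capable than these simple patterns.
HASHTAG_RE = re.compile(r"#(\w+)")
MENTION_RE = re.compile(r"@(\w+)")
TOKEN_RE = re.compile(r"\w+")


def extract_hashtags(text: str) -> list[str]:
    """Return hashtags without the leading '#'."""
    return HASHTAG_RE.findall(text)


def extract_mentions(text: str) -> list[str]:
    """Return mentions without the leading '@'."""
    return MENTION_RE.findall(text)


def tokenize(text: str) -> list[str]:
    """Return lowercased word tokens."""
    return [token.lower() for token in TOKEN_RE.findall(text)]


def extract_text_features_sketch(text: str) -> dict[str, list[str]]:
    """Run each extractor once and collect all results in a single record."""
    return {
        "Hashtags": extract_hashtags(text),
        "Mentions": extract_mentions(text),
        "Tokens": tokenize(text),
    }


features = extract_text_features_sketch("Thanks @ana for the #NLP demo")
print(features)
```

The combined function exposes no per-extractor options, which mirrors the trade-off described above: one convenient call, with detailed configuration left to the individual steps.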
A text column to extract features from.
An (optional) column identifying the language of each corresponding text. It is used to select the correct spaCy model for each text. If the dataset doesn’t contain such a column yet, it can be created using the infer_language step.
Ideally, languages should be expressed as two-letter ISO 639-1 language codes, such as “en”, “es” or “de” for English, Spanish or German respectively. Fully spelled-out names such as “english”, “German” or “allemande” are also detected, but not every possible spelling is guaranteed to be recognized, so ISO codes should be preferred.
Alternatively, if all texts are in the same language, the language can be specified with the lang parameter instead.
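Since ISO codes are preferred, it may help to normalize a language column before passing it to the step. The following is a minimal sketch under stated assumptions: the mapping is a small illustrative subset built from the examples above, not the list of names the step actually recognizes, and `to_iso_639_1` is a hypothetical helper.

```python
from typing import Optional

# Illustrative subset only (assumption): extend with the spellings that
# actually occur in your data.
NAME_TO_ISO = {
    "english": "en",
    "spanish": "es",
    "german": "de",
    "allemande": "de",
}
ISO_CODES = set(NAME_TO_ISO.values())


def to_iso_639_1(value: str) -> Optional[str]:
    """Map a language label to a two-letter code, or None if unknown."""
    cleaned = value.strip().lower()
    if cleaned in ISO_CODES:
        # Already a known two-letter code, possibly in a different case.
        return cleaned
    return NAME_TO_ISO.get(cleaned)


labels = ["en", " German ", "allemande", "ES", "klingon"]
print([to_iso_639_1(label) for label in labels])
```

Returning None for unknown labels, rather than guessing, makes unrecognized values easy to spot and fix before the step runs.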