Parse and extract keywords from texts.
The text elements considered keywords are configurable. They can include detected noun phrases (compound nouns like ‘the quick brown fox’), any automatically recognized entities (people, products, events), or any lexical category of word, such as nouns, verbs, adjectives etc.
The following examples show how the step can be used in a recipe.
Examples
To extract all kinds of nouns only, i.e. entities, compound nouns, simple nouns and proper nouns (names):
To also include adjectives, and limit keywords to those that occur in at least 3 but no more than 90% of all documents:
General syntax for using the step in a recipe. It shows the inputs the step expects and the outputs it produces. For further details, see the sections below.
The following are the inputs expected by the step and the outputs it produces. These are generally columns (ds.first_name), datasets (ds or ds[["first_name", "last_name"]]) or models (referenced by name, e.g. "churn-clf").
Inputs
A text column to extract keywords from.
An (optional) column identifying the languages of the corresponding texts. It is used to identify the correct model (spaCy) to use for each text. If the dataset doesn’t contain such a column yet, it can be created using the infer_language step.
Ideally, languages should be expressed as two-letter ISO 639-1 language codes, such as “en”, “es” or “de” for English, Spanish or German respectively. Fully spelled-out names such as “english”, “German” or “allemande” are also detected, but there is no guarantee that every possible spelling will be recognized, so ISO codes should be preferred.
Alternatively, if all texts are in the same language, it can be identified with the language parameter instead.
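The preference for ISO codes over spelled-out names can be illustrated with a small normalization sketch. This is not the step's actual implementation: the alias table and the function name normalize_language are hypothetical, and the real list of recognized spellings is not documented here.

```python
# Hypothetical alias table; the step's real list of recognized spellings is unknown.
LANGUAGE_ALIASES = {
    "english": "en",
    "german": "de",
    "deutsch": "de",
    "allemande": "de",
    "spanish": "es",
    "español": "es",
}
KNOWN_CODES = set(LANGUAGE_ALIASES.values())


def normalize_language(value):
    """Return a two-letter ISO 639-1 code, or None if the value is not recognized."""
    if value is None:
        return None
    key = value.strip().lower()
    if key in KNOWN_CODES:
        # Already an ISO code: unambiguous, nothing to look up.
        return key
    # Spelled-out names only work if they happen to be in the alias table.
    return LANGUAGE_ALIASES.get(key)


codes = [normalize_language(v) for v in ["en", "German", "allemande", "Klingon"]]
```

ISO codes pass through unchanged, while spelled-out names depend on a lookup table that cannot cover every spelling, which is why an unrecognized value such as "Klingon" ends up without a language here.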
Outputs
Lists containing the keywords mentioned in each text.
The following parameters can be used to configure the behaviour of the step by including them in a JSON object as the last “input” to the step, i.e. step(..., {"param": "value", ...}) -> (output).
Parameters
Whether to enable support for additional languages. By default, Arabic (“ar”), Catalan (“ca”), Basque (“eu”), and Turkish (“tr”) are not enabled, since they are supported only by a different class of language models (Stanford NLP’s Stanza) that is much slower than the rest. This parameter can be used to enable them.
Minimum number (or proportion) of texts required for a language to be included in processing. Texts in a language with fewer documents than this threshold will be ignored. This can be useful to speed up processing when the input languages are noisy and when it is acceptable to ignore languages with only a small number of documents. Values smaller than 1 will be interpreted as a proportion of all texts, and values greater than or equal to 1 as an absolute number of documents.
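The dual interpretation of this threshold can be sketched as follows. This is a minimal illustration under assumptions, not the step's own code: the function name languages_to_keep and its argument names are hypothetical.

```python
from collections import Counter


def languages_to_keep(languages, min_freq):
    """Return the set of languages that have enough texts to be processed.

    min_freq < 1  -> interpreted as a proportion of all texts
    min_freq >= 1 -> interpreted as an absolute number of texts
    """
    counts = Counter(languages)
    threshold = min_freq * len(languages) if min_freq < 1 else min_freq
    # Languages with fewer texts than the threshold are dropped.
    return {lang for lang, n in counts.items() if n >= threshold}


langs = ["en"] * 7 + ["es"] * 2 + ["de"]
keep_abs = languages_to_keep(langs, 2)     # at least 2 texts
keep_prop = languages_to_keep(langs, 0.5)  # at least 50% of all texts
```

With ten texts, an absolute threshold of 2 keeps English and Spanish, whereas a proportion of 0.5 keeps only English.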
The language of the input texts.
If all texts are in the same language, it can be specified here instead of passing it as an input column. The language will be used to identify the correct spaCy model to parse and analyze the texts. For allowed values, see the comment regarding the lang column above.
Configure how keywords are extracted. Define the text elements considered keywords and their minimum or maximum frequency in the dataset to be included in the result.
Properties
Part-Of-Speech (POS) tags. Which lexical units (nouns, verbs etc.) to include as keywords. See spaCy’s universal part-of-speech tags for a detailed table of allowed values.
Array items
Each item in array.
Values must be one of the following:
ADJ
ADP
ADV
AUX
CONJ
CCONJ
DET
INTJ
NOUN
NUM
PART
PRON
PROPN
PUNCT
SCONJ
SYM
VERB
Whether or not to include any detected entities (people, places, events, etc.).
Whether or not to include compound noun phrases (such as ‘the quick red fox’).
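How these three options combine can be sketched in plain Python. The example assumes the texts have already been parsed, so the (text, POS tag) pairs, entities and noun phrases below are hand-written stand-ins for what a parser such as spaCy would produce. The function select_keywords and its argument names are hypothetical, and the deduplication behaviour is an assumption rather than documented behaviour of the step.

```python
def select_keywords(tokens, pos_tags, entities=(), noun_phrases=(),
                    include_entities=True, include_noun_phrases=True):
    """Collect keywords from pre-parsed text elements.

    tokens: list of (text, universal POS tag) pairs
    pos_tags: set of POS tags whose tokens count as keywords
    """
    candidates = []
    if include_entities:
        candidates.extend(entities)
    if include_noun_phrases:
        candidates.extend(noun_phrases)
    candidates.extend(text for text, pos in tokens if pos in pos_tags)
    # dict.fromkeys removes duplicates while keeping first-seen order.
    return list(dict.fromkeys(candidates))


tokens = [
    ("the", "DET"), ("quick", "ADJ"), ("brown", "ADJ"),
    ("fox", "NOUN"), ("visited", "VERB"), ("Madrid", "PROPN"),
]

# All kinds of nouns: entities, noun phrases, simple nouns and proper nouns.
nouns_only = select_keywords(
    tokens, {"NOUN", "PROPN"},
    entities=["Madrid"], noun_phrases=["the quick brown fox"],
)

# Single words only, with adjectives added to the allowed POS tags.
with_adjectives = select_keywords(
    tokens, {"NOUN", "PROPN", "ADJ"},
    include_entities=False, include_noun_phrases=False,
)
```

Adding a tag such as ADJ to the allowed set widens the result, while the two boolean options switch whole classes of multi-word keywords on or off.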
Filter keywords based on the number of texts they occur in. Filter conditions can be applied globally or per language.
Properties
Minimum number of rows. Keywords that do not occur in at least this many rows (texts) will be excluded.
Values must be in the following range:
Maximum proportion of rows. Keywords occurring in more than this proportion of rows (texts) will be excluded.
Values must be in the following range:
Whether to always include the n most frequent keywords, i.e. independent of any other filter conditions. Set to null to ignore.
Values must be in the following range:
Whether to exclude the n most frequent keywords, i.e. independent of any other filter conditions.
Values must be in the following range:
Filter per language. Apply filter conditions separately to texts grouped by language, rather than across all texts.
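The frequency filters described above can be sketched as a document-frequency filter over lists of keywords. This is an illustration under assumptions, not the step's implementation: the function and parameter names (min_rows, max_rows, always_top_n, exclude_top_n) are hypothetical, and the precedence chosen here (exclusion of the top n wins over forced inclusion) is a guess.

```python
from collections import Counter


def filter_by_document_frequency(keyword_lists, min_rows=1, max_rows=1.0,
                                 always_top_n=None, exclude_top_n=0):
    """Drop keywords that occur in too few or too many texts."""
    n_docs = len(keyword_lists)
    # Document frequency: each keyword counts once per text.
    df = Counter(kw for kws in keyword_lists for kw in set(kws))
    # Rank by frequency (descending), ties broken alphabetically.
    ranked = [kw for kw, _ in sorted(df.items(), key=lambda kv: (-kv[1], kv[0]))]
    excluded = set(ranked[:exclude_top_n])
    forced = set(ranked[:always_top_n]) if always_top_n else set()
    allowed = {
        kw for kw, n in df.items()
        if n >= min_rows and n / n_docs <= max_rows
    }
    # Assumption: forced keywords bypass the range checks, exclusions win overall.
    allowed = (allowed | forced) - excluded
    return [[kw for kw in kws if kw in allowed] for kws in keyword_lists]


docs = [["fox", "dog"], ["fox", "cat"], ["fox", "dog"], ["fox", "bird"]]

# Keep keywords found in at least 2 texts but in no more than 90% of them.
filtered = filter_by_document_frequency(docs, min_rows=2, max_rows=0.9)

# Same limits, but always keep the single most frequent keyword.
with_top = filter_by_document_frequency(docs, min_rows=2, max_rows=0.9, always_top_n=1)
```

To filter per language, the same function could be applied separately to the texts of each language group, so that the counts and proportions are computed within each group rather than across all texts.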