extract_keywords
Parse and extract keywords from texts.
The text elements considered keywords are configurable. They can include detected noun phrases (compound nouns like ‘the quick brown fox’), any automatically recognized entities (people, products, events), or any lexical category of word, such as nouns, verbs or adjectives.
Usage
The following examples show how the step can be used in a recipe.
To extract all kinds of nouns only, i.e. entities, compound nouns, simple nouns and proper nouns (names):
Inputs & Outputs
The following are the inputs expected by the step and the outputs it produces. These are generally columns (ds.first_name), datasets (ds or ds[["first_name", "last_name"]]) or models (referenced by name, e.g. "churn-clf").
Configuration
The following parameters can be used to configure the behaviour of the step by including them in a json object as the last “input” to the step, i.e. step(..., {"param": "value", ...}) -> (output).
Whether to enable support for additional languages. By default, Arabic (“ar”), Catalan (“ca”), Basque (“eu”), and Turkish (“tu”) are not enabled, since they are supported only by a different class of language models (Stanford NLP’s Stanza) that is much slower than the models used for other languages. Use this parameter to enable them.
Minimum number (or proportion) of texts a language must have to be included in processing. Texts in any language with fewer documents than this threshold will be ignored. This can speed up processing when the detected input languages are noisy and it is acceptable to ignore languages that have only a small number of documents. Values smaller than 1 are interpreted as a proportion of all texts, and values greater than or equal to 1 as an absolute number of documents.
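The threshold rule above can be illustrated with a minimal Python sketch. It is not the step's implementation: the function name languages_to_keep and its arguments are hypothetical, and it only assumes that each text already carries a detected language code.

```python
from collections import Counter


def languages_to_keep(langs, min_freq):
    """Return the languages whose document count meets the threshold.

    Hypothetical helper, for illustration only.
    min_freq < 1 is read as a proportion of all texts,
    min_freq >= 1 as an absolute number of documents.
    """
    counts = Counter(langs)
    # Convert a proportion into an absolute document count if needed.
    threshold = min_freq * len(langs) if min_freq < 1 else min_freq
    # Languages with fewer documents than the threshold are dropped.
    return {lang for lang, n in counts.items() if n >= threshold}


# One detected language code per text: 7 English, 2 Spanish, 1 German.
langs = ["en"] * 7 + ["es"] * 2 + ["de"]
kept_by_proportion = languages_to_keep(langs, 0.2)  # at least 20% of texts
kept_by_count = languages_to_keep(langs, 3)         # at least 3 texts
```

With these ten texts, a value of 0.2 corresponds to a threshold of two documents, while a value of 3 is used directly as a document count.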
The language of the input texts.
If all texts are in the same language, it can be specified here instead of passing it as an input column. The language will be used to identify the correct spaCy model to parse and analyze the texts. For allowed values, see the comment regarding the lang column above.
Configure how keywords are extracted. Define the text elements considered keywords and their minimum or maximum frequency in the dataset to be included in the result.
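As a rough illustration of frequency-based filtering, the sketch below drops keywords that are too rare or too common across a dataset. It is an assumption-laden example rather than the step's actual behaviour: the function filter_by_frequency and its min_count and max_count arguments are hypothetical, and frequency is assumed here to mean the number of texts in which a keyword appears.

```python
from collections import Counter


def filter_by_frequency(keywords_per_text, min_count=None, max_count=None):
    """Keep only keywords whose document frequency lies within the bounds.

    Hypothetical helper, for illustration only. Frequency is assumed to be
    the number of texts containing the keyword at least once.
    """
    # Count each keyword once per text, however often it repeats there.
    doc_freq = Counter(kw for kws in keywords_per_text for kw in set(kws))

    def keep(kw):
        n = doc_freq[kw]
        above_min = min_count is None or n >= min_count
        below_max = max_count is None or n <= max_count
        return above_min and below_max

    return [[kw for kw in kws if keep(kw)] for kws in keywords_per_text]


# Keywords already extracted from three texts.
extracted = [
    ["fox", "dog", "river"],
    ["fox", "dog"],
    ["fox", "bridge"],
]
filtered = filter_by_frequency(extracted, min_count=2, max_count=2)
```

Here "fox" appears in all three texts and "river" and "bridge" in one each, so only "dog" falls within the bounds of exactly two texts.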