Usage
The following example shows how the step can be used in a recipe.
Examples
Using a custom configuration to select the maximum size (n) of the n-grams, and how to represent them:
Inputs & Outputs
The following are the inputs expected by the step and the outputs it produces. These are generally columns (ds.first_name), datasets (ds or ds[["first_name", "last_name"]]) or models (referenced by name, e.g. "churn-clf").
Inputs
A text column to extract n-grams from.
An (optional) column identifying the languages of the corresponding texts. It is used to identify the correct model (spaCy) to use for each text. If the dataset doesn’t contain such a column yet, it can be created using the infer_language step.
Ideally, languages should be expressed as two-letter ISO 639-1 language codes, such as “en”, “es” or “de” for English, Spanish or German respectively. Fully spelled-out names such as “english”, “German” or “allemande” are also detected, but not every possible spelling is guaranteed to be recognized, so ISO codes should be preferred.
Alternatively, if all texts are in the same language, it can be identified with the language parameter instead.
Outputs
A column of lists containing the n-grams extracted from the texts.
Configuration
The following parameters can be used to configure the behaviour of the step by including them in a JSON object as the last “input” to the step, i.e. step(..., {"param": "value", ...}) -> (output).
Parameters
Whether to enable support for additional languages.
By default, Arabic (“ar”), Catalan (“ca”), Basque (“eu”), and Turkish (“tu”) are not enabled,
since they’re supported only by a different class of language models (stanfordNLP’s Stanza)
that is much slower than the rest. This parameter can be used to enable them.
Minimum number (or proportion) of texts to include a language in processing.
Texts in any language with fewer documents than this threshold will be ignored. This can speed up processing when the input languages are noisy and it is acceptable to ignore languages with only a small number of documents. Values smaller than 1 are interpreted as a proportion of all texts, and values greater than or equal to 1 as an absolute number of documents.
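The threshold rule described above can be sketched in a few lines of Python. This is only an illustration of the stated behaviour, not the step's actual implementation; the function name languages_to_keep and the argument name min_docs are hypothetical.

```python
from collections import Counter

def languages_to_keep(langs, min_docs):
    """Return the languages that have enough texts to be processed.

    min_docs is a hypothetical name for the threshold: values below 1 are
    read as a proportion of all texts, values of 1 or more as an absolute count.
    """
    counts = Counter(langs)
    threshold = min_docs * len(langs) if min_docs < 1 else min_docs
    # Languages with fewer texts than the threshold are ignored.
    return {lang for lang, n in counts.items() if n >= threshold}

langs = ["en"] * 8 + ["de"] * 2 + ["fr"]
print(languages_to_keep(langs, 2))     # absolute: at least 2 texts
print(languages_to_keep(langs, 0.25))  # proportion: at least 25% of the 11 texts
```

With 11 texts, an absolute threshold of 2 keeps English and German, while a proportion of 0.25 (2.75 texts) keeps only English.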
Options
number. Values must be in the following range:
The language of input texts.
If all texts are in the same language, it can be specified here instead of passing it as an input column. The language will be used to identify the correct spaCy model to parse and analyze the texts. For allowed values, see the comment regarding the lang column above.
N-gram configuration.
Configure maximum size, which words/tokens to exclude, and how to represent the n-grams in the result.
Properties
N-grams with up to this number of words will be extracted.
Values must be in the following range:
Exclude these kinds of tokens from n-grams.
For longer n-grams, those containing either extraneous whitespace or any punctuation are automatically excluded. Additionally, if stops is included in filters, n-grams containing stopwords as the first and/or last token are also excluded.
Array items
Each item in array. Values must be one of the following:
punct
stops
url
digits
non_alpha
non_ascii
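To make the exclusion rules concrete, here is a minimal pure-Python sketch of n-gram extraction with the punct and stops filters applied. It is an illustration only: the step itself relies on spaCy models, while this sketch assumes a simple regex tokenizer and a tiny hand-written stopword list, and the names extract_ngrams, max_n, filter_stops and sep are hypothetical.

```python
import re

# Tiny illustrative stopword list (an assumption; the step uses spaCy models).
STOPWORDS = {"the", "a", "an", "of", "and", "in"}

def tokenize(text):
    # Lowercase, then split into word tokens and single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", text.lower())

def is_punct(token):
    return re.fullmatch(r"[^\w\s]", token) is not None

def extract_ngrams(text, max_n=2, filter_stops=True, sep=" "):
    tokens = tokenize(text)
    ngrams = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            gram = tokens[i:i + n]
            # Sketch assumes the punct filter is on: drop punctuation at every size.
            if any(is_punct(tok) for tok in gram):
                continue
            # With the stopword filter on, drop n-grams that start or end with a stopword.
            if filter_stops and (gram[0] in STOPWORDS or gram[-1] in STOPWORDS):
                continue
            ngrams.append(sep.join(gram))
    return ngrams

print(extract_ngrams("The history of art, and music", max_n=3))
print(extract_ngrams("The history of art, and music", max_n=3, sep="_"))
```

Note that a stopword in the middle of an n-gram ("history of art") is kept, since only the first and last tokens are checked.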
Representation of the individual words/tokens to extract.
That is, whether tokens are kept verbatim (text/orth), lowercased (lower) or lemmatized (lemma). See also spaCy’s attribute reference for further information.
Values must be one of the following:
orth
lemma
lower
text
Whether unigrams should always be extracted lemmatized, irrespective of the attrib parameter.
Whether to separate the words in n-grams with an underscore character instead of a space.
N-gram frequency filter.
Filters n-grams based on the number of texts they occur in.
Properties
Minimum number of rows.
N-grams not occurring in at least this many rows (texts) will be excluded.
Values must be in the following range:
Maximum proportion of rows.
N-grams occurring in more than this proportion of rows (texts) will be excluded.
Values must be in the following range:
Keep n most frequent keywords.
Always include the n most frequent n-grams, independent of the other filter parameters. Set to null to ignore.
Values must be in the following range:
Exclude n most frequent keywords.
Exclude the n most frequent n-grams, even if they passed the other filter conditions.
Values must be in the following range:
Filter per language.
Apply filter conditions separately to texts grouped by language, rather than across all texts.
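The frequency filter can be pictured as a document-frequency filter over rows of extracted n-grams. The sketch below is a hedged illustration rather than the step's implementation: the parameter names (min_rows, max_prop, keep_top_n, drop_top_n) are hypothetical, and the order in which the "keep" and "exclude" rules are applied is an assumption. Per-language filtering is not covered.

```python
from collections import Counter

def filter_by_frequency(rows, min_rows=1, max_prop=1.0, keep_top_n=None, drop_top_n=0):
    """Filter n-grams by the number of rows (texts) they occur in."""
    # Document frequency: count each n-gram at most once per row.
    df = Counter(g for row in rows for g in dict.fromkeys(row))
    ranked = [g for g, _ in df.most_common()]
    n_rows = len(rows)
    allowed = {g for g, c in df.items() if c >= min_rows and c / n_rows <= max_prop}
    if keep_top_n:
        # Assumption: the top n are re-added regardless of the limits above.
        allowed |= set(ranked[:keep_top_n])
    # Assumption: exclusion of the top n is applied last.
    allowed -= set(ranked[:drop_top_n])
    return [[g for g in row if g in allowed] for row in rows]

rows = [
    ["data", "science", "data science"],
    ["data", "model"],
    ["data", "science"],
    ["data", "rare term"],
]
print(filter_by_frequency(rows, min_rows=2, max_prop=0.75))
print(filter_by_frequency(rows, min_rows=2, max_prop=0.75, keep_top_n=1))
```

In the first call, "data" occurs in every row and exceeds the maximum proportion, so only "science" survives; in the second call, "data" is restored because it is the single most frequent n-gram.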