Parse texts and extract their n-grams.
Examples
(ds.first_name), datasets (ds or ds[["first_name", "last_name"]]) or models (referenced by name, e.g. "churn-clf").
Inputs
infer_language step.
Ideally, languages should be expressed as two-letter ISO 639-1 language codes, such as “en”, “es” or “de” for English, Spanish or German respectively. Fully spelled-out names such as “english”, “German” or “allemande” are also detected, but not every possible spelling is guaranteed to be recognized, so ISO codes should be preferred.
Alternatively, if all texts are in the same language, it can be identified with the language parameter instead.
Outputs
step(..., {"param": "value", ...}) -> (output).
Parameters
Options
lang column above.
Properties
If stops is included in filters, n-grams containing stopwords as the first and/or last token are also excluded.
Array items
punct
stops
url
digits
non_alpha
non_ascii
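The effect of the stops filter on n-gram boundaries can be illustrated with a minimal Python sketch. It is not this step's implementation: the STOPWORDS set, the whitespace tokenizer and the function names ngrams and extract_ngrams are all assumptions made for the example, and it assumes that stopwords stay in the token sequence so they can still appear inside a longer n-gram.

```python
# Hypothetical stopword list for illustration only; the step's real list is not shown here.
STOPWORDS = {"the", "of", "a", "and", "in"}


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def extract_ngrams(text, sizes=(1, 2, 3), drop_stops=True):
    """Extract n-grams, optionally dropping those that start or end with a stopword."""
    # Naive lowercasing and whitespace tokenization; a real parser does much more.
    tokens = text.lower().split()
    result = []
    for n in sizes:
        for gram in ngrams(tokens, n):
            # Exclude n-grams whose first and/or last token is a stopword.
            if drop_stops and (gram[0] in STOPWORDS or gram[-1] in STOPWORDS):
                continue
            result.append(" ".join(gram))
    return result


print(extract_ngrams("The king of Spain"))
# → ['king', 'spain', 'king of spain']
```

Under these assumptions, "king of spain" survives because the stopword sits in the middle, while "the king" and "of spain" are dropped because a stopword is their first token.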
orth
lemma
lower
text
attrib parameter.
Properties
null to ignore.
Values must be in the following range: