Parse texts and separate them into lists of tokens (words, lemmas, etc.).
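As a rough illustration of what separating texts into lists of tokens means, here is a minimal Python sketch. The `simple_tokenize` helper and its regular expression are hypothetical and are not this step's actual implementation, which may also produce lemmas and handle language-specific rules.

```python
import re
from typing import List

# Hypothetical pattern, assumed for illustration only:
# a token is either a run of word characters or a single
# non-space, non-word character (e.g. punctuation).
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def simple_tokenize(text: str) -> List[str]:
    """Split one text into a list of word and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)


texts = ["Hello, world!", "Tokens are words."]

# One list of tokens per input text.
tokens_per_text = [simple_tokenize(t) for t in texts]
```

Each input text maps to one list of tokens, so a column of texts becomes a column of lists.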
Examples
ds.first_name), datasets (ds or ds[["first_name", "last_name"]]) or models (referenced by name e.g. "churn-clf").
Inputs
infer_language step.
Ideally, languages should be expressed as two-letter ISO 639-1 language codes, such as “en”, “es” or “de” for English, Spanish or German respectively. We also detect fully spelled-out names such as “english”, “German” or “allemande”, but we cannot guarantee that every possible spelling will be recognized, so ISO codes should be preferred.
Alternatively, if all texts are in the same language, it can be identified with the language parameter instead.
Outputs
step(..., {"param": "value", ...}) -> (output).
Parameters
Options
lang
column above.
Properties
orth
lemma
lower
text
stops), URLs (urls), punctuation (punct), digits, tokens containing non-alphabetic characters (non_alpha), and tokens containing non-ASCII characters (non_ascii).
Array items
stops
urls
punct
digits
non_alpha
non_ascii
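The filter categories listed above can be pictured as simple per-token predicates. The following Python sketch is an assumption-laden illustration, not the step's implementation: the `filter_tokens` helper, the tiny stop word set, the URL prefix check, and the reading of "digits" as "tokens made only of digits" are all hypothetical.

```python
import string

# Tiny illustrative stop word list (assumption; real lists are
# language-specific and much longer).
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "is", "in", "at"}

# One predicate per filter name; each returns True if the token
# should be removed when that filter is active.
PREDICATES = {
    "stops": lambda t: t.lower() in STOP_WORDS,
    "urls": lambda t: t.lower().startswith(("http://", "https://", "www.")),
    "punct": lambda t: all(ch in string.punctuation for ch in t),
    "digits": lambda t: t.isdigit(),
    "non_alpha": lambda t: not t.isalpha(),
    "non_ascii": lambda t: not t.isascii(),
}


def filter_tokens(tokens, exclude):
    """Drop every token matched by at least one of the named filters."""
    checks = [PREDICATES[name] for name in exclude]
    return [t for t in tokens if not any(check(t) for check in checks)]


tokens = ["The", "café", "at", "https://example.com", "opened", "in", "2019", "!"]

kept = filter_tokens(tokens, ["stops", "urls", "punct", "digits"])
```

Under these assumptions, `kept` retains only "café" and "opened"; adding "non_ascii" to the list would also drop "café".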
Properties
null to ignore.
Values must be in the following range: