tokenize

Usage

The following example shows how the step can be used in a recipe.

Examples

For example, to convert text strings to lists of lower-cased words, ignoring any tokens that represent punctuation (punct), URLs (urls) and stop words (stops), use:

tokenize(ds.text, ds.lang, {
  "tokens": {
    "exclude": ["punct", "urls", "stops"]
  }
}) -> (ds.tokens)

Inputs & Outputs

The following are the inputs expected by the step and the outputs it produces. These are generally columns (ds.first_name), datasets (ds or ds[["first_name", "last_name"]]) or models (referenced by name e.g. "churn-clf").

Inputs

Outputs

Configuration

The following parameters can be used to configure the behaviour of the step by including them in a JSON object as the last “input” to the step, i.e. step(..., {"param": "value", ...}) -> (output).

Parameters

extended_language_support

boolean

default:"false"

Whether to enable support for additional languages. By default, Arabic (“ar”), Catalan (“ca”), Basque (“eu”), and Turkish (“tu”) are not enabled, since they’re supported only by a different class of language models (stanfordNLP’s Stanza) that is much slower than the rest. This parameter can be used to enable them.

min_language_freq

[number, integer]

default:"0.02"

Minimum number (or proportion) of texts a language must have to be included in processing. Texts in a language with fewer documents than this threshold will be ignored. This can speed up processing when the input languages are noisy and it is acceptable to ignore languages that have only a small number of documents. Values smaller than 1 are interpreted as a proportion of all texts, and values greater than or equal to 1 as an absolute number of documents.
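The threshold rule described above can be sketched in plain Python. This is only an illustration of the documented interpretation, not the step's actual implementation: the function name languages_to_process is hypothetical, and it assumes that a language with exactly the threshold number of documents is kept.

```python
from collections import Counter

def languages_to_process(langs, min_language_freq=0.02):
    """Return the set of languages that meet the threshold (hypothetical helper)."""
    counts = Counter(langs)
    # Values below 1 are read as a proportion of all texts,
    # values of 1 or more as an absolute number of documents.
    if min_language_freq < 1:
        threshold = min_language_freq * len(langs)
    else:
        threshold = min_language_freq
    # Assumption: a language with exactly `threshold` documents is kept.
    return {lang for lang, n in counts.items() if n >= threshold}

# 100 texts: 96 English, 3 Spanish, 1 German
langs = ["en"] * 96 + ["es"] * 3 + ["de"]
default_result = languages_to_process(langs)      # proportion: 2% of 100 texts
absolute_result = languages_to_process(langs, 5)  # absolute: at least 5 texts
```

With the default of 0.02, the single German text falls below the threshold and would be ignored. With an absolute value of 5, only English remains.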

Options

language

[string, null]

The language of the input texts. If all texts are in the same language, it can be specified here instead of passing it as an input column. The language will be used to identify the correct spaCy model to parse and analyze the texts. For allowed values, see the comment regarding the lang column above.

tokens

object

Configure how tokens are extracted and represented in the output. Define the kinds of tokens to extract, how to represent them, and their minimum or maximum frequency in the dataset to be included in the result.

Properties

attrib

string

default:"lower"

Representation of the individual tokens to extract, i.e. whether verbatim (text/orth), lower-cased (lower) or lemmatized (lemma). See spaCy’s token attribute reference for further information. Values must be one of the following:

orth
lemma
lower
text

exclude

array[string]

default:"['punct', 'urls']"

Which kinds of tokens to exclude. Valid filters are stop words (stops), URLs (urls), punctuation (punct), digits, tokens containing non-alphabetic characters (non_alpha), and tokens containing non-ascii characters (non_ascii).

Array items
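As a rough illustration of how several exclude filters combine, the sketch below drops a token as soon as any selected filter matches it. All names here (STOPS, FILTERS, apply_exclude) are hypothetical, and the string checks are simple stand-ins: the step itself presumably relies on spaCy token attributes. The punct and urls filters are omitted because they need a real tokenizer.

```python
# Tiny illustrative stop list; real stop word lists are language specific.
STOPS = {"the", "a", "of", "and"}

# Hypothetical stand-ins for some of the documented filters.
FILTERS = {
    "stops": lambda t: t.lower() in STOPS,
    "digits": lambda t: t.isdigit(),
    "non_alpha": lambda t: not t.isalpha(),   # contains non-alphabetic characters
    "non_ascii": lambda t: not t.isascii(),   # contains non-ASCII characters
}

def apply_exclude(tokens, exclude):
    """Drop every token matched by at least one of the named filters."""
    checks = [FILTERS[name] for name in exclude]
    return [t for t in tokens if not any(check(t) for check in checks)]

tokens = ["the", "café", "opened", "in", "2019"]
no_stops_or_digits = apply_exclude(tokens, ["stops", "digits"])
ascii_only = apply_exclude(tokens, ["non_ascii"])
```

Here no_stops_or_digits keeps "café", "opened" and "in", while ascii_only drops only "café".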

frequency_filter

object

Token frequency filter. Filters tokens based on the number of texts they occur in.

Properties

min_rows

integer

default:"2"

Minimum number of rows. Tokens not occurring in at least this many rows (texts) will be excluded. Values must be in the following range:

0 ≤ min_rows < inf

max_rows

number

default:"0.5"

Maximum proportion of rows. Tokens occurring in more than this proportion of rows (texts) will be excluded. Values must be in the following range:

0 ≤ max_rows ≤ 1

keep_top_n

[integer, null]

Keep the n most frequent tokens. Whether to always include the n most frequent tokens, independent of the other filter parameters. Set to null to ignore. Values must be in the following range:

0 ≤ keep_top_n < inf

filter_top_n

integer

default:"0"

Exclude the n most frequent tokens, even if they pass the other filter conditions. Values must be in the following range:

0 ≤ filter_top_n < inf

by_lang

boolean

default:"false"

Filter per language. Apply filter conditions separately to texts grouped by language, rather than across all texts.
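The interplay of the frequency filter parameters can be sketched as follows. This is a minimal illustration under stated assumptions, not the step's implementation: the function name frequency_filter is hypothetical, filter_top_n is assumed to take precedence over keep_top_n when both apply, and by_lang is left out (it would amount to running the same logic separately on each language group).

```python
from collections import Counter

def frequency_filter(docs, min_rows=2, max_rows=0.5, keep_top_n=None, filter_top_n=0):
    """Filter tokens by the number of rows (texts) they occur in."""
    # Document frequency: count each token at most once per row.
    df = Counter(tok for doc in docs for tok in set(doc))
    ranked = [tok for tok, _ in df.most_common()]  # most frequent first
    n_rows = len(docs)

    # Keep tokens seen in at least min_rows rows and in at most
    # the max_rows proportion of all rows.
    kept = {t for t, n in df.items() if n >= min_rows and n / n_rows <= max_rows}

    # Always include the n most frequent tokens, if requested.
    if keep_top_n:
        kept |= set(ranked[:keep_top_n])

    # Assumption: exclusion of the top n tokens is applied last.
    if filter_top_n:
        kept -= set(ranked[:filter_top_n])

    return [[t for t in doc if t in kept] for doc in docs]

docs = [
    ["data", "is", "big"],
    ["data", "is", "small"],
    ["data", "models"],
    ["models", "learn"],
]
with_defaults = frequency_filter(docs)
with_top_1 = frequency_filter(docs, keep_top_n=1)
```

With the defaults, "data" is dropped because it occurs in 3 of 4 rows (more than 0.5), and "big", "small" and "learn" are dropped because they occur in only one row. Setting keep_top_n to 1 brings "data" back regardless of max_rows.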
