Tokenize

NLP · text

Parse texts and separate them into lists of tokens (words, lemmas, etc.).

Example

For example, to convert text strings into lists of lower-cased words, ignoring any tokens that represent punctuation (punct), URLs (urls) or stop words (stops), use:

tokenize(ds.text, ds.lang, {
  "tokens": {
    "exclude": ["punct", "urls", "stops"]
  }
}) -> (ds.tokens)

Usage

The following are the step's expected inputs and outputs and their specific types.

tokenize(
    text: text,
    lang: category, 
    {
        "param": value
    }
) -> (tokens: list[category])

where the object {"param": value} is optional in most cases and, if present, may contain any of the parameters described in the corresponding section below.

Inputs


text: column:text

A column of texts to separate into tokens.


lang: column:category

A column identifying the languages of the corresponding texts. If the dataset doesn't contain such a column yet, it can be created using the infer_language step. Ideally, languages should be expressed as two-letter ISO 639-1 language codes, such as "en", "es" or "de" for English, Spanish and German respectively. Fully spelled-out names such as "english", "German" or "allemande" are also detected, but not all possible spellings are guaranteed to be recognized, so ISO codes should be preferred.
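
If the dataset has no language column yet, one way to create it is to run infer_language first and then pass its output to tokenize. This is a sketch only; the exact inputs and outputs of infer_language are assumed here, so check its own documentation:

infer_language(ds.text) -> (ds.lang)

tokenize(ds.text, ds.lang) -> (ds.tokens)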

Outputs


tokens: column:list[category]

A column of lists containing the tokens extracted from the texts.

Parameters


extended_language_support: boolean = False

Whether to enable support for additional languages. By default, Catalan and Basque are not enabled, since they're supported only by a different class of language models that is much slower than the rest. This parameter can be used to enable them.
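
For example, to process a dataset that also contains Catalan or Basque texts, the slower models could be enabled like this (the configuration is illustrative):

tokenize(ds.text, ds.lang, {
  "extended_language_support": true
}) -> (ds.tokens)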


min_language_freq: number | integer = 0.02

Minimum number (or proportion) of texts required for a language to be included in processing. Any texts in a language with fewer documents than this will be ignored. This can be useful to speed up processing when the input languages contain noise and ignoring languages with only a small number of documents is acceptable. Values smaller than 1 are interpreted as a proportion of all texts, and values greater than or equal to 1 as an absolute number of documents.
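
For example, to ignore any language that accounts for fewer than 5% of all texts (an illustrative threshold; a value of 10 would instead mean at least 10 documents):

tokenize(ds.text, ds.lang, {
  "min_language_freq": 0.05
}) -> (ds.tokens)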


tokens: object

Configure how tokens are extracted and represented in the output: which kinds of tokens to extract, how to represent them, and how frequent they must (or may) be in the dataset to be included in the result.

Items in tokens

attrib: string = "lower"

Representation of the individual tokens to extract, i.e. whether to keep them verbatim (text/orth), lower-cased (lower) or lemmatized (lemma). Also see spaCy's token attribute reference for further information.

Must be one of: "orth", "lemma", "lower", "text"
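
For example, to represent each token by its lemma instead of its lower-cased form (an illustrative configuration):

tokenize(ds.text, ds.lang, {
  "tokens": {
    "attrib": "lemma"
  }
}) -> (ds.tokens)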


exclude: array[string] = ['punct', 'urls']

Which kinds of tokens to exclude. Valid filters are stop words (stops), URLs (urls), punctuation (punct), digits (digits), tokens containing non-alphabetic characters (non_alpha), and tokens containing non-ASCII characters (non_ascii).

Items in exclude

item: string

Must be one of: "stops", "urls", "punct", "digits", "non_alpha", "non_ascii"
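
For example, to additionally drop digits and tokens containing non-alphabetic characters (an illustrative configuration extending the default filters):

tokenize(ds.text, ds.lang, {
  "tokens": {
    "exclude": ["punct", "urls", "stops", "digits", "non_alpha"]
  }
}) -> (ds.tokens)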


frequency_filter: object

Token frequency filter. Filters tokens based on the number of texts they occur in.

Items in frequency_filter

min_rows: integer = 2

Minimum number of rows. Tokens not occurring in at least this many rows (texts) will be excluded.

Range: 0 ≤ min_rows < inf


max_rows: number = 0.5

Maximum proportion of rows. Tokens occurring in more than this proportion of rows (texts) will be excluded.

Range: 0 ≤ max_rows ≤ 1


keep_top_n: integer | null

Keep n most frequent tokens. Always include the n most frequent tokens, independent of the other filter parameters. Set to null to disable.

Range: 0 ≤ keep_top_n < inf


filter_top_n: integer = 0

Exclude n most frequent tokens. These will be excluded even if they pass the other filter conditions.

Range: 0 ≤ filter_top_n < inf


by_lang: boolean = False

Filter per language. Apply filter conditions separately to texts grouped by language, rather than across all texts.
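
As an example, the following configuration keeps tokens occurring in at least 5 rows but in no more than 40% of rows, always retains the 1000 most frequent tokens, and applies these conditions per language rather than globally (all values are illustrative):

tokenize(ds.text, ds.lang, {
  "tokens": {
    "frequency_filter": {
      "min_rows": 5,
      "max_rows": 0.4,
      "keep_top_n": 1000,
      "by_lang": true
    }
  }
}) -> (ds.tokens)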