# tokenize

> Parse texts and separate them into lists of tokens (words, lemmas, etc.). 

## Usage

The following example shows how the step can be used in a recipe.

<Accordion title="Examples" icon="code" defaultOpen="true">
  <Tabs>
    <Tab title="Example 1">
      To convert texts into lists of lower-cased words, excluding tokens that represent
      punctuation (`punct`), URLs (`urls`) and stop words (`stops`), use:

      ```stan theme={null}
      tokenize(ds.text, ds.lang, {
        "tokens": {
          "exclude": ["punct", "urls", "stops"]
        }
      }) -> (ds.tokens)
      ```
    </Tab>

    <Tab title="Signature">
      General syntax for using the step in a recipe. It shows the inputs the step expects and the outputs it produces. For further details, see the sections below.

      ```stan theme={null}
      tokenize(text: text, *lang: category, {
          "param": value,
          ...
      }) -> (tokens: list[category])
      ```
    </Tab>
  </Tabs>
</Accordion>

## Inputs & Outputs

The following are the inputs expected by the step and the outputs it produces. These are generally
columns (`ds.first_name`), datasets (`ds` or `ds[["first_name", "last_name"]]`) or models (referenced
by name e.g. `"churn-clf"`).

<Accordion title="Inputs" icon="right-to-bracket">
  <ParamField path="text" type="column[text]" required>
    A column of texts to separate into tokens.
  </ParamField>

  <ParamField path="*lang" type="column[category]">
    An (optional) column identifying the language of each corresponding text. It is used to select the appropriate
    spaCy model for each text. If the dataset doesn't contain such a column yet, it can be created using the `infer_language` step.
    Ideally, languages should be expressed as two-letter
    [ISO 639-1 language codes](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes), such as "en", "es" or "de" for
    English, Spanish or German respectively. Fully spelled-out names such as "english", "German", "allemande"
    etc. are also detected, but not every possible spelling is guaranteed to be recognized, so ISO codes should be
    preferred.

    Alternatively, if all texts are in the same language, it can be identified with the `language` *parameter* instead.
  </ParamField>
</Accordion>

<Accordion title="Outputs" icon="right-from-bracket">
  <ParamField path="tokens" type="column[list[category]]" required>
    A column of lists containing the tokens extracted from the texts.
  </ParamField>
</Accordion>

## Configuration

The following parameters can be used to configure the behaviour of the step by including them in
a JSON object as the last "input" to the step, i.e. `step(..., {"param": "value", ...}) -> (output)`.

<Accordion title="Parameters" defaultOpen="true" icon="sliders">
  <ParamField path="extended_language_support" type="boolean" default="false">
    Whether to enable support for additional languages.
    By default, Arabic ("ar"), Catalan ("ca"), Basque ("eu"), and Turkish ("tr") are not enabled,
    since they're supported only by a different class of language models (Stanford NLP's Stanza)
    that is much slower than the rest. This parameter can be used to enable them.
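
    As an illustration, the flag could be switched on as follows. This is a sketch that assumes a dataset `ds` with `text` and `lang` columns, as in the usage example above, where some texts are in one of the languages listed:

    ```stan theme={null}
    tokenize(ds.text, ds.lang, {
      "extended_language_support": true
    }) -> (ds.tokens)
    ```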
  </ParamField>

  <ParamField path="min_language_freq" type="[number, integer]" default="0.02">
    Minimum number (or proportion) of texts a language must have to be included in processing.
    Texts in any language with fewer documents than this threshold will be ignored. This can be useful to speed up
    processing when there is noise in the input languages, and when ignoring languages with only a small number of
    documents is acceptable. Values smaller than 1 will be interpreted as a *proportion* of all texts, and
    values greater than or equal to 1 as an *absolute number* of documents.
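
    A hypothetical configuration illustrating the two interpretations, assuming a dataset `ds` with `text` and `lang` columns (the threshold values are only examples). With `0.05`, texts in languages making up less than 5% of all texts would be ignored; replacing the value with `100` would instead ignore languages with fewer than 100 texts:

    ```stan theme={null}
    tokenize(ds.text, ds.lang, {
      "min_language_freq": 0.05
    }) -> (ds.tokens)
    ```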

    <Accordion title="Options">
      <Tabs>
        <Tab title="number">
          <ParamField path="{_}" type="number">
            number.

            Values must be in the following range:

            ```javascript theme={null}
            0 < {_} < 1
            ```
          </ParamField>
        </Tab>

        <Tab title="integer">
          <ParamField path="{_}" type="integer">
            integer.

            Values must be in the following range:

            ```javascript theme={null}
            1 ≤ {_} < inf
            ```
          </ParamField>
        </Tab>
      </Tabs>
    </Accordion>
  </ParamField>

  <ParamField path="language" type="[string, null]">
    The language of the input texts.
    If all texts are in the same language, it can be specified here instead of passing it as an input column. The language will be used to identify the correct spaCy model to parse and analyze the texts. For allowed values, see the description of the `lang` input column above.
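
    For example, if every text in the dataset is known to be English, the `lang` column could be omitted and the language given as a parameter. This is a minimal sketch, assuming a dataset `ds` with a `text` column:

    ```stan theme={null}
    tokenize(ds.text, {
      "language": "en"
    }) -> (ds.tokens)
    ```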
  </ParamField>

  <ParamField path="tokens" type="object">
    Configure how tokens are extracted and represented in the output.
    Define the kinds of tokens to extract, how to represent them, and their minimum or maximum frequency in the
    dataset to be included in the result.

    <Accordion title="Properties">
      <ParamField path="attrib" type="string" default="lower">
        Representation of the individual tokens to extract.
        I.e. whether tokens are kept verbatim (`text`/`orth`), lower-cased (`lower`) or lemmatized (`lemma`). See also
        spaCy's [token attribute reference](https://spacy.io/api/token#attributes) for further information.

        Values must be one of the following:

        * `orth`
        * `lemma`
        * `lower`
        * `text`
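
        For instance, to output lemmas rather than lower-cased words, a configuration along these lines could be used. This is a sketch that assumes a dataset `ds` with `text` and `lang` columns and leaves all other settings at their defaults:

        ```stan theme={null}
        tokenize(ds.text, ds.lang, {
          "tokens": {
            "attrib": "lemma"
          }
        }) -> (ds.tokens)
        ```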
      </ParamField>

      <ParamField path="exclude" type="array[string]" default="['punct', 'urls']">
        Which kinds of tokens to exclude.
        Valid filters are stop words (`stops`), URLs (`urls`), punctuation (`punct`), digits (`digits`),
        tokens containing non-alphabetic characters (`non_alpha`), and tokens containing non-ASCII
        characters (`non_ascii`).
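
        As a sketch, the following would additionally drop digits and tokens with non-ASCII characters. It assumes a dataset `ds` with `text` and `lang` columns, and lists the default filters explicitly on the assumption that a custom list replaces the defaults rather than extending them:

        ```stan theme={null}
        tokenize(ds.text, ds.lang, {
          "tokens": {
            "exclude": ["punct", "urls", "digits", "non_ascii"]
          }
        }) -> (ds.tokens)
        ```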

        <Accordion title="Array items">
          <ParamField path="Item" type="string">
            Each item in array.

            Values must be one of the following:

            `stops` `urls` `punct` `digits` `non_alpha` `non_ascii`
          </ParamField>
        </Accordion>
      </ParamField>

      <ParamField path="frequency_filter" type="object">
        Token frequency filter.
        Filters tokens based on the number of texts they occur in.

        <Accordion title="Properties">
          <ParamField path="min_rows" type="integer" default="2">
            Minimum number of rows.
            Tokens not occurring in at least this many rows (texts) will be excluded.

            Values must be in the following range:

            ```javascript theme={null}
            0 ≤ min_rows < inf
            ```
          </ParamField>

          <ParamField path="max_rows" type="number" default="0.5">
            Maximum proportion of rows.
            Tokens occurring in more than this proportion of rows (texts) will be excluded.

            Values must be in the following range:

            ```javascript theme={null}
            0 ≤ max_rows ≤ 1
            ```
          </ParamField>

          <ParamField path="keep_top_n" type="[integer, null]">
            Keep n most frequent tokens.
            If set, the n most frequent tokens are always included, independent of the other filter parameters.
            Set to `null` to ignore.

            Values must be in the following range:

            ```javascript theme={null}
            0 ≤ keep_top_n < inf
            ```
          </ParamField>

          <ParamField path="filter_top_n" type="integer" default="0">
            Exclude n most frequent tokens.
            These tokens are excluded even if they pass the other filter conditions.

            Values must be in the following range:

            ```javascript theme={null}
            0 ≤ filter_top_n < inf
            ```
          </ParamField>

          <ParamField path="by_lang" type="boolean" default="false">
            Filter per language.
            Apply filter conditions separately to texts grouped by language, rather than across all texts.
          </ParamField>
        </Accordion>
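
        The properties above can be combined in a single filter. The following sketch assumes a dataset `ds` with `text` and `lang` columns, and its values are purely illustrative. It keeps only tokens that occur in at least 5 texts and in no more than 30% of texts, drops the 10 most frequent tokens, and applies these conditions separately per language:

        ```stan theme={null}
        tokenize(ds.text, ds.lang, {
          "tokens": {
            "frequency_filter": {
              "min_rows": 5,
              "max_rows": 0.3,
              "filter_top_n": 10,
              "by_lang": true
            }
          }
        }) -> (ds.tokens)
        ```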
      </ParamField>
    </Accordion>
  </ParamField>
</Accordion>
