Extract ngrams


Parse texts and extract their n-grams.

An n-gram here means a contiguous sequence of n words in the original text. The step extracts all n-grams of a given text, i.e. the sequences starting at each individual word, in original order. The result is one list of n-grams per input text, where each n-gram is a single text string with individual words separated by spaces (unless configured otherwise). The maximum size and kind of n-grams extracted, as well as how to represent them in the result, can be configured via the parameters described below. The step can also filter n-grams based on their frequency in the dataset.
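
As a rough illustration of the definition above, the following Python sketch lists every contiguous word sequence of up to n_max words, starting at each word in turn. It is not the step's implementation: the function name extract_ngrams_sketch is hypothetical, tokenization is a naive whitespace split rather than a spaCy parse, no filtering is applied, and the order of n-grams within the real step's output may differ.

```python
def extract_ngrams_sketch(text, n_max=4):
    """Return all contiguous word sequences of length 1..n_max, in original order."""
    tokens = text.split()  # assumption: naive whitespace tokenization, not spaCy
    ngrams = []
    for start in range(len(tokens)):
        for n in range(1, n_max + 1):
            if start + n > len(tokens):
                break  # not enough words left for a longer n-gram
            ngrams.append(" ".join(tokens[start:start + n]))
    return ngrams


print(extract_ngrams_sketch("the quick brown fox", n_max=2))
# ['the', 'the quick', 'quick', 'quick brown', 'brown', 'brown fox', 'fox']
```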


The following are the step's expected inputs and outputs and their specific types.

Step signature
extract_ngrams(
    text: text,
    *lang: category, {
    "param": value
}) -> (ngrams: list[category])

where the object {"param": value} is optional in most cases and if present may contain any of the parameters described in the corresponding section below.


Using a custom configuration to select the maximum size (n) of the n-grams, and how to represent them:

Example call (in recipe editor)
extract_ngrams(ds.text, ds.lang, {
  "ngrams": {
    "n_max": 4,
    "filters": ["punct", "stops"],
    "attrib": "lower",
    "unigram_lemmas": true,
    "concat": false
  }
}) -> (ds.ngrams)


text: column:text

A text column to extract n-grams from.

*lang: column:category

An (optional) column identifying the languages of the corresponding texts. It is used to select the correct spaCy model for each text. If the dataset doesn't contain such a column yet, it can be created using the infer_language step. Ideally, languages should be expressed as two-letter ISO 639-1 language codes, such as "en", "es" or "de" for English, Spanish or German respectively. Fully spelled-out names such as "english", "German", "allemande" etc. are also detected, but there is no guarantee that every possible spelling will be recognized, so ISO codes should be preferred.

Alternatively, if all texts are in the same language, it can be identified with the language parameter instead.


ngrams: column:list[category]

A column of lists containing the n-grams extracted from the texts.


extended_language_support: boolean = False

Whether to enable support for additional languages. By default, Arabic ("ar"), Catalan ("ca"), Basque ("eu"), and Turkish ("tr") are not enabled, since they're supported only by a different class of language models (Stanford NLP's Stanza) that is much slower than the rest. This parameter can be used to enable them.

min_language_freq: number | integer = 0.02

Minimum number (or proportion) of texts a language must have to be included in processing. Texts in any language with fewer documents than this will be ignored. This can speed up processing when the input languages are noisy and it is acceptable to ignore languages with only a small number of documents. Values smaller than 1 will be interpreted as a proportion of all texts, and values greater than or equal to 1 as an absolute number of documents.
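
The threshold rule above can be sketched in Python as follows. This is only an illustration of how a value below 1 versus a value of 1 or more might be interpreted; the helper name languages_to_keep is hypothetical and not part of the step.

```python
from collections import Counter


def languages_to_keep(langs, min_language_freq=0.02):
    """Return the set of languages with enough texts to be processed."""
    counts = Counter(langs)
    if min_language_freq < 1:
        # Values below 1: proportion of all texts
        threshold = min_language_freq * len(langs)
    else:
        # Values of 1 or more: absolute number of texts
        threshold = min_language_freq
    return {lang for lang, count in counts.items() if count >= threshold}


langs = ["en"] * 95 + ["es"] * 4 + ["de"] * 1
print(sorted(languages_to_keep(langs, 0.02)))  # ['en', 'es']
print(sorted(languages_to_keep(langs, 5)))     # ['en']
```

With the default of 0.02 and 100 texts, a language needs at least 2 texts, so the single German text is ignored.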

language: string | null

The language of the input texts. If all texts are in the same language, it can be specified here instead of passing it as an input column. The language will be used to identify the correct spaCy model to parse and analyze the texts. For allowed values, see the comment regarding the lang column above.

ngrams: object

N-gram configuration. Configure maximum size, which words/tokens to exclude, and how to represent the n-grams in the result.

Items in ngrams

n_max: number = 4

N-grams with up to this number of words will be extracted.

Range: 1 ≤ n_max ≤ 4

filters: array[string] = ['punct', 'stops']

Exclude these kinds of tokens from n-grams. For longer n-grams, those containing either extraneous whitespace or any punctuation are automatically excluded. Additionally, if stops is included in filters, n-grams containing stopwords as the first and/or last token are also excluded.

Items in filters

item: string

Must be one of: "punct", "stops", "url", "digits", "non_alpha", "non_ascii"
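
To make the exclusion rules for filters more concrete, here is a minimal Python sketch covering only the "punct" and "stops" cases. The stopword set, the is_punct helper and the keep_ngram function are assumptions made for illustration; the step itself relies on spaCy's token attributes, and the remaining filter kinds are not modelled.

```python
import string

STOPWORDS = {"the", "of", "a", "in"}  # assumption: tiny illustrative stopword list


def is_punct(token):
    """True if the token consists only of ASCII punctuation characters."""
    return token != "" and all(ch in string.punctuation for ch in token)


def keep_ngram(tokens, filters=("punct", "stops")):
    """Decide whether an n-gram (given as a list of tokens) survives filtering."""
    if len(tokens) > 1:
        # Longer n-grams with whitespace-only or punctuation tokens are always dropped
        if any(tok.strip() == "" or is_punct(tok) for tok in tokens):
            return False
        # With "stops", only the first and last tokens are checked
        if "stops" in filters and (tokens[0] in STOPWORDS or tokens[-1] in STOPWORDS):
            return False
        return True
    token = tokens[0]
    if "punct" in filters and is_punct(token):
        return False
    if "stops" in filters and token in STOPWORDS:
        return False
    return True


print(keep_ngram(["king", "of", "spain"]))  # True: stopword only in the middle
print(keep_ngram(["the", "king"]))          # False: starts with a stopword
print(keep_ngram(["king", ",", "queen"]))   # False: contains punctuation
```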

attrib: string = "lower"

Representation of the individual words/tokens to extract, i.e. whether verbatim (text/orth), lower-case (lower) or lemmatized (lemma). Also see spaCy's token attribute reference for further information.

Must be one of: "orth", "lemma", "lower", "text"

unigram_lemmas: boolean = True

Whether unigrams should always be extracted lemmatized, irrespective of the attrib parameter.

concat: boolean = False

Whether to separate the words in n-grams with an underscore character instead of a space.
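
How attrib, unigram_lemmas and concat combine can be sketched as below. Tokens are modelled as plain dictionaries with "text", "lower" and "lemma" keys, which is a hypothetical stand-in for spaCy tokens; represent_ngram is likewise an illustrative name, not part of the step.

```python
def represent_ngram(tokens, attrib="lower", unigram_lemmas=True, concat=False):
    """Build the string representation of one n-gram from its tokens."""
    # Unigrams use the lemma when unigram_lemmas is set, regardless of attrib
    key = "lemma" if unigram_lemmas and len(tokens) == 1 else attrib
    sep = "_" if concat else " "
    return sep.join(tok[key] for tok in tokens)


# Hypothetical token records; a real parse would supply these values
running = {"text": "Running", "lower": "running", "lemma": "run"}
shoes = {"text": "Shoes", "lower": "shoes", "lemma": "shoe"}

print(represent_ngram([running]))                      # run
print(represent_ngram([running, shoes]))               # running shoes
print(represent_ngram([running, shoes], concat=True))  # running_shoes
```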

frequency_filter: object

N-gram frequency filter. Filters n-grams based on the number of texts they occur in.

Items in frequency_filter

min_rows: integer = 2

Minimum number of rows. N-grams not occurring in at least this many rows (texts) will be excluded.

Range: 0 ≤ min_rows < inf

max_rows: number = 0.5

Maximum proportion of rows. N-grams occurring in more than this proportion of rows (texts) will be excluded.

Range: 0 ≤ max_rows ≤ 1

keep_top_n: integer | null

Keep n most frequent keywords. Whether to always include the n most frequent n-grams, independent of the other filter parameters. Set to null to ignore.

Range: 0 ≤ keep_top_n < inf

filter_top_n: integer = 0

Exclude n most frequent keywords. Exclude the n most frequent n-grams, even if they passed the other filter conditions.

Range: 0 ≤ filter_top_n < inf

by_lang: boolean = False

Filter per language. Apply filter conditions separately to texts grouped by language, rather than across all texts.
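
The frequency filter can be approximated with the Python sketch below. It rests on several assumptions that the description above does not settle: "most frequent" is taken to mean occurring in the most rows, keep_top_n is applied before filter_top_n, and by_lang is not modelled. The function name frequency_filter_sketch is hypothetical.

```python
from collections import Counter


def frequency_filter_sketch(ngram_lists, min_rows=2, max_rows=0.5,
                            keep_top_n=None, filter_top_n=0):
    """Filter each row's n-grams by the number of rows they occur in."""
    n_rows = len(ngram_lists)
    # Row frequency: each n-gram is counted at most once per row
    row_freq = Counter(ng for row in ngram_lists for ng in set(row))
    ranked = [ng for ng, _ in row_freq.most_common()]
    allowed = {ng for ng, freq in row_freq.items()
               if freq >= min_rows and freq / n_rows <= max_rows}
    if keep_top_n:
        allowed |= set(ranked[:keep_top_n])    # always keep the top n (assumed order)
    if filter_top_n:
        allowed -= set(ranked[:filter_top_n])  # then drop the top n
    return [[ng for ng in row if ng in allowed] for row in ngram_lists]


rows = [["a", "b"], ["a", "c"], ["a", "b"], ["a", "d"]]
print(frequency_filter_sketch(rows))
# [['b'], [], ['b'], []]
print(frequency_filter_sketch(rows, keep_top_n=1))
# [['a', 'b'], ['a'], ['a', 'b'], ['a']]
```

In the first call, "a" occurs in all 4 rows and exceeds max_rows, while "c" and "d" occur in only one row each and fall below min_rows.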