An n-gram here means a contiguous sequence of n words in the original text. The step extracts all n-grams from each text, starting at each individual word in its original order. The result is one list of n-grams per input text, where each n-gram is a single text string with its words separated by spaces (unless configured otherwise). The maximum size and kind of n-grams extracted, as well as how they are represented in the result, can be configured via the parameters described below. The step can also filter n-grams based on their frequency in the dataset.
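
As a rough illustration of the extraction described above, the following minimal sketch collects n-grams starting at each word in order. It is not the step's actual implementation: the function name `extract_ngrams`, its parameters `max_n` and `sep`, the whitespace tokenization, and the ordering of n-grams within the result are all assumptions made for this example (the step itself relies on language models to tokenize texts).

```python
def extract_ngrams(text, max_n=2, sep=" "):
    """Return all n-grams of sizes 1..max_n, starting at each word in order.

    Hypothetical helper: splits on whitespace instead of using a language model.
    """
    words = text.split()
    ngrams = []
    for start in range(len(words)):
        for n in range(1, max_n + 1):
            # Only keep n-grams that fit entirely within the text.
            if start + n <= len(words):
                ngrams.append(sep.join(words[start:start + n]))
    return ngrams


print(extract_ngrams("the quick brown fox"))
# ['the', 'the quick', 'quick', 'quick brown', 'brown', 'brown fox', 'fox']
```

Passing a different `sep` in this sketch mirrors the idea that the separator between words in an n-gram can be configured.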

extended_language_support
boolean

Whether to enable support for additional languages. By default, Arabic (“ar”), Catalan (“ca”), Basque (“eu”), and Turkish (“tu”) are not enabled, since they are supported only by a different class of language models (Stanford NLP’s Stanza) that is much slower than the rest. Set this parameter to enable them.

min_language_freq
[number, integer]
default: "0.02"

Minimum number (or proportion) of texts a language must have to be included in processing. Texts in any language with fewer documents than this threshold will be ignored. This can speed up processing when there is noise in the input languages and it is acceptable to ignore languages that have only a small number of documents. Values smaller than 1 will be interpreted as a proportion of all texts, and values greater than or equal to 1 as an absolute number of documents.
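
The threshold rule can be sketched as follows. This is only an illustration under stated assumptions: the helper `languages_to_keep` is hypothetical, and treating a language whose count exactly equals the threshold as kept is one reading of "fewer documents than".

```python
from collections import Counter


def languages_to_keep(langs, min_language_freq=0.02):
    """Return the set of languages with enough texts to be processed.

    Hypothetical helper: `langs` holds one language code per input text.
    """
    counts = Counter(langs)
    if min_language_freq < 1:
        # Values below 1 are a proportion of all texts.
        threshold = min_language_freq * len(langs)
    else:
        # Values of 1 or more are an absolute number of documents.
        threshold = min_language_freq
    return {lang for lang, count in counts.items() if count >= threshold}


langs = ["en"] * 97 + ["es"] * 2 + ["de"] * 1
print(sorted(languages_to_keep(langs, 0.02)))  # ['en', 'es']
print(sorted(languages_to_keep(langs, 5)))     # ['en']
```

With 100 texts, a value of 0.02 corresponds to a threshold of 2 documents, so the single German text is dropped while the two Spanish texts are kept.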

language
[string, null]

The language of the input texts. If all texts are in the same language, it can be specified here instead of being passed as an input column. The language will be used to identify the correct spaCy model to parse and analyze the texts. For allowed values, see the comment regarding the lang column above.

ngrams
object

N-gram configuration. Configure maximum size, which words/tokens to exclude, and how to represent the n-grams in the result.