Extract keywords

NLP · text

Parse and extract keywords from texts.

The text elements considered keywords are configurable. They can include detected noun phrases (compound nouns like 'the quick brown fox'), automatically recognized entities (people, products, events), or words of any lexical category, such as nouns, verbs or adjectives.


To extract only nouns of all kinds, i.e. entities, compound nouns, simple nouns and proper nouns (names):

extract_keywords(ds.text, ds.lang, {
    "keywords": {
        "entities": true,
        "noun_phrases": true,
        "pos_tags": ["NOUN", "PROPN"]
    }
}) -> (ds.keywords)

More examples

To also include adjectives, and to limit keywords to those that occur in at least 3 documents but in no more than 90% of all documents:

extract_keywords(ds.text, ds.lang, {
    "keywords": {
        "entities": true,
        "noun_phrases": true,
        "pos_tags": ["NOUN", "PROPN", "ADJ"],
        "frequency_filter": {
            "min_rows": 3,
            "max_rows": 0.9
        }
    }
}) -> (ds.keywords)


The following are the step's expected inputs and outputs and their specific types.

extract_keywords(
    text: text,
    lang: category,
    {
        "param": value
    }
) -> (keywords: list[category])

where the object {"param": value} is optional in most cases and, if present, may contain any of the parameters described in the corresponding section below.


text: column:text

A text column to extract keywords from.

lang: column:category

A column identifying the language of each corresponding text. If the dataset doesn't contain such a column yet, it can be created using the infer_language step. Ideally, languages should be expressed as two-letter ISO 639-1 language codes, such as "en", "es" or "de" for English, Spanish or German respectively. Fully spelled-out names such as "english", "German" or "allemande" are also detected, but there is no guarantee that every possible spelling will be recognized correctly, so ISO codes should be preferred.


keywords: column:list[category]

Lists containing the keywords mentioned in each text.


extended_language_support: boolean = False

Whether to enable support for additional languages. By default, Catalan and Basque are not enabled, since they're supported only by a different class of language models that is much slower than the rest. This parameter can be used to enable them.

min_language_freq: number | integer = 0.02

Minimum number (or proportion) of texts a language must have to be included in processing. Texts in any language with fewer documents than this threshold will be ignored. This can be useful to speed up processing when the input languages are noisy and it is acceptable to ignore languages that have only a small number of documents. Values smaller than 1 will be interpreted as a proportion of all texts, and values greater than or equal to 1 as an absolute number of documents.
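To make the threshold rule concrete, here is a minimal Python sketch of how such a value could be interpreted. It is an illustration only, not the step's actual implementation: the function name languages_to_keep is hypothetical, and keeping a language whose count exactly equals the threshold is an assumption.

```python
from collections import Counter


def languages_to_keep(langs, min_language_freq=0.02):
    """Return the set of languages that pass the minimum-frequency threshold.

    Hypothetical helper: values below 1 are read as a proportion of all
    texts, values of 1 or more as an absolute number of documents.
    """
    counts = Counter(langs)
    total = len(langs)
    if min_language_freq < 1:
        threshold = min_language_freq * total  # proportion of all texts
    else:
        threshold = min_language_freq  # absolute number of documents
    # Assumption: a language with exactly `threshold` documents is kept.
    return {lang for lang, n in counts.items() if n >= threshold}


langs = ["en"] * 95 + ["es"] * 4 + ["de"] * 1

kept_default = languages_to_keep(langs)      # threshold: 0.02 * 100 = 2 texts
kept_absolute = languages_to_keep(langs, 5)  # threshold: 5 texts

print(sorted(kept_default))   # ['en', 'es']
print(sorted(kept_absolute))  # ['en']
```

With the default of 0.02 and 100 texts, the single German text falls below the threshold and would be ignored, while an absolute threshold of 5 also drops the four Spanish texts.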

keywords: object

Configure how keywords are extracted. Define the text elements considered keywords and their minimum or maximum frequency in the dataset to be included in the result.

Items in keywords

pos_tags: array[string] = ['NOUN', 'PROPN', 'ADJ']

Part-Of-Speech (POS) tags. Which lexical units (nouns, verbs etc.) to include as keywords. See spaCy's universal part-of-speech tags for a detailed table of allowed values.

Items in pos_tags

item: string

Must be one of: "ADJ", "ADP", "ADV", "AUX", "CONJ", "CCONJ", "DET", "INTJ", "NOUN", "NUM", "PART", "PRON", "PROPN", "PUNCT", "SCONJ", "SYM", "VERB"
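Because pos_tags only accepts the values listed above, it can be handy to check a configuration before running the step. The following Python sketch does that; the helper name invalid_pos_tags is hypothetical and not part of the step, and the assumption that tags are matched case-sensitively is mine.

```python
# Allowed values copied from the list above.
ALLOWED_POS_TAGS = {
    "ADJ", "ADP", "ADV", "AUX", "CONJ", "CCONJ", "DET", "INTJ", "NOUN",
    "NUM", "PART", "PRON", "PROPN", "PUNCT", "SCONJ", "SYM", "VERB",
}


def invalid_pos_tags(tags):
    """Return the tags that are not in the allowed set, in input order.

    Assumption: tags are compared case-sensitively, so "Noun" is rejected.
    """
    return [tag for tag in tags if tag not in ALLOWED_POS_TAGS]


print(invalid_pos_tags(["NOUN", "PROPN", "ADJ"]))   # []
print(invalid_pos_tags(["NOUN", "Noun", "VERBS"]))  # ['Noun', 'VERBS']
```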

entities: boolean = True

Whether or not to include any detected entities (people, places, events, etc.).

noun_phrases: boolean = True

Whether or not to include compound noun phrases (such as 'the quick brown fox').

frequency_filter: object

Filter keywords based on the number of texts they occur in. Filter conditions can be applied globally or per language.

Items in frequency_filter

min_rows: integer = 2

Minimum number of rows. Keywords that do not occur in at least this many rows (texts) will be excluded.

Range: 0 ≤ min_rows < inf

max_rows: number = 0.5

Maximum proportion of rows. Keywords occurring in more than this proportion of rows (texts) will be excluded.

Range: 0 ≤ max_rows ≤ 1

keep_top_n: integer

Number of most frequent keywords to always include, independent of any other filter conditions. Set to null to ignore.

Range: 0 ≤ keep_top_n < inf

filter_top_n: integer = 0

Number of most frequent keywords to always exclude, independent of any other filter conditions.

Range: 0 ≤ filter_top_n < inf

by_lang: boolean = False

Filter per language. Apply filter conditions separately to texts grouped by language, rather than across all texts.
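The effect of min_rows and max_rows can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the step's implementation: the function name filter_by_document_frequency is hypothetical, a keyword is assumed to count once per row no matter how often it repeats there, and keep_top_n, filter_top_n and by_lang are left out.

```python
from collections import Counter


def filter_by_document_frequency(keyword_lists, min_rows=2, max_rows=0.5):
    """Drop keywords that occur in too few or too many rows.

    Hypothetical helper: min_rows is an absolute number of rows,
    max_rows a proportion of all rows.
    """
    n_rows = len(keyword_lists)
    # Assumption: count each keyword at most once per row.
    doc_freq = Counter(kw for kws in keyword_lists for kw in set(kws))
    allowed = {
        kw
        for kw, n in doc_freq.items()
        if n >= min_rows and n / n_rows <= max_rows
    }
    # Preserve the original order of keywords within each row.
    return [[kw for kw in kws if kw in allowed] for kws in keyword_lists]


rows = [
    ["fox", "dog"],
    ["fox", "cat"],
    ["fox", "dog"],
    ["fox", "bird"],
]

filtered = filter_by_document_frequency(rows, min_rows=2, max_rows=0.9)
print(filtered)  # [['dog'], [], ['dog'], []]
```

Here "fox" appears in all four rows (100%, above the 90% limit) and is removed as too common, while "cat" and "bird" each appear in only one row and are removed as too rare. With by_lang enabled, the same kind of check would presumably be applied to each language's texts separately.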