
Extract entities

NLP · text

Parse texts and extract the entities mentioned (persons, organizations etc.).

Generates one column per type of entity (see below), each containing lists of entities detected in the corresponding text.

Example

To extract entities for all languages supported by default, use the following code. For other cases, see the parameters below.

extract_entities(ds.text, ds.lang) -> (ds.people, ds.groups, ds.organizations, ds.gpes, ds.locations, ds.products, ds.events, ds.money)

Usage

The following are the step's expected inputs and outputs and their specific types.

extract_entities(
    text: text,
    lang: category, 
    {
        "param": value
    }
) -> (
    People: list[category],
    Groups: list[category],
    Organizations: list[category],
    GPEs: list[category],
    Locations: list[category],
    Products: list[category],
    Events: list[category],
    Money: list[category]
)

where the object {"param": value} is optional in most cases and, if present, may contain any of the parameters described in the Parameters section below.

Inputs


text: column:text

A text column to extract entities from.


lang: column:category

A column identifying the language of each corresponding text. If the dataset doesn't contain such a column yet, it can be created using the infer_language step. Ideally, languages should be expressed as two-letter ISO 639-1 language codes, such as "en", "es" or "de" for English, Spanish or German respectively. Fully spelled-out names such as "english", "German", "allemande" etc. are also detected, but not every possible spelling is guaranteed to be recognized, so ISO codes should be preferred.
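If a dataset stores languages as free-form names, one option is to normalize them to ISO 639-1 codes before running the step. The following Python sketch illustrates the idea outside the recipe language; the alias table and the function name to_iso_code are hypothetical and do not reflect the step's internal list of recognized spellings.

```python
# Hypothetical alias table for illustration only; the step's own list of
# recognized spellings is not documented here.
ISO_ALIASES = {
    "english": "en",
    "spanish": "es",
    "german": "de",
    "allemande": "de",
}


def to_iso_code(label):
    """Return a lowercase two-letter code, or None if the label is unknown."""
    cleaned = label.strip().lower()
    if cleaned in ISO_ALIASES:
        return ISO_ALIASES[cleaned]
    # Pass through labels that already look like two-letter codes.
    if len(cleaned) == 2 and cleaned.isalpha():
        return cleaned
    return None
```

Labels that map to None can then be reviewed by hand rather than relying on automatic name detection.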

Outputs


People: column:list[category]

Lists of people detected in the texts (including fictional).


Groups: column:list[category]

Lists of nationalities and religious or political groups detected in the texts.


Organizations: column:list[category]

Lists of organizations detected in the texts (companies, agencies, institutions, etc.).


GPEs: column:list[category]

Lists of geo-political entities detected in the texts, i.e. countries, cities, states etc.


Locations: column:list[category]

Lists of locations detected in the texts, other than GPEs, such as mountain ranges, bodies of water etc.


Products: column:list[category]

Lists of products detected in the texts (objects, vehicles, foods, etc., not services).


Events: column:list[category]

Lists of events detected in the texts (e.g. named hurricanes, battles, wars, sports events, etc.).


Money: column:list[category]

Lists of monetary values detected in the texts, including their units.

Parameters


extended_language_support: boolean = False

Whether to enable support for additional languages. By default, Catalan and Basque are not enabled, since they're supported only by a different class of language models that is much slower than the rest. This parameter can be used to enable them.
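As a rough aid for deciding whether the slower models are needed, the Python sketch below checks whether any text is labelled as Catalan or Basque. It assumes the language column already holds ISO 639-1 codes ("ca" and "eu"); the helper name needs_extended_support is hypothetical and not part of the step.

```python
# ISO 639-1 codes for the two languages that are disabled by default.
SLOW_MODEL_LANGS = {"ca", "eu"}  # Catalan, Basque


def needs_extended_support(langs):
    """Return True if any language label requires the slower models."""
    return any(lang in SLOW_MODEL_LANGS for lang in langs)
```

If this returns False for a dataset, leaving extended_language_support at its default avoids the extra processing cost.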


min_language_freq: number | integer = 0.02

Minimum number (or proportion) of texts a language must have to be included in processing. Texts in any language with fewer documents than this threshold will be ignored. This can speed up processing when the input languages are noisy and it is acceptable to ignore languages with only a small number of documents. Values smaller than 1 are interpreted as a proportion of all texts, and values greater than or equal to 1 as an absolute number of documents.
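The threshold rule can be sketched in Python as follows. This is only an illustration of the documented interpretation (proportion below 1, absolute count otherwise), not the step's actual implementation; the function name languages_to_process is hypothetical.

```python
from collections import Counter


def languages_to_process(langs, min_language_freq=0.02):
    """Return the set of languages that meet the minimum frequency."""
    counts = Counter(langs)
    total = len(langs)
    if min_language_freq < 1:
        # Values below 1 are a proportion of all texts.
        threshold = min_language_freq * total
    else:
        # Values of 1 or more are an absolute number of documents.
        threshold = min_language_freq
    # Languages with fewer documents than the threshold are ignored.
    return {lang for lang, n in counts.items() if n >= threshold}
```

For example, with 100 texts and the default of 0.02, a language needs at least 2 texts to be processed; with min_language_freq set to 5, it needs at least 5.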