extract_entities

Generates one column per type of entity (see below), each containing lists of entities detected in the corresponding text.

Usage

The following example shows how the step can be used in a recipe.

Examples

Example 1
To extract entities for all languages supported by default, use the following code. Otherwise, see the parameters below.

extract_entities(ds.text, ds.lang) -> (
    ds.people,
    ds.groups,
    ds.organizations,
    ds.gpes,
    ds.locations,
    ds.products,
    ds.events,
    ds.money
)

Signature

General syntax for using the step in a recipe. It shows the inputs the step expects and the outputs it produces. For further details, see the sections below.

extract_entities(text: text, *lang: category, {
    "param": value,
    ...
}) -> (
    People: list[category],
    Groups: list[category],
    Organizations: list[category],
    GPEs: list[category],
    Locations: list[category],
    Products: list[category],
    Events: list[category],
    Money: list[category]
)

Inputs & Outputs

The following are the inputs expected by the step and the outputs it produces. These are generally columns (ds.first_name), datasets (ds or ds[["first_name", "last_name"]]) or models (referenced by name e.g. "churn-clf").

Inputs

text

column[text]

required

A text column to extract entities from.

*lang

column[category]

An optional column identifying the language of each corresponding text. It is used to select the correct spaCy model for each text. If the dataset doesn’t yet contain such a column, it can be created using the infer_language step. Ideally, languages should be expressed as two-letter ISO 639-1 language codes, such as “en”, “es” or “de” for English, Spanish or German respectively. Fully spelled-out names such as “english”, “German”, “allemande” etc. are also detected, but not every possible spelling is guaranteed to be recognized, so ISO codes should be preferred. Alternatively, if all texts are in the same language, it can be specified with the language parameter instead.
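As a rough sketch, a language column could be created first and then passed to this step. This assumes that infer_language takes the text column as its input and produces a single language column; that signature is not confirmed here, so check that step's own documentation. The column names ds.lang, ds.people, ds.groups and so on are placeholders chosen for illustration.

```
infer_language(ds.text) -> (ds.lang)

extract_entities(ds.text, ds.lang) -> (
    ds.people,
    ds.groups,
    ds.organizations,
    ds.gpes,
    ds.locations,
    ds.products,
    ds.events,
    ds.money
)
```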

Outputs

People

column[list[category]]

required

Lists of people detected in the texts (including fictional).

Groups

column[list[category]]

required

Lists of nationalities and religious or political groups detected in the texts.

Organizations

column[list[category]]

required

Lists of organizations detected in the texts (companies, agencies, institutions, etc.).

GPEs

column[list[category]]

required

Lists of geo-political entities detected in the texts, i.e. countries, cities, states etc.

Locations

column[list[category]]

required

Lists of locations detected in the texts, other than GPEs, such as mountain ranges, bodies of water etc.

Products

column[list[category]]

required

Lists of products detected in the texts (objects, vehicles, foods, etc., not services).

Events

column[list[category]]

required

Lists of events detected in the texts (e.g. named hurricanes, battles, wars, sports events, etc.).

Money

column[list[category]]

required

Lists of monetary values detected in the texts, including their units.

Configuration

The following parameters can be used to configure the behaviour of the step by including them in a JSON object as the last “input” to the step, i.e. step(..., {"param": "value", ...}) -> (output).

Parameters

extended_language_support

boolean

default:"false"

Whether to enable support for additional languages. By default, Arabic (“ar”), Catalan (“ca”), Basque (“eu”), and Turkish (“tr”) are not enabled, since they are supported only by a different class of language models (Stanford NLP’s Stanza) that is much slower than the rest. This parameter can be used to enable them.
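A minimal sketch of enabling the additional languages, following the general signature shown above. The output column names are placeholders chosen for illustration, not names required by the step.

```
extract_entities(ds.text, ds.lang, {
    "extended_language_support": true
}) -> (
    ds.people,
    ds.groups,
    ds.organizations,
    ds.gpes,
    ds.locations,
    ds.products,
    ds.events,
    ds.money
)
```

Since the Stanza models are described as much slower, it is probably worth enabling this only when the dataset actually contains texts in these languages.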

min_language_freq

[number, integer]

default:"0.02"

Minimum number (or proportion) of texts required for a language to be included in processing. Texts in any language with fewer documents than this threshold will be ignored. This can speed up processing when the input languages are noisy and it is acceptable to ignore languages with only a small number of documents. Values smaller than 1 are interpreted as a proportion of all texts, and values greater than or equal to 1 as an absolute number of documents.

Options

number: values must be in the range 0 < {_} < 1

integer: values must be in the range 1 ≤ {_} < inf
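The two ways of setting min_language_freq might look as follows. The values 0.05 and 50 and the output column names are illustrative only. The first call treats the value as a proportion, so in a dataset of 10,000 texts a language would need at least 500 texts to be processed.

```
extract_entities(ds.text, ds.lang, {
    "min_language_freq": 0.05
}) -> (
    ds.people,
    ds.groups,
    ds.organizations,
    ds.gpes,
    ds.locations,
    ds.products,
    ds.events,
    ds.money
)
```

The second call treats the value as an absolute count, so a language needs at least 50 texts regardless of the size of the dataset.

```
extract_entities(ds.text, ds.lang, {
    "min_language_freq": 50
}) -> (
    ds.people,
    ds.groups,
    ds.organizations,
    ds.gpes,
    ds.locations,
    ds.products,
    ds.events,
    ds.money
)
```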

language

[string, null]

The language of the input texts. If all texts are in the same language, it can be specified here instead of being passed as an input column. The language is used to select the correct spaCy model to parse and analyze the texts. For allowed values, see the description of the lang input above.
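For a dataset where every text is in English, a call without the lang input could be sketched like this. The output column names are placeholders chosen for illustration.

```
extract_entities(ds.text, {
    "language": "en"
}) -> (
    ds.people,
    ds.groups,
    ds.organizations,
    ds.gpes,
    ds.locations,
    ds.products,
    ds.events,
    ds.money
)
```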
