Infer language

NLP · inference · model · text

Detect the language used for each text in the input column.

Each language will be represented by its ISO 639-1 language code, such as "en", "es", "it" for English, Spanish and Italian respectively.

Example

In most cases no special configuration is necessary, so the step can be called with only its input and output columns:

infer_language(ds.text) -> (ds.language)

Usage

The following are the step's expected inputs and outputs and their specific types.

infer_language(text: text, {"param": value}) -> (lang: category)

where the object {"param": value} is optional in most cases and if present may contain any of the parameters described in the corresponding section below.
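
As a sketch of what the optional object can look like, the call below passes both documented parameters at once. It reuses the ds.text and ds.language columns from the example above; the parameter values are arbitrary and chosen only for illustration.

infer_language(ds.text, {"min_probability": 0.7, "allowed_languages": ["en", "fr", "de"]}) -> (ds.language)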

Inputs


text: column:text

A text column to detect languages for.

Outputs


lang: column:category

A column identifying the language of each text using its two-letter ISO 639-1 language code.

Parameters


min_probability: number = 0.5

Minimum probability required to assign a language to a particular text. If the model used to infer the language is less confident about a language than this threshold, the corresponding text is assigned no language and will contain a missing value (NaN) instead.

Range: 0 ≤ min_probability ≤ 1
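
For instance, assuming the same ds.text and ds.language columns as in the example above, a stricter threshold could be set like this (the value 0.8 is illustrative, not a recommendation):

infer_language(ds.text, {"min_probability": 0.8}) -> (ds.language)

With this setting, any text whose language the model is less than 80% sure about would receive NaN rather than a language code.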


allowed_languages: array[string] = ['en', 'es', 'pt', 'fr', 'de', 'it', 'eu', 'ca', 'tr', 'ar']

Only these languages will be inferred. Use this parameter to limit language detection to a smaller set if necessary.

Items in allowed_languages

item: string

Must be one of: "en", "es", "pt", "fr", "de", "it", "eu", "ca", "tr", "ar"
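
As a minimal example, assuming a dataset known to contain mostly English and Spanish texts and the same ds.text and ds.language columns as above, detection could be restricted to those two codes:

infer_language(ds.text, {"allowed_languages": ["en", "es"]}) -> (ds.language)

Every item in the array must be one of the codes listed above; other ISO 639-1 codes are not accepted by this parameter.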