Infer language

NLP, inference, model, text

Detect the language used for each text in the input column.

Each language will be represented by its ISO 639-1 language code, such as "en", "es", "it" for English, Spanish and Italian respectively.

Usage


The following are the step's expected inputs and outputs and their specific types.

Step signature
infer_language(text: text, {
    "param": value
}) -> (lang: category)

where the object {"param": value} is optional in most cases and, if present, may contain any of the parameters described in the Parameters section below.

Example

In most cases no special configuration is necessary, so the step can simply be called with the input and output columns:

Example call (in recipe editor)
infer_language(ds.text) -> (ds.language)

Inputs


text: column:text

A text column to detect languages for.

Outputs


lang: column:category

A column identifying the language of each text using its two-letter ISO 639-1 language code.

Parameters


model: string = "lingua"

Which model to use to detect languages. Select one of four models, each corresponding to a specific Python library:

  • "lingua": https://github.com/pemistahl/lingua-py
  • "fasttext": https://fasttext.cc/docs/en/language-identification.html
  • "langdetect": https://github.com/Mimino666/langdetect
  • "langid": https://github.com/saffsd/langid.py

Must be one of: "lingua", "fasttext", "langdetect", "langid"
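
As a sketch following the step signature above, a non-default model could be selected by passing the optional parameter object. The column names ds.text and ds.language are reused from the example call and stand in for your own columns.

```
infer_language(ds.text, {"model": "fasttext"}) -> (ds.language)
```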


lowercase: boolean = True

Whether to lowercase texts before detection. Some models are more sensitive to casing than others, for example when texts are written entirely in capital letters.


min_probability: number = 0.0

Minimum probability required to assign a language to a particular text. If the model's confidence in the detected language is below this value, the text is assigned the "undefined" language ("und"). A reasonable value may depend on the specific model used, since different models can produce different distributions of detection confidence.

Range: 0 ≤ min_probability ≤ 1
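
For illustration, the call below would assign "und" to any text whose detected language has a probability below 0.5. The threshold of 0.5 is an arbitrary value chosen for this example, not a recommendation, and the column names are the placeholders from the example call.

```
infer_language(ds.text, {"min_probability": 0.5}) -> (ds.language)
```

Since confidence distributions differ between models, the same threshold may mark very different shares of texts as "und" depending on the model selected.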


allowed_languages: array[string] = ['ar', 'ca', 'da', 'de', 'el', 'en', 'es', 'eu', 'fi', 'fr', 'hr', 'it', 'ja', 'lt', 'nb', 'nl', 'pl', 'pt', 'ro', 'sv', 'tr']

Restrict which languages can be inferred. Can be used to limit language detection to a smaller set if necessary. By default (when this parameter is not specified, or when it is set to true or null), detection is restricted to the languages for which we have spaCy models, because the most common use of language detection in Graphext is applying the correct spaCy language model, e.g. to extract keywords.

If set to false, will allow detection of all languages supported by the selected model.

If set to a list of ISO 639-1 codes, only these languages are detected (if supported by the model).
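
The two non-default settings might look as follows, again using the placeholder columns from the example call. The first call restricts detection to English, Spanish and Italian (an arbitrary selection for this example). The second lifts the restriction so that any language supported by the selected model can be detected.

```
infer_language(ds.text, {"allowed_languages": ["en", "es", "it"]}) -> (ds.language)
infer_language(ds.text, {"allowed_languages": false}) -> (ds.language)
```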