Usage
The following example shows how the step can be used in a recipe.
Examples
In most cases no special configuration should be necessary, so simply
Inputs & Outputs
The following are the inputs expected by the step and the outputs it produces. These are generally columns (ds.first_name), datasets (ds or ds[["first_name", "last_name"]]) or models (referenced by name, e.g. "churn-clf").
Inputs
A text column to detect languages for.
Outputs
A column identifying the language of each text using its two-letter
ISO 639-1 language code.
Configuration
The following parameters can be used to configure the behaviour of the step by including them in a JSON object as the last “input” to the step, i.e. step(..., {"param": "value", ...}) -> (output).
Parameters
Which model to use to detect languages.
Select one of four model types (each corresponding to a specific Python library):
"lingua": https://github.com/pemistahl/lingua-py
"fasttext": https://fasttext.cc/docs/en/language-identification.html
"langdetect": https://github.com/Mimino666/langdetect
"langid": https://github.com/saffsd/langid.py
Values must be one of the following: lingua, fasttext, langdetect, langid.
Whether to lowercase texts before detection.
Some models may be more sensitive than others if texts are in capital letters only, for example.
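As a rough illustration of this option, lowercasing can be thought of as a preprocessing pass applied before any detector sees the text. The helper name `prepare_texts` and its `lowercase` flag are hypothetical and are not part of the step's API; this is only a sketch of the idea.

```python
from typing import Iterable, List, Optional


def prepare_texts(texts: Iterable[Optional[str]], lowercase: bool = True) -> List[Optional[str]]:
    """Optionally lowercase texts before handing them to a language detector.

    Hypothetical helper for illustration only. Missing values (None) are
    passed through unchanged so they can be handled downstream.
    """
    prepared = []
    for text in texts:
        if text is None or not lowercase:
            prepared.append(text)
        else:
            prepared.append(text.lower())
    return prepared
```

Keeping this as a separate pass makes it easy to compare a detector's output with and without lowercasing on the same texts.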
Minimum probability to assign a language for a particular text.
If the model used to infer the language is less sure about a language than this, the corresponding text will be assigned the “undefined” language (“und”). Note that a reasonable value might depend on the specific model used. Different models may produce different distributions of detection confidence.
Values must be in the following range:
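The thresholding behaviour described for this parameter can be sketched as follows, assuming each prediction arrives as a (language code, probability) pair. The function name `apply_min_probability` and the input shape are assumptions made for illustration; the step's internal implementation is not shown in this document.

```python
from typing import Iterable, List, Tuple

UNDEFINED = "und"  # code assigned when the model is not confident enough


def apply_min_probability(
    predictions: Iterable[Tuple[str, float]], min_prob: float
) -> List[str]:
    """Replace low-confidence predictions with the "undefined" language.

    Hypothetical helper: `predictions` holds (language, probability) pairs,
    one per text. A prediction is kept only if its probability is at least
    `min_prob`; otherwise the text is labelled "und".
    """
    labels = []
    for lang, prob in predictions:
        labels.append(lang if prob >= min_prob else UNDEFINED)
    return labels


# Example with made-up probabilities
predictions = [("en", 0.98), ("es", 0.41), ("de", 0.70)]
labels = apply_min_probability(predictions, min_prob=0.7)
```

Because confidence scores are distributed differently from model to model, the same threshold can mark very different shares of texts as “und” depending on the model selected.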
Restrict which languages can be inferred.
Can be used to limit language detection to a smaller set if necessary. By default (when this parameter is not specified, or when it is set to true or null), detection is restricted to the languages for which we have spaCy models, because this is the most common use of language detection in Graphext (e.g. applying the correct spaCy language model to extract keywords).
If set to false, detection of all languages supported by the selected model is allowed.
If set to a list of ISO 639-1 codes, only these languages are detected (if supported by the model).
Array items
Each item in array.
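The three ways of setting this parameter can be summarised in a small sketch. The function `resolve_languages`, the `DEFAULT_LANGUAGES` set and the example model languages are all hypothetical: the actual list of languages with spaCy models is not given here, and intersecting the default set with the model's supported languages is an assumption.

```python
from typing import Iterable, List, Optional, Set, Union

# Hypothetical stand-in for the languages that have spaCy models.
DEFAULT_LANGUAGES = {"en", "es", "fr", "de"}


def resolve_languages(
    setting: Union[bool, None, List[str]],
    model_languages: Iterable[str],
    default_languages: Iterable[str] = DEFAULT_LANGUAGES,
) -> Set[str]:
    """Work out which languages may be detected for a given setting.

    - None or True: restrict to the default (spaCy-backed) languages.
    - False: allow every language the selected model supports.
    - list of ISO 639-1 codes: allow only those the model supports.
    """
    supported = set(model_languages)
    if setting is None or setting is True:
        return supported & set(default_languages)
    if setting is False:
        return supported
    return supported & set(setting)


# Example with a made-up set of languages supported by a model
model_languages = ["en", "es", "fr", "de", "ja", "sw"]
default_set = resolve_languages(None, model_languages)
all_set = resolve_languages(False, model_languages)
custom_set = resolve_languages(["ja", "xx"], model_languages)
```

In this sketch, codes the model does not support (such as the made-up "xx") are silently dropped, mirroring the “if supported by the model” condition.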