# infer_language

> Detect the language used for each text in the input column. 

Each language will be represented by its [ISO 639-1 language code](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes),
such as "en", "es", "it" for English, Spanish and Italian respectively.

## Usage

The following example shows how the step can be used in a recipe.

<Accordion title="Examples" icon="code" defaultOpen="true">
  <Tabs>
    <Tab title="Example 1">
      In most cases no special configuration is necessary, so the step can simply be called with its defaults:

      ```stan theme={null}
      infer_language(ds.text) -> (ds.language)
      ```
    </Tab>

    <Tab title="Signature">
      General syntax for using the step in a recipe. Shows the inputs the step expects and the outputs it produces. For further details see the sections below.

      ```stan theme={null}
      infer_language(text: text, {
          "param": value,
          ...
      }) -> (lang: category)
      ```
    </Tab>
  </Tabs>
</Accordion>

## Inputs & Outputs

The following are the inputs expected by the step and the outputs it produces. These are generally
columns (`ds.first_name`), datasets (`ds` or `ds[["first_name", "last_name"]]`) or models (referenced
by name e.g. `"churn-clf"`).

<Accordion title="Inputs" icon="right-to-bracket">
  <ParamField path="text" type="column[text]" required>
    A text column to detect languages for.
  </ParamField>
</Accordion>

<Accordion title="Outputs" icon="right-from-bracket">
  <ParamField path="lang" type="column[category]" required>
    A column identifying the language of each text using its two-letter
    [ISO 639-1 language code](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes).
  </ParamField>
</Accordion>

## Configuration

The following parameters can be used to configure the behaviour of the step by including them in
a json object as the last "input" to the step, i.e. `step(..., {"param": "value", ...}) -> (output)`.

<Accordion title="Parameters" defaultOpen="true" icon="sliders">
  <ParamField path="model" type="string" default="lingua">
    Which model to use to detect languages.
    Select one of four models, each corresponding to a specific Python library:

    * `"lingua"`: [https://github.com/pemistahl/lingua-py](https://github.com/pemistahl/lingua-py)
    * `"fasttext"`: [https://fasttext.cc/docs/en/language-identification.html](https://fasttext.cc/docs/en/language-identification.html)
    * `"langdetect"`: [https://github.com/Mimino666/langdetect](https://github.com/Mimino666/langdetect)
    * `"langid"`: [https://github.com/saffsd/langid.py](https://github.com/saffsd/langid.py)

    Values must be one of the following:

    * `lingua`
    * `fasttext`
    * `langdetect`
    * `langid`
  </ParamField>

  <ParamField path="lowercase" type="boolean" default="true">
    Whether to lowercase texts before detection.
    Some models are more sensitive to casing than others, for example when texts are written in capital letters only.
  </ParamField>

  <ParamField path="min_probability" type="number" default="0.0">
    Minimum probability required to assign a language to a particular text.
    If the selected model's confidence in the detected language is below this value, the corresponding
    text will be assigned the "undefined" language ("und"). Note that a reasonable value depends
    on the specific model used, since different models may produce different distributions of detection confidence.

    Values must be in the following range:

    ```javascript theme={null}
    0 ≤ min_probability ≤ 1
    ```
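
    As an illustration, the following sketch assigns "und" to any text whose detection confidence falls below the threshold. The value `0.6` is an arbitrary assumption rather than a recommended setting, and the column names `ds.text` and `ds.language` are simply reused from the example at the top of this page.

    ```stan theme={null}
    infer_language(ds.text, {
        "min_probability": 0.6
    }) -> (ds.language)
    ```

    Because confidence distributions differ between models, a threshold that works for one `model` may need adjusting after switching to another.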
  </ParamField>

  <ParamField path="allowed_languages" type="array[string]" default="['ar', 'ca', 'da', 'de', 'el', 'en', 'es', 'eu', 'fi', 'fr', 'hr', 'it', 'ja', 'lt', 'nb', 'nl', 'pl', 'pt', 'ro', 'sv', 'tr']">
    Restrict which languages can be inferred.
    Can be used to limit language detection to a smaller set if necessary. By default (when
    not specifying this parameter, or when setting it to `true` or `null`), detection is restricted
    to the languages for which we have spaCy models, because the most common use of
    language detection in Graphext is selecting the correct spaCy language model, e.g. to extract
    keywords.

    If set to `false`, will allow detection of all languages supported by the selected model.

    If set to a list of ISO 639-1 codes, only these languages are detected (if supported by
    the model).
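
    The two non-default settings could be sketched as follows. The language list and the output column names `ds.language_restricted` and `ds.language_all` are hypothetical and chosen for illustration only. To detect only English, Spanish and Italian:

    ```stan theme={null}
    infer_language(ds.text, {
        "allowed_languages": ["en", "es", "it"]
    }) -> (ds.language_restricted)
    ```

    To allow every language supported by the selected model:

    ```stan theme={null}
    infer_language(ds.text, {
        "allowed_languages": false
    }) -> (ds.language_all)
    ```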

    <Accordion title="Array items">
      <ParamField path="Item" type="string">
        Each item in array.
      </ParamField>
    </Accordion>
  </ParamField>
</Accordion>
