zeroshot_classify_text

Unlike classify_text, this step doesn’t require a model trained specifically on the given labels. Any model from the Hugging Face hub that is compatible with its zeroshot classification pipeline can be used here. The default is the (English) valhalla/distilbart-mnli-12-3, which offers a good trade-off between model size and accuracy. If you need a multilingual model, you could try e.g. joeddav/xlm-roberta-large-xnli. Note that the model name is not validated before execution, so make sure it corresponds to an existing model in the hub; otherwise the step will fail.

Usage

The following examples show how the step can be used in a recipe.

Examples

E.g., to classify English texts into the three topics sport, politics and business:

zeroshot_classify_text(ds.text, {"labels": ["sport", "politics", "business"]}) -> (ds.topic)

Inputs & Outputs

The following are the inputs expected by the step and the outputs it produces. These are generally columns (ds.first_name), datasets (ds or ds[["first_name", "last_name"]]) or models (referenced by name e.g. "churn-clf").

Inputs

Outputs

Configuration

The following parameters can be used to configure the behaviour of the step by including them in a json object as the last “input” to the step, i.e. step(..., {"param": "value", ...}) -> (output).

Parameters

labels

array[string]

A list of labels/categories to automatically assign to each text. Choosing them can be somewhat of a black art. As a simple, if perhaps obvious, heuristic: the fewer and less ambiguous the selected categories, the faster and, most probably, the more accurate the resulting classification. As the number and ambiguity of categories increase, expect less precise results.

model

string

default:"valhalla/distilbart-mnli-12-3"

The name of a model. This should be the full name (including the organization if applicable) of a model in the Hugging Face model hub. You can copy it by clicking on the icon next to the model’s name on its dedicated web page.

Note that for now Hugging Face only supports models trained on NLI (natural language inference) tasks in their zeroshot pipeline. These can usually be recognized by the mention of nli, mnli, or xnli in their name. For further details on zeroshot learning using NLI models see e.g. here.

Also note that if the name doesn’t correspond to a model existing in the hub, the step will fail.
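For orientation, the following Python sketch shows roughly how Hugging Face's zero-shot classification pipeline can be called directly with a model name from the hub. It is not the step's actual implementation: the function name classify_zero_shot and its argument names are hypothetical, and the mapping of the step's parameters onto the pipeline's arguments is an assumption. It requires the transformers package and downloads the model on first use.

```python
def classify_zero_shot(texts, labels,
                       model="valhalla/distilbart-mnli-12-3",
                       template="This text is {}.",
                       multilabel=False):
    """Hypothetical helper: run Hugging Face's zero-shot pipeline on texts.

    Returns the pipeline's raw output: for a list of texts, a list of dicts
    with the keys "sequence", "labels" and "scores" (labels sorted by score).
    """
    # Imported lazily so that defining the function does not require
    # transformers to be installed.
    from transformers import pipeline

    # Fails if `model` does not name an existing model in the hub.
    classifier = pipeline("zero-shot-classification", model=model)
    return classifier(
        texts,
        candidate_labels=labels,
        hypothesis_template=template,
        # Named `multi_label` in recent transformers releases.
        multi_label=multilabel,
    )
```

A call such as classify_zero_shot(["The match ended 2-1."], ["sport", "politics", "business"]) would then score each text against the three labels.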

template

string

A custom hypothesis template. Hugging Face’s NLI-based zeroshot pipeline essentially converts each label into a whole phrase, and then compares texts against these phrases to see whether the phrase “agrees” with or “contradicts” each text. The template parameter determines how a label is converted into a phrase. The default phrase is "This text is {}.", where the curly braces are replaced with each label.

If you have texts in a specific language (and you’re using a model appropriate for that single language), you should probably provide a corresponding template in that language. If you have texts in mixed languages (and specify a multilingual model), the default template should be fine.

You may also consider using alternative templates specific to your task. E.g. if you’re trying to classify the overall sentiment of product reviews, you may try a template like "The sentiment of this review is {}." (e.g. combined with "labels": ["positive", "negative"]).
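As a rough illustration of the substitution described above, the following Python sketch fills a template with each label. The helper name build_hypotheses is hypothetical, and the sketch mirrors only the string substitution, not the comparison the model performs afterwards.

```python
from typing import List


def build_hypotheses(labels: List[str],
                     template: str = "This text is {}.") -> List[str]:
    """Replace the curly braces in the template with each label in turn."""
    return [template.format(label) for label in labels]


# The sentiment example from the text above.
hypotheses = build_hypotheses(
    ["positive", "negative"],
    "The sentiment of this review is {}.",
)
print(hypotheses)
```

Each resulting phrase is what the NLI model would then test against a text for agreement or contradiction.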

multilabel

boolean

default:"false"

Whether to allow multiple labels/classes per text. If this parameter is false (default), only the label for the class with the highest probability will be returned.

If it is true, each class will be assigned a probability between 0 and 1. The result will then contain a list of labels corresponding to all classes with probabilities greater than the threshold min_prob (see below). The classes will be returned as ordered lists, with the first element being the label of the class with the highest probability.

min_prob

[number, null]

Only return labels for classes with probability greater than this value. In single label classification, if even the most probable class falls below this threshold, a missing value will be returned instead of a label.

When performing multilabel classification, any classes with probabilities below this threshold will simply be removed from the list of labels in each row. A value of null (default), 0.0, or simply not specifying this parameter will disable filtering of categories. In this case, the result will contain all classes/labels for each row, ordered by probability in descending order.
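The interplay between multilabel and min_prob described above can be sketched in Python as follows. This is an illustration of the documented behaviour under stated assumptions, not the step's implementation: the function name select_labels is hypothetical, the per-class probabilities are taken as given, and a probability exactly equal to min_prob is assumed to be filtered out.

```python
from typing import Dict, List, Optional, Union


def select_labels(probs: Dict[str, float],
                  multilabel: bool = False,
                  min_prob: Optional[float] = None) -> Union[str, List[str], None]:
    """Pick the label(s) for one text from its per-class probabilities."""
    # Order classes by probability, highest first.
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    # null (None) and 0.0 both disable filtering.
    filtering = min_prob is not None and min_prob > 0.0

    if multilabel:
        # Keep every class above the threshold, in descending order.
        return [label for label, p in ranked if not filtering or p > min_prob]

    # Single label: return the top class, or a missing value (None)
    # if even that class does not exceed the threshold (assumed strict).
    top_label, top_prob = ranked[0]
    if filtering and top_prob <= min_prob:
        return None
    return top_label


probs = {"sport": 0.9, "politics": 0.6, "business": 0.05}
print(select_labels(probs))
print(select_labels(probs, min_prob=0.95))
print(select_labels(probs, multilabel=True))
print(select_labels(probs, multilabel=True, min_prob=0.5))
```

With the probabilities above, the four calls cover the single-label case, the single-label case with a missing value, unfiltered multilabel output, and filtered multilabel output.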

batch_size

integer

default:"2"

How many texts to process simultaneously. May be ignored when running on CPU. Values must be in the following range:

1 ≤ batch_size ≤ 64
