In contrast with classify_text, this step doesn't require a model specifically trained on the given labels. Any model from the Hugging Face Hub that is compatible with its zero-shot classification pipeline can be used here. The default is the (English) valhalla/distilbart-mnli-12-3, which offers a good trade-off between model size and accuracy. If a multilingual model is needed, you could try e.g. joeddav/xlm-roberta-large-xnli.

Note that we do not validate the model name before executing the step, so make sure it corresponds to an existing model on the Hub; otherwise the step will fail.
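As a rough sketch, the step behaves like Hugging Face's zero-shot classification pipeline. The snippet below is illustrative only (the make_classifier helper and constant names are not part of any API; it assumes the transformers library is installed, and model weights download on first use):

```python
# Sketch: the step is roughly equivalent to Hugging Face's zero-shot pipeline.
DEFAULT_MODEL = "valhalla/distilbart-mnli-12-3"        # English; good size/accuracy trade-off
MULTILINGUAL_MODEL = "joeddav/xlm-roberta-large-xnli"  # for texts in multiple languages


def make_classifier(model: str = DEFAULT_MODEL):
    """Build a zero-shot classifier; raises if the model doesn't exist on the Hub."""
    from transformers import pipeline  # imported lazily; requires `transformers`
    return pipeline("zero-shot-classification", model=model)


# Usage (downloads the model weights on first call):
# classifier = make_classifier()
# classifier("I loved the plot twist.", candidate_labels=["film", "sports"])
```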

labels
array[string]

A list of labels/categories to automatically assign to each text. Choosing good labels can be somewhat of a black art. As a simple, if perhaps obvious, heuristic: the fewer and less ambiguous the selected categories, the faster and more accurate the resulting classification is likely to be. As the number and ambiguity of categories increases, expect less precise results.

model
string
default: "valhalla/distilbart-mnli-12-3"

The name of the model to use. This should be the full name (including the organization, if applicable) of a model on the Hugging Face model hub. You can copy it by clicking the icon next to the model's name on its dedicated web page.

Note that for now Hugging Face only supports models trained on NLI (natural language inference) tasks in its zero-shot pipeline. These can usually be recognized by nli, mnli, or xnli in their names. For further details on zero-shot learning using NLI models see e.g. here.

Also note that if the name doesn't correspond to an existing model on the Hub, the step will fail.

template
string

A custom hypothesis template. Hugging Face's NLI-based zero-shot pipeline essentially converts each label into a whole phrase, and then compares texts against these phrases to see whether each phrase "agrees" with or "contradicts" each text. The template parameter determines how a label is converted into a phrase. The default is "This text is {}.", where the curly braces are replaced with each label.

If you have texts in a specific language (and if you’re using a model appropriate for that single language), you should probably provide a corresponding template in that language. If you have texts in mixed languages (and specify a multilingual model), the default template should be fine.

You may also consider using alternative templates specific for your task. E.g. if you’re trying to classify the overall sentiment of product reviews, you may try a template like "The sentiment of this review is {}." (e.g. combined with "labels": ["positive", "negative"]).
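To make the mechanics concrete, here is a small sketch of how labels are expanded into hypothesis phrases. The render_hypotheses helper is purely illustrative; in Hugging Face's pipeline the template is passed via the hypothesis_template argument:

```python
def render_hypotheses(labels, template="This text is {}."):
    """Expand each label into the full hypothesis phrase the NLI model scores."""
    return [template.format(label) for label in labels]


# The NLI model then scores each (text, hypothesis) pair for entailment:
print(render_hypotheses(["positive", "negative"],
                        template="The sentiment of this review is {}."))
# ['The sentiment of this review is positive.',
#  'The sentiment of this review is negative.']

# With Hugging Face's pipeline, the template is passed at call time, e.g.:
# classifier(texts, candidate_labels=labels,
#            hypothesis_template="The sentiment of this review is {}.")
```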

multilabel
boolean

Whether to allow multiple labels/classes per text. If this parameter is false (default), only the label for the class with the highest probability will be returned.

If it is true, each class will be assigned a probability between 0 and 1, and the result will contain a list of labels corresponding to all classes with probabilities greater than the threshold min_prob (see below). Labels are returned as ordered lists, with the first element being the label of the most probable class.
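The difference between the two modes can be sketched as below, assuming the pipeline's standard result format (parallel "labels" and "scores" lists, sorted by descending probability; in Hugging Face's pipeline the corresponding call argument is multi_label=True, and the to_output helper here is illustrative only):

```python
def to_output(result: dict, multilabel: bool = False):
    """Return either the single top label or the full ordered label list."""
    if multilabel:
        return result["labels"]    # ordered, most probable first
    return result["labels"][0]     # single best class only


result = {"sequence": "Great value for the price!",
          "labels": ["positive", "neutral", "negative"],
          "scores": [0.91, 0.06, 0.03]}

print(to_output(result))                   # 'positive'
print(to_output(result, multilabel=True))  # ['positive', 'neutral', 'negative']
```

Note that in multilabel mode the pipeline scores each class independently, so the probabilities need not sum to 1.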

min_prob
[number, null]

Only return labels for classes with a probability greater than this value. In single-label classification, if even the most probable class falls below this threshold, a missing value will be returned instead of a label.

When performing multilabel classification, any classes with probabilities below this threshold will simply be removed from the list of labels in each row. A value of null (default), 0.0, or simply not specifying this parameter will disable filtering of categories. In this case, the result will contain all classes/labels for each row, ordered by probability in descending order.
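The filtering described above can be sketched in plain Python. The apply_min_prob helper is hypothetical, and the result dict mirrors the pipeline's output format:

```python
def apply_min_prob(result, min_prob=None, multilabel=False):
    """Filter classes by probability threshold.

    A min_prob of None or 0.0 disables filtering. In single-label mode,
    a sub-threshold top class yields None (a missing value).
    """
    threshold = min_prob or 0.0
    kept = [lab for lab, score in zip(result["labels"], result["scores"])
            if score > threshold]
    if multilabel:
        return kept                       # possibly empty, ordered by probability
    return kept[0] if kept else None      # missing value if top class is too weak


result = {"labels": ["pos", "neu", "neg"], "scores": [0.55, 0.30, 0.15]}
print(apply_min_prob(result, min_prob=0.25, multilabel=True))  # ['pos', 'neu']
print(apply_min_prob(result, min_prob=0.6))                    # None
print(apply_min_prob(result))                                  # 'pos'
```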

batch_size
integer
default: 2

How many texts to process simultaneously. May be ignored when running on CPU.

Values must be in the following range:

1 ≤ batch_size ≤ 64
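Batching affects throughput, not the classification results. A sketch of how the documented range could be enforced before forwarding the value to the pipeline (the validate_batch_size helper is illustrative only):

```python
def validate_batch_size(batch_size: int) -> int:
    """Enforce the documented range 1 <= batch_size <= 64."""
    if not 1 <= batch_size <= 64:
        raise ValueError(f"batch_size must be in [1, 64], got {batch_size}")
    return batch_size


print(validate_batch_size(2))  # 2

# Hugging Face pipelines accept the batch size at call time, e.g.:
# classifier(texts, candidate_labels=labels,
#            batch_size=validate_batch_size(2))
```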