zeroshot_classify_text
Classify texts using custom labels/categories.
In contrast with `classify_text`, this step doesn’t require a model specifically trained with the given labels. Any model from the Hugging Face hub that is compatible with their zero-shot classification pipeline can be used here. By default this is the (English) `valhalla/distilbart-mnli-12-3`, which offers a good trade-off between model size and accuracy. If a multilingual model is needed, you could try e.g. `joeddav/xlm-roberta-large-xnli`.

Note that we do not validate the model name before executing the step, so make sure it corresponds to an existing model in the hub; otherwise the step will fail.
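For reference, a minimal sketch of the underlying Hugging Face pipeline that this step builds on; the example text and labels are made up for illustration:

```python
from transformers import pipeline

# Load the default (English) zero-shot model; any NLI-trained model
# from the Hugging Face hub could be substituted here.
classifier = pipeline(
    "zero-shot-classification",
    model="valhalla/distilbart-mnli-12-3",
)

# A made-up text and candidate labels, purely for illustration.
result = classifier(
    "The battery died after two days of light use.",
    candidate_labels=["battery", "screen", "price"],
)

print(result["labels"][0])  # label of the most probable class
print(result["scores"][0])  # its probability
```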
A column of texts to classify.
The inferred class of each text. The labels of the individual categories are those passed in via the `labels` parameter below. Depending on whether multilabel classification is activated, the output will be either a simple categorical column or a multilabel categorical column (containing lists of categories).
A list of labels/categories to automatically assign to each text. Choosing good labels can be somewhat of a black art. As a simple, if perhaps obvious, heuristic: the fewer and less ambiguous the selected categories, the faster and more accurate the resulting classification is likely to be. As the number and ambiguity of categories increases, expect less precise results.
The name of a model. This should be the full name (including the organization if applicable) of a model in the Hugging Face model hub. You can copy it by clicking on the icon next to the model’s name on its dedicated web page.
Note that for now Hugging Face only supports models trained on NLI (natural language inference) tasks in their zero-shot pipeline. These can usually be recognized by the mention of `nli`, `mnli`, or `xnli` in their name. For further details on zero-shot learning using NLI models see e.g. here. Also note that if the name doesn’t correspond to a model existing in the hub, the step will fail.
- joeddav/xlm-roberta-large-xnli
- facebook/bart-large-mnli
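As a sketch of how a multilingual model from the list above would be exercised directly with Hugging Face’s pipeline; the Spanish example text and labels are invented:

```python
from transformers import pipeline

# A multilingual NLI model can classify non-English texts,
# and even accepts non-English labels.
classifier = pipeline(
    "zero-shot-classification",
    model="joeddav/xlm-roberta-large-xnli",
)

# Invented Spanish example: "The delivery arrived late and the box was damaged."
result = classifier(
    "El envío llegó tarde y la caja estaba dañada.",
    candidate_labels=["envío", "calidad", "precio"],
)
print(result["labels"][0], result["scores"][0])
```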
A custom hypothesis template.
Hugging Face’s NLI-based zero-shot pipeline essentially converts each label into a whole phrase, and then compares texts against these phrases to see whether a phrase “agrees” with or “contradicts” each text. The template parameter determines how a label is converted into a phrase. The default is `"This text is {}."`, where the curly braces are replaced with each label.
If you have texts in a specific language (and if you’re using a model appropriate for that single language), you should probably provide a corresponding template in that language. If you have texts in mixed languages (and specify a multilingual model), the default template should be fine.
You may also consider using alternative templates specific to your task. E.g. if you’re trying to classify the overall sentiment of product reviews, you may try a template like `"The sentiment of this review is {}."` (combined with, say, `"labels": ["positive", "negative"]`).
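Translated to the underlying Hugging Face pipeline, the template corresponds to the `hypothesis_template` argument; a sketch, with a made-up review text:

```python
from transformers import pipeline

classifier = pipeline(
    "zero-shot-classification",
    model="valhalla/distilbart-mnli-12-3",
)

# The template turns each label into a full hypothesis,
# e.g. "The sentiment of this review is positive."
result = classifier(
    "Great value for the price, would buy again.",
    candidate_labels=["positive", "negative"],
    hypothesis_template="The sentiment of this review is {}.",
)
print(result["labels"][0])
```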
Whether to allow multiple labels/classes per text.
If this parameter is `false` (default), only the label of the class with the highest probability will be returned. If it is `true`, each class will be assigned a probability between 0 and 1, and the result will contain a list of labels corresponding to all classes with probabilities greater than the threshold `min_prob` (see below). The classes will be returned as ordered lists, with the first element being the label of the class with the highest probability.
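In Hugging Face’s pipeline this corresponds to the `multi_label=True` argument; a minimal sketch with invented text and labels:

```python
from transformers import pipeline

classifier = pipeline(
    "zero-shot-classification",
    model="valhalla/distilbart-mnli-12-3",
)

# With multi_label=True each label is scored independently,
# so several labels can receive high probabilities at once.
result = classifier(
    "The screen is gorgeous but the battery barely lasts a day.",
    candidate_labels=["battery", "screen", "price"],
    multi_label=True,
)

# Labels come back ordered by descending probability.
for label, score in zip(result["labels"], result["scores"]):
    print(label, round(score, 3))
```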
Only return labels for classes with probability greater than this value. In single-label classification, if even the most probable class falls below this threshold, a missing value will be returned instead of a label.

When performing multilabel classification, any classes with probabilities below this threshold will simply be removed from the list of labels in each row. A value of `null` (default), `0.0`, or simply not specifying this parameter will disable the filtering of categories. In this case, the result will contain all classes/labels for each row, ordered by probability in descending order.
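The filtering itself boils down to keeping only the labels whose score exceeds the threshold. A minimal sketch of that logic, not the step’s actual implementation (names are illustrative):

```python
def filter_labels(labels, scores, min_prob=None):
    """Keep only labels whose probability is greater than min_prob.

    A min_prob of None or 0.0 disables filtering: all labels are
    returned, already ordered by descending probability.
    """
    if not min_prob:
        return list(labels)
    return [label for label, score in zip(labels, scores) if score > min_prob]

# Example scores as the pipeline would return them (descending order).
print(filter_labels(["battery", "screen", "price"], [0.91, 0.55, 0.08], min_prob=0.5))
# -> ['battery', 'screen']
```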
How many texts to process simultaneously. May be ignored when running on CPU.
Values must be in the following range:
1 ≤ batch_size ≤ 64
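In terms of the underlying pipeline, this corresponds to passing several texts along with a `batch_size` argument; a sketch with invented texts:

```python
from transformers import pipeline

classifier = pipeline(
    "zero-shot-classification",
    model="valhalla/distilbart-mnli-12-3",
)

# Several texts can be passed at once; batch_size controls how many
# are processed simultaneously (mostly relevant on GPU).
texts = [
    "The battery died after two days.",
    "Beautiful screen with crisp colors.",
    "Way too expensive for what it offers.",
]
results = classifier(texts, candidate_labels=["battery", "screen", "price"], batch_size=16)
for result in results:
    print(result["labels"][0])
```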