Zeroshot classify text

NLP · inference · classification · model · text · hugging face

Classify texts using custom labels/categories.

In contrast with classify_text, this step doesn't require a model specifically trained on the given labels. Any model from the Hugging Face hub that is compatible with Hugging Face's zeroshot classification pipeline can be used here. The default is the English model valhalla/distilbart-mnli-12-3, which offers a good trade-off between model size and accuracy. If you need a multilingual model, you could try e.g. joeddav/xlm-roberta-large-xnli.

Note that we do not validate the model name before executing the step, so make sure it corresponds to an existing model in the hub; otherwise the step will fail.

Experimental

This function is still in the experimental stage and we do not guarantee it won't fail for some combination of model and parameters. Feel free to get in touch if you have problems using it.

Example

E.g., to classify English texts into the three topics sport, politics and business:

zeroshot_classify_text(ds.text, {"labels": ["sport", "politics", "business"]}) -> (ds.topic)

More examples

Or to try and infer the sentiment of texts in multiple languages:

zeroshot_classify_text(ds.review, {
  "labels": ["positive", "negative"],
  "template": "The sentiment of this review is {}.",
  "model": "joeddav/xlm-roberta-large-xnli"
}) -> (ds.review_sentiment)

Usage

The following are the step's expected inputs and outputs and their specific types.

zeroshot_classify_text(text: text, {"param": value}) -> (class: column)

where the object {"param": value} is optional in most cases and if present may contain any of the parameters described in the corresponding section below.

Inputs


text: column:text

A column of texts to classify.

Outputs


class: column

The inferred class of each text. The labels of individual categories are those passed in using the labels parameter below. Depending on whether multilabel classification is activated, the output will be either a simple categorical column or a multilabel categorical column (containing lists of categories).

Parameters


labels: array[string]

A list of labels/categories to automatically assign to each text. Choosing them can be somewhat of a black art. As a simple, if perhaps obvious, heuristic: the fewer and less ambiguous the selected categories, the faster and probably more accurate the resulting classification. As the number and ambiguity of categories increase, expect less precise results.


model: string = "valhalla/distilbart-mnli-12-3"

The name of a model. This should be the full name (including the organization if applicable) of a model in the Hugging Face model hub. You can copy it by clicking on the icon next to the model's name on its dedicated web page.

Note that for now Hugging Face only supports models trained on NLI (natural language inference) tasks in their zeroshot pipeline. These can usually be recognized by nli, mnli, or xnli appearing in their name. For further details on zeroshot learning using NLI models see e.g. here.

Also, note that if the name doesn't correspond to a model existing in the hub the step will fail.

Example parameter values:

  • "joeddav/xlm-roberta-large-xnli"
  • "facebook/bart-large-mnli"

template: string

A custom hypothesis template. Hugging Face's NLI-based zeroshot pipeline essentially converts each label into a whole phrase, and then compares texts against these phrases to see whether the phrase "agrees" with or "contradicts" each text. The template parameter determines how a label is converted into a phrase. The default template is "This text is {}.", where the curly braces are replaced with each label.

If you have texts in a specific language (and if you're using a model appropriate for that single language), you should probably provide a corresponding template in that language. If you have texts in mixed languages (and specify a multilingual model), the default template should be fine.

You may also consider using alternative templates specific for your task. E.g. if you're trying to classify the overall sentiment of product reviews, you may try a template like "The sentiment of this review is {}." (e.g. combined with "labels": ["positive", "negative"]).
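As a minimal sketch of the label-to-phrase conversion described above, the following Python snippet fills a template with each label. The helper name build_hypotheses is hypothetical and only illustrates the substitution; it is not part of this step or of the Hugging Face library.

```python
from typing import List

# Default template as stated in this documentation.
DEFAULT_TEMPLATE = "This text is {}."


def build_hypotheses(labels: List[str], template: str = DEFAULT_TEMPLATE) -> List[str]:
    """Hypothetical helper: replace the template's curly braces with each label."""
    return [template.format(label) for label in labels]


default_hypotheses = build_hypotheses(["sport", "politics", "business"])
review_hypotheses = build_hypotheses(
    ["positive", "negative"], "The sentiment of this review is {}."
)

print(default_hypotheses)
# ['This text is sport.', 'This text is politics.', 'This text is business.']
print(review_hypotheses)
# ['The sentiment of this review is positive.', 'The sentiment of this review is negative.']
```

Reading the generated phrases aloud is a quick way to check whether a template and its labels form natural sentences in the language of your texts.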


multilabel: boolean = False

Whether to allow multiple labels/classes per text. If this parameter is false (default), only the label for the class with the highest probability will be returned.

If it is true, each class will be assigned a probability between 0 and 1. The result will then contain a list of labels corresponding to all classes with probabilities greater than the threshold min_prob (see below). The classes will be returned in the form of ordered lists, with the first element being the label of the class with the highest probability.
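To illustrate why multilabel probabilities each lie between 0 and 1 independently, while single-label probabilities compete with one another, here is a rough sketch of how NLI logits could be turned into class probabilities. It assumes the underlying pipeline scores each label separately (entailment versus contradiction) in multilabel mode, and normalizes entailment scores across all labels otherwise. This is an assumption about the pipeline's behavior, not something this step guarantees. The function names and logit values are made up for illustration.

```python
import math
from typing import Dict, List, Tuple


def softmax(values: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def class_probabilities(
    logits: Dict[str, Tuple[float, float]], multilabel: bool
) -> Dict[str, float]:
    """Turn per-label (contradiction, entailment) logits into probabilities.

    Hypothetical helper; the scoring scheme is an assumption (see lead-in).
    """
    if multilabel:
        # Each label is judged on its own: entailment vs. contradiction.
        return {label: softmax([c, e])[1] for label, (c, e) in logits.items()}
    # Single label: entailment logits compete across labels and sum to 1.
    labels = list(logits)
    probs = softmax([logits[label][1] for label in labels])
    return dict(zip(labels, probs))


# Made-up logits for one text, not real model output.
example_logits = {
    "sport": (-2.0, 3.0),
    "politics": (1.5, -1.0),
    "business": (-0.5, 1.0),
}
single = class_probabilities(example_logits, multilabel=False)
multi = class_probabilities(example_logits, multilabel=True)
print(single)
print(multi)
```

Under this assumption, several labels can have high probability at once in multilabel mode, which is why a threshold is needed to decide which of them to keep.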


min_prob: number | null

Only return labels for classes with probability greater than this value. In single label classification, if even the most probable class falls below this threshold, a missing value will be returned instead of a label.

When performing multilabel classification, any classes with probabilities below this threshold will simply be removed from the list of labels in each row. A value of null (default), 0.0, or simply not specifying this parameter will disable filtering of categories. In this case, the result will contain all classes/labels for each row, ordered by probability in descending order.
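The interaction between multilabel and min_prob can be sketched in a few lines of Python. This is a hypothetical re-implementation of the behavior described above, not the step's actual code: the function name select_labels and the probabilities are invented, and treating a probability exactly equal to min_prob as passing the threshold is an assumption.

```python
from typing import Dict, List, Optional, Union


def select_labels(
    probs: Dict[str, float],
    multilabel: bool = False,
    min_prob: Optional[float] = None,
) -> Union[None, str, List[str]]:
    """Hypothetical sketch of the label selection described above."""
    # Rank labels from most to least probable.
    ranked = sorted(probs, key=lambda label: probs[label], reverse=True)
    # None behaves like 0.0, i.e. no filtering (boundary handling is an assumption).
    threshold = 0.0 if min_prob is None else min_prob
    if multilabel:
        # Keep every label that clears the threshold, in descending order.
        return [label for label in ranked if probs[label] >= threshold]
    # Single label: the top class, or a missing value if it falls below the threshold.
    best = ranked[0]
    return best if probs[best] >= threshold else None


# Made-up probabilities for one text.
example_probs = {"sport": 0.91, "politics": 0.07, "business": 0.62}

print(select_labels(example_probs))                                # sport
print(select_labels(example_probs, min_prob=0.95))                 # None
print(select_labels(example_probs, multilabel=True, min_prob=0.5)) # ['sport', 'business']
print(select_labels(example_probs, multilabel=True))               # ['sport', 'business', 'politics']
```

In the step itself, the multilabel case with a threshold of 0.5 would correspond to passing "multilabel": true and "min_prob": 0.5 alongside the labels.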