Classify texts using custom labels/categories.
Unlike classify_text, this step doesn’t require a model specifically trained with the given labels. Any model from the
Hugging Face hub that is compatible with their zero-shot classification pipeline can be used here. By default this is the (English) valhalla/distilbart-mnli-12-3, for a good trade-off between model size and accuracy. If a multilingual model is needed, you could try e.g. joeddav/xlm-roberta-large-xnli.
Note that we do not validate the model name before executing it, so make sure it
corresponds to an existing model in the hub, otherwise the step will fail.
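As a rough illustration of what a compatible model means, the sketch below calls the Hugging Face transformers zero-shot classification pipeline directly with the default model named above. The helper zero_shot_scores is hypothetical and is not the step's own implementation; it only assumes that the transformers package is installed and that the model can be downloaded from the hub.

```python
from typing import Dict, List

DEFAULT_MODEL = "valhalla/distilbart-mnli-12-3"  # default model named in the text above


def zero_shot_scores(
    texts: List[str],
    labels: List[str],
    model_name: str = DEFAULT_MODEL,
) -> List[Dict[str, float]]:
    """Return one {label: score} dict per text (hypothetical helper, not the step's code)."""
    # Imported lazily so that defining this helper does not require transformers.
    from transformers import pipeline

    # Downloads the model on first use and raises if the name is not found on the hub.
    classifier = pipeline("zero-shot-classification", model=model_name)
    results = classifier(texts, candidate_labels=labels)

    # Some library versions return a single dict when only one text is passed.
    if isinstance(results, dict):
        results = [results]

    # Each result lists labels and scores in matching order.
    return [dict(zip(r["labels"], r["scores"])) for r in results]
```

Calling zero_shot_scores(["The match ended 2-1."], ["sport", "politics", "business"]) would load the model and return the scores for each label; the exact values depend on the model used.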
Examples
For example, texts could be classified into the labels sport, politics and business.
Step inputs and outputs can be columns (ds.first_name), datasets (ds or ds[["first_name", "last_name"]]) or models (referenced by name e.g. "churn-clf").
Inputs
Outputs
The categories assigned to each text, taken from the labels parameter below. Depending on whether multilabel classification is activated or not, the output will be either a simple categorical column or a multilabel categorical column (containing lists of categories).
Parameters are passed in the form step(..., {"param": "value", ...}) -> (output).
Parameters
Compatible models typically have nli, mnli, or xnli in their name. For further details on zero-shot learning using NLI models see e.g. here.
Also, note that if the name doesn’t correspond to a model existing in the hub, the step will fail.
The default template is "This text is {}.", where the curly braces are then replaced with each label.
If you have texts in a specific language (and if you’re using a model appropriate for that single language), you should probably provide a corresponding template in that language. If you have texts in mixed languages (and specify a multilingual model), the default template should be fine.
You may also consider using alternative templates specific to your task. E.g. if you’re trying to classify the overall sentiment of product reviews, you may try a template like "The sentiment of this review is {}." (e.g. combined with "labels": ["positive", "negative"]
).
If false (default), only the label for the class with the highest probability will be returned.
If it is true, each class will be assigned a probability between 0 and 1. The result will then contain a list of labels corresponding to all classes with probabilities greater than the threshold min_prob (see below). The classes will be returned in the form of ordered lists, with the first element being the label of the class with the highest probability.
null (default), 0.0, or simply not specifying this parameter will disable filtering of categories. In this case, the result will contain all classes/labels for each row, ordered by probability in descending order.
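The parameter descriptions above can be sketched in plain Python. The three helpers below are hypothetical and only mirror the behaviour as described in this text (a name check for NLI models, filling the template once per label, and picking labels with or without the min_prob threshold); they are an assumption about how the pieces fit together, not the step's actual code.

```python
import re
from typing import Dict, List, Optional, Union

NLI_TOKENS = {"nli", "mnli", "xnli"}


def looks_like_nli_model(model_name: str) -> bool:
    """Heuristic only: does any part of the name equal nli, mnli or xnli?"""
    # Split on anything that is not a letter or digit, e.g. "/", "-" or "_".
    parts = re.split(r"[^a-z0-9]+", model_name.lower())
    return any(part in NLI_TOKENS for part in parts)


def build_hypotheses(labels: List[str], template: str = "This text is {}.") -> List[str]:
    """Fill the template once per label, replacing the curly braces."""
    return [template.format(label) for label in labels]


def select_labels(
    probs: Dict[str, float],
    multilabel: bool = False,
    min_prob: Optional[float] = None,
) -> Union[str, List[str]]:
    """Pick the output labels for one row from its {label: probability} mapping."""
    # Labels ordered by probability, highest first.
    ranked = sorted(probs, key=probs.get, reverse=True)
    if not multilabel:
        # Single-label case: only the most probable label is returned.
        return ranked[0]
    if not min_prob:
        # None or 0.0 disables filtering: all labels, in descending order.
        return ranked
    # Keep only labels whose probability exceeds the threshold.
    return [label for label in ranked if probs[label] > min_prob]
```

With probabilities of 0.9 for politics, 0.6 for business and 0.2 for sport, the single-label case returns politics, while multilabel mode with a min_prob of 0.5 returns politics followed by business.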