Label categories
fast step NLP • text
Relabel categories based on the top terms in each category.
This function relabels categories based on the most significant terms, or top_terms, within each category. It takes two columns as inputs: one with the old_labels, which can be single- or multi-valued categories, and one with the top_terms for each data point. The replacement labels depend on the specified rank method, which can be TFIDF, BACKGROUND, FOREGROUND, UPLIFT, ORDINAL, or ALPHANUM, and on the number of top terms considered (specified by top_n).
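The general idea can be sketched in Python. This is only an illustration, not the step's actual implementation: the documentation does not say how each rank method scores terms or how the selected terms are combined into a label. The function name `relabel_by_tfidf`, the smoothed TF-IDF formula, and the `", "` separator are all assumptions, and the sketch handles only single-valued old labels with list-valued top terms.

```python
import math
from collections import Counter, defaultdict


def relabel_by_tfidf(old_labels, top_terms, top_n=4):
    """Hypothetical sketch: build a new label per category from its top_n terms.

    old_labels: one category value per row (single-valued only, by assumption).
    top_terms:  one list of terms per row.
    """
    # Term counts per category, pooled over all rows carrying that label.
    term_counts = defaultdict(Counter)
    for label, terms in zip(old_labels, top_terms):
        term_counts[label].update(terms)

    # Number of categories in which each term occurs at least once.
    n_categories = len(term_counts)
    category_freq = Counter()
    for counts in term_counts.values():
        category_freq.update(counts.keys())

    mapping = {}
    for label, counts in term_counts.items():
        total = sum(counts.values())
        scores = {
            # Assumed scoring: relative frequency times a smoothed IDF.
            term: (count / total) * (math.log(n_categories / category_freq[term]) + 1.0)
            for term, count in counts.items()
        }
        # Highest score first; ties broken alphabetically for determinism.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        # Joining with ", " is an assumption about the label format.
        mapping[label] = ", ".join(ranked[:top_n])

    return [mapping[label] for label in old_labels]


old = ["a", "a", "b", "b"]
terms = [["cat", "pet"], ["cat", "fur"], ["car", "road"], ["car", "pet"]]
new_labels = relabel_by_tfidf(old, terms, top_n=2)
```

In this toy input, "pet" occurs in both categories, so it scores lower than terms that are specific to a single category and is left out of both labels when `top_n` is 2.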
Usage
The following are the step's expected inputs and outputs and their specific types.
label_categories(
    old_labels: category|list,
    top_terms: category|text|list,
    {
        "param": value
    }
) -> (new_labels: column)
where the object {"param": value} is optional in most cases and, if present, may contain any of the parameters described in the corresponding section below.
Example
To replace labels in a column of categories using TFIDF, the default rank method:
label_categories(ds.old_labels, ds.top_terms) -> (ds.new_labels)
More examples
To replace labels in a column of categories using BACKGROUND:
label_categories(ds.old_labels, ds.top_terms, {
    rank_method: 'BACKGROUND',
    top_n: 3
}) -> (ds.new_labels)
To replace labels in a column of categories using BACKGROUND and ascending order:
label_categories(ds.old_labels, ds.top_terms, {
    rank_method: 'BACKGROUND',
    top_n: 3,
    ascending: true
}) -> (ds.new_labels)
Inputs
old_labels: column:category|list
A column containing the old labels, which could be single-value or multi-value categories.
top_terms: column:category|text|list
A column containing lists of top terms for each data point.
Outputs
new_labels: column
The output column. Its data type will depend on the 'old_labels' input column type.
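One plausible reading of this type dependence is sketched below in Python: a single-valued input yields a single new label per row, while a multi-valued (list) input yields a list of new labels. The helper `apply_label_mapping` and the choice to leave unmapped labels unchanged are assumptions for illustration, not documented behaviour.

```python
def apply_label_mapping(value, mapping):
    """Hypothetical helper: relabel a single category or a list of categories."""
    if isinstance(value, list):
        # Multi-valued row: relabel each element, so the output stays a list.
        return [mapping.get(item, item) for item in value]
    # Single-valued row: the output is a single label.
    # Labels missing from the mapping are kept as-is (an assumption).
    return mapping.get(value, value)


mapping = {"cluster_1": "cat, fur", "cluster_2": "car, road"}
single = apply_label_mapping("cluster_1", mapping)
multi = apply_label_mapping(["cluster_1", "cluster_2", "other"], mapping)
```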
Parameters
rank_method: string = "TFIDF"
The method used to rank the top terms.
Must be one of: "TFIDF", "BACKGROUND", "FOREGROUND", "UPLIFT", "ORDINAL", "ALPHANUM"
top_n: integer = 4
The number of top terms considered for each label.
Range: 1 ≤ top_n < inf
ascending: boolean = False
Whether the terms should be sorted in ascending order.
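The defaults and constraints listed above can be summarised as a small validation sketch in Python. The function `resolve_params` and its error messages are hypothetical. Only the parameter names, defaults, allowed values, and the range of top_n come from this page, and the step itself may validate differently.

```python
RANK_METHODS = {"TFIDF", "BACKGROUND", "FOREGROUND", "UPLIFT", "ORDINAL", "ALPHANUM"}
DEFAULTS = {"rank_method": "TFIDF", "top_n": 4, "ascending": False}


def resolve_params(params=None):
    """Hypothetical helper: merge user parameters with the documented defaults."""
    params = params or {}
    unknown = set(params) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")

    merged = {**DEFAULTS, **params}

    if merged["rank_method"] not in RANK_METHODS:
        raise ValueError(f"rank_method must be one of {sorted(RANK_METHODS)}")

    top_n = merged["top_n"]
    # bool is a subclass of int in Python, so it is rejected explicitly.
    if isinstance(top_n, bool) or not isinstance(top_n, int) or top_n < 1:
        raise ValueError("top_n must be an integer >= 1")

    if not isinstance(merged["ascending"], bool):
        raise ValueError("ascending must be a boolean")

    return merged


resolved = resolve_params({"rank_method": "BACKGROUND", "top_n": 3})
```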