Label categories

fast step  NLP  text

Relabel categories based on the top terms in each category.

This step relabels categories using the most significant terms (top_terms) within each category. It takes two input columns: old_labels, which holds single-valued or multi-valued categories, and top_terms, which holds the top terms for each data point. The new labels depend on two parameters: rank_method, which sets how terms are ranked (TFIDF, BACKGROUND, FOREGROUND, UPLIFT, ORDINAL, or ALPHANUM), and top_n, which sets how many top terms are considered for each label.
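The general idea can be illustrated with a small Python sketch. This is not the step's actual implementation, which is not documented here. It assumes single-valued labels, a simple TF-IDF score that treats each category as one document, and a new label built by joining the top_n terms with ", ". The function name relabel_by_top_terms and the join format are hypothetical.

```python
import math
from collections import Counter, defaultdict


def relabel_by_top_terms(old_labels, top_terms, top_n=4):
    """Illustrative only: name each category after its top_n TF-IDF terms."""
    # Count how often each term occurs within each category.
    term_counts = defaultdict(Counter)
    for label, terms in zip(old_labels, top_terms):
        term_counts[label].update(terms)

    # Document frequency: the number of categories in which each term appears.
    n_categories = len(term_counts)
    doc_freq = Counter()
    for counts in term_counts.values():
        doc_freq.update(counts.keys())

    mapping = {}
    for label, counts in term_counts.items():
        total = sum(counts.values())
        # Assumed scoring: term frequency times a smoothed inverse document frequency.
        scores = {
            term: (count / total)
            * (math.log((1 + n_categories) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        }
        # Highest score first; ties are broken alphabetically for determinism.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        mapping[label] = ", ".join(ranked[:top_n])

    # Every row receives the new label of its original category.
    return [mapping[label] for label in old_labels]


old = ["0", "0", "1", "1"]
terms = [
    ["price", "cheap", "service"],
    ["price", "refund"],
    ["delivery", "late", "service"],
    ["delivery", "courier"],
]
print(relabel_by_top_terms(old, terms, top_n=2))
# ['price, cheap', 'price, cheap', 'delivery, courier', 'delivery, courier']
```

In this sketch the term "service" appears in both categories, so it scores lower than terms specific to one category and is left out of both labels when top_n is 2.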

Usage


The following are the step's expected inputs and outputs and their specific types.

Step signature
label_categories(
    old_labels: category|list,
    top_terms: category|text|list, 
    {
        "param": value
    }
) -> (new_labels: column)

Here, the object {"param": value} is optional in most cases. If present, it may contain any of the parameters described in the Parameters section below.

Example

To replace labels in a column of categories using TFIDF (the default rank method):

Example call (in recipe editor)
label_categories(ds.old_labels, ds.top_terms) -> (ds.new_labels)

More examples

To replace labels in a column of categories using BACKGROUND:

Example call (in recipe editor)
label_categories(ds.old_labels, ds.top_terms, {
  rank_method: 'BACKGROUND',
  top_n: 3
}) -> (ds.new_labels)

To replace labels in a column of categories using BACKGROUND, with terms sorted in ascending order:

Example call (in recipe editor)
label_categories(ds.old_labels, ds.top_terms, {
  rank_method: 'BACKGROUND',
  top_n: 3,
  ascending: true
}) -> (ds.new_labels)

Inputs


old_labels: column:category|list

A column containing the old labels, which can be single-valued or multi-valued categories.


top_terms: column:category|text|list

A column containing lists of top terms for each data point.

Outputs


new_labels: column

The output column. Its data type depends on the type of the old_labels input column.

Parameters


rank_method: string = "TFIDF"

The method used to rank the top terms.

Must be one of: "TFIDF", "BACKGROUND", "FOREGROUND", "UPLIFT", "ORDINAL", "ALPHANUM"


top_n: integer = 4

The number of top terms considered for each label.

Range: 1 ≤ top_n < inf


ascending: boolean = False

Whether the terms should be sorted in ascending order.