Label texts containing

NLP · text

Categorize texts containing specific keywords with custom labels.

Assigns each text to one or more categories. Each category is defined by a list of keywords a text must include or exclude to be labelled accordingly. In addition, each category may specify how its keywords are matched: with or without regard to case (lower, upper) and accents, and as whole words only or as substrings. See the parameters below for further details.

Example

The following defines the keywords to be included or excluded for each of three categories, labelled "journalists", "business" and "CEOs". Note how in the case of "CEOs" we're looking for occurrences of the spelling in capitals only.

label_texts_containing(ds.text, {
  "journalists": {
    "include": ["journalist", "journalism", "news"],
    "exclude": ["blogger"],
    "case_sensitive": false
  },
  "business": {
    "include": ["startup", "entrepreneur", "founder"]
  },
  "CEOs": {
    "include": ["CEO"],
    "case_sensitive": true
  }
}) -> (ds.field_of_occupation)
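
To make the include/exclude logic of the example above more concrete, here is a minimal Python sketch of how such category definitions could be applied to a single text. It is not the step's actual implementation. The function name `label_text` is hypothetical, and the sketch assumes that a text qualifies for a category if it contains at least one of the "include" keywords and none of the "exclude" keywords; the documentation does not state whether any or all included keywords are required.

```python
import re

def label_text(text, categories):
    """Return the labels of all categories whose rules the text satisfies.

    Hypothetical helper for illustration only, not the step's real code.
    """
    labels = []
    for label, rules in categories.items():
        # Case is ignored unless the category sets case_sensitive to True.
        flags = 0 if rules.get("case_sensitive", False) else re.IGNORECASE

        def found(keyword):
            # Whole-word match, mirroring the documented default whole_words=true.
            pattern = r"\b" + re.escape(keyword) + r"\b"
            return re.search(pattern, text, flags) is not None

        include = rules.get("include", [])
        exclude = rules.get("exclude", [])
        # Assumption: any included keyword suffices, any excluded keyword vetoes.
        if any(found(k) for k in include) and not any(found(k) for k in exclude):
            labels.append(label)
    return labels

categories = {
    "journalists": {
        "include": ["journalist", "journalism", "news"],
        "exclude": ["blogger"],
        "case_sensitive": False,
    },
    "business": {"include": ["startup", "entrepreneur", "founder"]},
    "CEOs": {"include": ["CEO"], "case_sensitive": True},
}

print(label_text("Former journalist, now startup founder and CEO", categories))
print(label_text("Tech blogger covering news", categories))
print(label_text("I met the ceo yesterday", categories))
```

Under these assumptions the first text receives all three labels, the second receives none because "blogger" is excluded, and the third receives none because "CEOs" only matches the capitalized spelling.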

Usage

The following are the step's expected inputs and outputs and their specific types.

label_texts_containing(text_col: text, {"param": value}) -> (labels: list[category])

where the object {"param": value} is optional in most cases and if present may contain any of the parameters described in the corresponding section below.

Inputs


text_col: column:text

A text column to label.

Outputs


labels: column:list[category]

A column containing the labels assigned to each text.

Parameters


A dictionary of labels/categories and their associated keywords (see the example above). Additionally, you can pass the "accent_sensitive", "case_sensitive" and "whole_words" flags to each category. Their defaults are false, false and true respectively, that is, ignore accents and case but match only whole words.


Categories: object

One or more named text categories. Each key is the name/label to show for a specific text category, and its value is an object specifying the terms a text must or must not contain for that label to apply. See also the example above.

Items in Categories

include: array[string]

List of strings a text must include.


exclude: array[string]

List of strings a text must not include.


accent_sensitive: boolean = False

Make the search accent sensitive?


case_sensitive: boolean = False

Make the search case sensitive?


whole_words: boolean = True

Match whole words only?
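
The effect of the three flags above can be illustrated with a short Python sketch. This is one plausible interpretation rather than the step's actual matching code: the helper names `strip_accents` and `keyword_matches` are hypothetical, accent-insensitivity is assumed to mean removing Unicode combining marks, and whole-word matching is assumed to correspond to regular-expression word boundaries.

```python
import re
import unicodedata

def strip_accents(s):
    # Decompose characters (NFD) and drop combining marks, e.g. "é" becomes "e".
    decomposed = unicodedata.normalize("NFD", s)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def keyword_matches(text, keyword, accent_sensitive=False,
                    case_sensitive=False, whole_words=True):
    """Check whether keyword occurs in text under the given flags.

    Hypothetical helper; defaults mirror the documented ones.
    """
    if not accent_sensitive:
        # Assumption: accents are ignored by stripping them from both sides.
        text, keyword = strip_accents(text), strip_accents(keyword)
    pattern = re.escape(keyword)
    if whole_words:
        # Assumption: "whole word" means delimited by regex word boundaries.
        pattern = r"\b" + pattern + r"\b"
    flags = 0 if case_sensitive else re.IGNORECASE
    return re.search(pattern, text, flags) is not None

# Defaults: accents and case are ignored, only whole words match.
print(keyword_matches("Visiting a café", "cafe"))
print(keyword_matches("Visiting a café", "cafe", accent_sensitive=True))
print(keyword_matches("our ceo", "CEO", case_sensitive=True))
print(keyword_matches("startups everywhere", "startup"))
print(keyword_matches("startups everywhere", "startup", whole_words=False))
```

Note that word boundaries behave unexpectedly for keywords that begin or end with non-word characters (for example "C++"), so a real implementation may define whole words differently.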