Given a query of the form “word1; word2 OR word3”, texts containing “word1” will be labeled as “word1”, and texts containing “word2” or “word3” will be labeled as “word2 OR word3”. In other words, each semicolon-separated string acts as both query and corresponding label. Texts matching multiple queries will be assigned multiple labels.
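The labeling rule described above can be sketched in plain Python. This is an illustration only, not the step's implementation: the helper `label_text` is hypothetical, matching is assumed to be case-insensitive on whole words, and exclusion terms are not handled.

```python
import re

def label_text(text, query):
    """Return the query segments (labels) whose keywords occur in `text`."""
    labels = []
    for segment in query.split(";"):
        label = segment.strip()
        if not label:
            continue
        # Each semicolon-separated segment is both the query and the label.
        keywords = [kw.strip() for kw in label.split(" OR ")]
        for kw in keywords:
            if re.search(r"\b" + re.escape(kw) + r"\b", text, flags=re.IGNORECASE):
                labels.append(label)
                break
    return labels

print(label_text("word1 and word3 appear here", "word1; word2 OR word3"))
# prints ['word1', 'word2 OR word3']
```

A text that matches several segments receives all of their labels, and a text that matches none receives an empty list in this sketch.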

Usage

The following example shows how the step can be used in a recipe.

Examples

label_texts_containing_from_query(ds.text, {"query": "startup OR entrepreneur; marketing OR -digital; devops"}) -> (ds.field_of_occupation)

Inputs & Outputs

The following are the inputs expected by the step and the outputs it produces. These are generally columns (ds.first_name), datasets (ds or ds[["first_name", "last_name"]]) or models (referenced by name e.g. "churn-clf").

Inputs

text_col
column[text|category]
required
A text column to label.

Outputs

labels
column
required
A column containing the labels assigned to each text.

Configuration

The following parameters can be used to configure the behaviour of the step by including them in a JSON object as the last “input” to the step, i.e. step(..., {"param": "value", ...}) -> (output).

Parameters

query
string
required
The query used to label texts: a string of labels/categories and their associated keywords (see the example below). Use “;” to separate categories, “OR” to join words within a category, and “-” to exclude words from a category. Category labels are formed from the query itself, e.g. a text containing “AA” and “BB” will be tagged as [AA,BB].
  • Cristiano OR -Five; for
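To make the query syntax concrete, the following sketch splits a query string into categories with their included and excluded keywords. The function `parse_query` and its output format are hypothetical. How the step combines included and excluded terms when matching is not specified here, so the sketch only separates them.

```python
def parse_query(query):
    """Split a query into categories with included and excluded keywords."""
    categories = []
    for segment in query.split(";"):
        label = segment.strip()
        if not label:
            continue
        include, exclude = [], []
        for term in label.split(" OR "):
            term = term.strip()
            if term.startswith("-"):
                # A leading "-" marks a word to exclude from the category.
                exclude.append(term[1:])
            elif term:
                include.append(term)
        categories.append({"label": label, "include": include, "exclude": exclude})
    return categories

for category in parse_query("Cristiano OR -Five; for"):
    print(category)
```

For the example query above, this yields one category labeled "Cristiano OR -Five" (including "Cristiano", excluding "Five") and one labeled "for".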
accent_sensitive
boolean
default:"false"
Whether to make search accent sensitive.
case_sensitive
boolean
default:"false"
Whether to make search case sensitive.
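One common way to obtain accent- and case-insensitive matching is to normalize both the text and the keywords before comparing them. The sketch below assumes that approach; the `normalize` helper is hypothetical and may differ from what the step does internally.

```python
import unicodedata

def normalize(text, accent_sensitive=False, case_sensitive=False):
    """Strip accents and/or case so that comparisons ignore them."""
    if not accent_sensitive:
        # Decompose characters and drop combining marks, e.g. "é" -> "e".
        decomposed = unicodedata.normalize("NFD", text)
        text = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    if not case_sensitive:
        text = text.casefold()
    return text

# With both defaults (false), "Café" and "cafe" compare as equal.
print(normalize("Caf\u00e9") == normalize("cafe"))
```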
whole_words
boolean
default:"true"
Whether to match whole words only. If enabled, only matches a word if it is surrounded by non-alphanumeric characters.
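The whole-word rule can be approximated with a regular expression that rejects matches directly adjacent to an alphanumeric character. This is a sketch under that assumption; the `contains` helper is hypothetical and, for simplicity, case sensitive.

```python
import re

def contains(text, keyword, whole_words=True):
    """Check whether `keyword` occurs in `text`, optionally as a whole word."""
    pattern = re.escape(keyword)
    if whole_words:
        # [^\W_] matches one alphanumeric character; the lookarounds
        # reject matches that have one directly before or after.
        pattern = r"(?<![^\W_])" + pattern + r"(?![^\W_])"
    return re.search(pattern, text) is not None

print(contains("we love devops!", "devops"))                     # True
print(contains("devopsy culture", "devops"))                     # False
print(contains("devopsy culture", "devops", whole_words=False))  # True
```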
first_only
boolean
default:"false"
Whether to return only the first match. If True, only the first match will be assigned to each text. The result will be a simple categorical column. If False, all identified matches will be assigned to each text. The result will be a multivalued column containing lists of categories.
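The effect of first_only on the shape of the output can be illustrated as follows. The `assign` helper is hypothetical, and two details are assumptions: that "first" refers to the order in which matches were collected, and that a text without matches yields None.

```python
def assign(matches, first_only=False):
    """Shape the matched labels for one text according to first_only."""
    if first_only:
        # Single category per text (None here when nothing matched).
        return matches[0] if matches else None
    # Multivalued result: a list of all matched categories.
    return list(matches)

matches = ["startup OR entrepreneur", "devops"]
print(assign(matches, first_only=True))   # startup OR entrepreneur
print(assign(matches, first_only=False))  # ['startup OR entrepreneur', 'devops']
```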