
Label texts containing from query

NLP, text

Label texts given an elastic-like query string.

Given a query of the form "word1; word2 OR word3", texts containing "word1" will be labeled as "word1", and texts containing "word2" or "word3" will be labeled as "word2 OR word3". In other words, each semicolon-separated string acts as both query and corresponding label. Texts matching multiple queries will be assigned multiple labels.
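The labeling rule described above can be sketched in plain Python. This is a minimal illustration, not the step's actual implementation: the function name `label_text` is hypothetical, matching is done on lowercased whitespace-separated tokens, and exclusion with "-" is deliberately ignored here.

```python
def label_text(text, query):
    """Return every sub-query (used verbatim as a label) that matches `text`.

    Hypothetical helper: a sub-query matches when the text contains at
    least one of its OR-joined words.
    """
    words = set(text.lower().split())
    labels = []
    for sub_query in query.split(";"):
        label = sub_query.strip()
        if not label:
            continue  # skip empty segments such as a trailing ";"
        terms = [term.strip().lower() for term in label.split(" OR ")]
        if any(term in words for term in terms):
            labels.append(label)
    return labels
```

With the query from the description, `label_text("word1 and word3 appear here", "word1; word2 OR word3")` returns `['word1', 'word2 OR word3']`, since the text matches both sub-queries.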

Usage


The following are the step's expected inputs and outputs and their specific types.

Step signature
label_texts_containing_from_query(text_col: text|category, {"param": value}) -> (labels: column)

where the object {"param": value} is optional in most cases and, if present, may contain any of the parameters described in the corresponding section below.

Example

Example call (in recipe editor)
label_texts_containing_from_query(ds.text, {"query": "startup OR entrepreneur; marketing OR -digital; devops"}) -> (ds.field_of_occupation)

Inputs


text_col: column:text|category

A text column to label.

Outputs


labels: column

A column containing the labels assigned to each text.

Parameters


query: string

The query used to label texts: a string of categories and their associated keywords (see examples below). Use ";" to separate categories, "OR" to join alternative words within a category, and "-" to exclude words from a category. Labels are formed from the query itself, e.g. with the query "AA; BB", a text containing both "AA" and "BB" will be tagged as [AA, BB].

Example parameter values:

  • "Cristiano OR -Five; for"
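To make the syntax concrete, the following sketch splits a query string into categories with included and excluded words. The names `Category` and `parse_query` are hypothetical, and how the step itself combines "OR" with "-" is not specified here, so this only shows one plausible reading of the three separators.

```python
from dataclasses import dataclass, field


@dataclass
class Category:
    label: str                                    # the sub-query, used verbatim as the label
    include: list = field(default_factory=list)   # words joined with "OR"
    exclude: list = field(default_factory=list)   # words prefixed with "-"


def parse_query(query):
    """Split a query on ";" into categories (assumed syntax, see lead-in)."""
    categories = []
    for part in query.split(";"):
        label = part.strip()
        if not label:
            continue
        category = Category(label=label)
        for term in label.split(" OR "):
            term = term.strip()
            if term.startswith("-"):
                category.exclude.append(term[1:])
            elif term:
                category.include.append(term)
        categories.append(category)
    return categories
```

For the example value above, `parse_query("Cristiano OR -Five; for")` yields two categories: one labeled "Cristiano OR -Five" that includes "Cristiano" and excludes "Five", and one labeled "for" that includes "for".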

accent_sensitive: boolean = False

Whether to make the search accent-sensitive.


case_sensitive: boolean = False

Whether to make the search case-sensitive.
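One common way to get accent- and case-insensitive matching is to normalize both the text and the query words before comparing them. The sketch below assumes that approach; the helper name `normalize` is hypothetical and the step may handle this differently internally.

```python
import unicodedata


def normalize(text, accent_sensitive=False, case_sensitive=False):
    """Fold accents and/or case so that later comparisons ignore them."""
    if not accent_sensitive:
        # Decompose characters (e.g. "é" into "e" plus a combining accent),
        # then drop the combining marks.
        decomposed = unicodedata.normalize("NFD", text)
        text = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    if not case_sensitive:
        text = text.casefold()
    return text
```

Under the default settings, `normalize("Café")` gives `"cafe"`, so a query word "cafe" would match a text containing "Café".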


whole_words: boolean = True

Whether to match whole words only. If enabled, a word matches only if it is surrounded by non-alphanumeric characters.
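The whole-word rule can be approximated with a regular expression that forbids an alphanumeric character directly before or after the word. This is a sketch under assumptions: the helper name `contains` is hypothetical, and treating the start and end of the text as valid boundaries is an assumption not stated above.

```python
import re


def contains(text, word, whole_words=True):
    """Check whether `word` occurs in `text`, optionally as a whole word only."""
    pattern = re.escape(word)
    if whole_words:
        # [^\W_] matches one alphanumeric character (word characters minus "_").
        # The lookarounds reject a match that touches such a character.
        pattern = rf"(?<![^\W_]){pattern}(?![^\W_])"
    return re.search(pattern, text) is not None
```

For example, `contains("(devops)", "devops")` is `True`, while `contains("devopsy", "devops")` is `False` unless `whole_words=False` is passed.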


first_only: boolean = False

Whether to return only the first match. If True, only the first match will be assigned to each text. The result will be a simple categorical column. If False, all identified matches will be assigned to each text. The result will be a multivalued column containing lists of categories.
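The difference in output shape can be illustrated as follows. The helper `finalize` is hypothetical, and two details are assumptions: that "first" refers to the order of categories in the query, and that a text without any match gets no label (`None`) when `first_only` is enabled.

```python
def finalize(matched_labels, first_only=False):
    """Shape the labels matched for one text according to `first_only`."""
    if first_only:
        # Single value per text: suitable for a simple categorical column.
        return matched_labels[0] if matched_labels else None
    # List per text: suitable for a multivalued column.
    return list(matched_labels)
```

Given the matches `["startup OR entrepreneur", "devops"]`, the call with `first_only=True` returns `"startup OR entrepreneur"`, and the default call returns the full list.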