Label texts containing from query¶
NLP • text
Label texts according to an Elasticsearch-style query string.
Given a query of the form "word1; word2 OR word3", texts containing "word1" will be labeled as "word1", and texts containing "word2" or "word3" will be labeled as "word2 OR word3". In other words, each semicolon-separated string acts as both a query and its corresponding label. Texts matching multiple queries will be assigned multiple labels.
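The labeling rule described above can be illustrated with a minimal Python sketch. This is not the step's actual implementation: the function name `label_texts` is hypothetical, matching is assumed to be case-insensitive on whole words, and exclusion with "-" is deliberately ignored here.

```python
import re

def label_texts(texts, query):
    """Return, for each text, the list of query segments it matches."""
    # Each semicolon-separated segment is both a query and its own label.
    segments = [seg.strip() for seg in query.split(";") if seg.strip()]
    results = []
    for text in texts:
        labels = []
        for segment in segments:
            # Terms joined by "OR" are alternatives within one segment.
            terms = [t.strip() for t in segment.split(" OR ") if t.strip()]
            # Assumption: case-insensitive, whole-word matching.
            if any(re.search(rf"\b{re.escape(t)}\b", text, re.IGNORECASE) for t in terms):
                labels.append(segment)
        results.append(labels)
    return results

texts = ["word1 only", "contains word3", "word1 and word2", "nothing here"]
print(label_texts(texts, "word1; word2 OR word3"))
# [['word1'], ['word2 OR word3'], ['word1', 'word2 OR word3'], []]
```

A text that matches several segments receives every matching label, and a text that matches none receives an empty list.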
Usage¶
The following are the step's expected inputs and outputs and their specific types.
label_texts_containing_from_query(text_col: text|category, {
"param": value
}) -> (labels: column)
where the object {"param": value} is optional in most cases and, if present, may contain any of the parameters described in the corresponding section below.
Example¶
label_texts_containing_from_query(ds.text, {"query": "startup OR entrepreneur; marketing OR -digital; devops"}) -> (ds.field_of_occupation)
Inputs¶
text_col: column:text|category
A text column to label.
Outputs¶
labels: column
A column containing the labels assigned to each text.
Parameters¶
query: string
The query used to label texts. It is a string of labels/categories and their associated keywords (see the example below). Use ";" to separate categories, "OR" to join alternative words within a category, and "-" to exclude words from a category. Each category's label is formed from the query itself, e.g. with the query "AA; BB", a text containing both "AA" and "BB" will be tagged as [AA, BB].
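To make the query syntax concrete, here is a small Python sketch of how such a string might be broken down into categories. The function name `parse_query` and the dictionary layout are hypothetical, not part of the step. The sketch only separates included from excluded terms and does not decide how exclusions affect matching, since that behavior is not specified here.

```python
def parse_query(query):
    """Split a query string into categories with included and excluded terms."""
    categories = []
    for segment in query.split(";"):
        label = segment.strip()
        if not label:
            continue  # skip empty segments such as a trailing ";"
        include, exclude = [], []
        for term in label.split(" OR "):
            term = term.strip()
            if term.startswith("-") and len(term) > 1:
                exclude.append(term[1:])  # "-word" marks an excluded word
            elif term and term != "-":
                include.append(term)
        # The label is the category's query text, kept verbatim.
        categories.append({"label": label, "include": include, "exclude": exclude})
    return categories

for category in parse_query("startup OR entrepreneur; marketing OR -digital; devops"):
    print(category)
```

For the query used in the example above, this yields three categories, the second of which has "marketing" as an included term and "digital" as an excluded one.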
Example parameter values:
"Cristiano OR -Five; for"
accent_sensitive: boolean = False
Whether to make the search accent-sensitive.
case_sensitive: boolean = False
Whether to make the search case-sensitive.
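One common way to get accent- and case-insensitive matching is to normalize both the text and the search term before comparing them. The sketch below shows that idea using Python's standard library. It is an assumption about how such flags could be honored, not a description of the step's internals, and the helper name `normalize` is hypothetical.

```python
import unicodedata

def normalize(text, accent_sensitive=False, case_sensitive=False):
    """Prepare a string for comparison according to the two sensitivity flags."""
    if not accent_sensitive:
        # Decompose accented characters, then drop the combining marks.
        decomposed = unicodedata.normalize("NFD", text)
        text = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    if not case_sensitive:
        # casefold() is a more aggressive lower() intended for comparisons.
        text = text.casefold()
    return text

print(normalize("Caf\u00e9"))                       # cafe
print(normalize("Caf\u00e9", case_sensitive=True))  # Cafe
```

With both flags at their defaults (False), "Café", "café" and "CAFE" all normalize to the same string and would therefore match the same keyword.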
whole_words: boolean = True
Whether to match whole words only. If enabled, a word matches only when it is surrounded by non-alphanumeric characters.
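The whole-word rule can be approximated with regular-expression lookarounds, as in the sketch below. The helper `contains_term` is hypothetical, and treating the start and end of the text as valid word boundaries is an assumption, since only surrounding characters are mentioned above. The comparison here is case-sensitive to keep the example focused on word boundaries.

```python
import re

def contains_term(text, term, whole_words=True):
    """Check whether term occurs in text, optionally as a whole word only."""
    pattern = re.escape(term)
    if whole_words:
        # [^\W_] matches a single alphanumeric character, so the lookarounds
        # reject a match that has a letter or digit directly before or after it.
        pattern = rf"(?<![^\W_]){pattern}(?![^\W_])"
    return re.search(pattern, text) is not None

print(contains_term("our devops team", "devops"))            # True
print(contains_term("devopsy", "devops"))                    # False
print(contains_term("devopsy", "devops", whole_words=False)) # True
```

Punctuation and underscores count as non-alphanumeric in this sketch, so "devops." and "dev_ops" both contain whole-word matches ("devops" and "ops" respectively).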
first_only: boolean = False
Whether to return only the first match. If True, only the first match will be assigned to each text. The result will be a simple categorical column. If False, all identified matches will be assigned to each text. The result will be a multivalued column containing lists of categories.
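The difference between the two output shapes can be sketched as follows. The helper `shape_output` is hypothetical, "first" is assumed to mean the first label in the list of matches, and using None for texts without any match is also an assumption.

```python
def shape_output(matches_per_text, first_only=False):
    """Turn per-text match lists into a categorical or multivalued column."""
    if first_only:
        # One label per text: a simple categorical column.
        return [labels[0] if labels else None for labels in matches_per_text]
    # A list of labels per text: a multivalued column.
    return [list(labels) for labels in matches_per_text]

matches = [["startup OR entrepreneur", "devops"], [], ["devops"]]
print(shape_output(matches, first_only=True))
# ['startup OR entrepreneur', None, 'devops']
print(shape_output(matches))
# [['startup OR entrepreneur', 'devops'], [], ['devops']]
```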