Extract regex¶
fast step text • NLP • regular expression
Extract parts of texts detected using regular expressions.
A regular expression (or regex, regex pattern) is a sequence of characters that forms a search pattern.
This pattern is compared against texts, and any matches are returned. The matches don't have to be returned as found, but can be formatted using the output parameter. Check the references below to familiarize yourself with the regex language:
Also see the pattern parameter below for more details.
Usage¶
The following are the step's expected inputs and outputs and their specific types.
extract_regex(text: text|category, {"param": value}) -> (text_extracted: column)
where the object {"param": value} is optional in most cases and, if present, may contain any of the parameters described in the corresponding section below.
Example¶
Extract all Twitter mentions with handles between 1 and 15 characters long into lists of mentions:
extract_regex(ds.text, {
"pattern": "@\\w{1,15}",
"extract_all": true
}) -> (ds.mentions)
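To try the pattern outside the step, the example above can be approximated in plain Python. This is only a sketch: it assumes the step applies standard Python `re` semantics to each text, and the sample texts are made up for illustration. How the step itself represents texts without any match is not stated here, so the empty list below is Python's behaviour, not necessarily the step's.

```python
import re

# Same pattern as in the example. In the step's JSON the backslash is
# escaped ("\\w"); in a Python raw string it is written once.
MENTION = re.compile(r"@\w{1,15}")

# Hypothetical sample data, not taken from any real dataset.
texts = [
    "Thanks @alice and @bob_42 for the help!",
    "No mentions here.",
]

# One list of matches per text, mirroring "extract_all": true.
mentions = [MENTION.findall(t) for t in texts]
print(mentions)
```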
Inputs¶
text: column:text|category
A text column to extract parts from.
Outputs¶
text_extracted: column
A column containing the extracted part (or parts) for each text. The column's data type depends on the input and the specified parameters:
- Output column has type text when "concat_matches": true or "extract_all": false (matches are strings), and "as_category": false
- Output column has type category when "concat_matches": true or "extract_all": false (matches are strings), and "as_category": true
- Output column has type list[category] when "extract_all": true and "concat_matches": false
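The three rules above can be written as a small decision function. This is a sketch of the documented rules only, not the step's actual implementation; the function name `output_type` is hypothetical, and the defaults mirror the parameter defaults listed in this page.

```python
def output_type(extract_all=False, concat_matches=False, as_category=True):
    """Return the output column type implied by the three parameters."""
    # Matches are plain strings when they are concatenated, or when only
    # the first match is extracted.
    matches_are_strings = concat_matches or not extract_all
    if not matches_are_strings:
        return "list[category]"
    return "category" if as_category else "text"


print(output_type())                      # defaults
print(output_type(extract_all=True))      # lists of matches
print(output_type(as_category=False))     # plain text
```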
Parameters¶
pattern: string
A regular expression. The pattern to be matched in input texts. May include (numbered) regex capturing groups, which allow this step to use parts of a match to format the way matches are represented in the output via the output parameter. The latter uses Python-style string replacement with curly braces and numerical identifiers, e.g. "{1}", instead of the usual regex syntax using backslashes, like "\1". Numerical identifiers refer to capturing groups in the regex pattern (named groups are not supported), where
- 0 is the whole match
- 1 is the 1st capturing group
- 2 is the 2nd capturing group
- etc.
The default is "{0}", i.e. simply returning the full match.
For example, if a column of texts includes Twitter mentions of the form "@abc", the regular expression
"pattern": "(@)(\\w*)"
will match these mentions and save the "@" character and the actual name in two separate capturing groups. Using the output format
"output": "Match: {0}, Tag: {1}, Name: {2}"
will then return matches in the form "Match: @abc, Tag: @, Name: abc".
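The pattern and output format from the example above can be reproduced in Python to see how numbered placeholders map to capturing groups. This is a minimal sketch that assumes the output string behaves like Python's `str.format` with positional arguments; the helper `format_match` and the sample text are hypothetical, and the step's real implementation may differ.

```python
import re


def format_match(match, output="{0}"):
    # {0} is the whole match; {1}, {2}, ... are the capturing groups in order.
    return output.format(match.group(0), *match.groups())


m = re.search(r"(@)(\w*)", "hello @abc, welcome")
formatted = format_match(m, "Match: {0}, Tag: {1}, Name: {2}")
print(formatted)

# With the default output format only the full match is returned.
print(format_match(m))
```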
Example parameter values:
"@\\w{1,15}"
output: string = "{0}"
Output format string. Determines how matches will be represented in the output. Use numbers in curly braces to refer to captured groups.
Example parameter values:
"{0}"
flags: array[string] = []
Match criteria. Python regex flags determining how to match, e.g. ["ascii", "ignorecase"].
Items in flags
item: string
Must be one of: "ascii", "ignorecase", "locale", "multiline", "dotall"
extract_all: boolean = False
Whether to extract all matches (as lists) rather than the first match only.
concat_matches: boolean = False
Whether to concatenate all matches into a single text string.
separator: string = " "
The character (or string of characters) to use when concatenating multiple matches.
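The interplay of flags, extract_all, concat_matches and separator can be illustrated with a short Python sketch. Several things here are assumptions rather than documented behaviour: that the flag names map to the Python `re` flags of the same name, that concat_matches joins all matches whether or not extract_all is set, and that a text without matches yields `None` when only the first match is requested. The function `extract` is hypothetical.

```python
import re

# Assumed mapping from the step's flag names to Python's re flags.
# Note: Python only accepts re.LOCALE for bytes patterns, not str patterns.
FLAG_MAP = {
    "ascii": re.ASCII,
    "ignorecase": re.IGNORECASE,
    "locale": re.LOCALE,
    "multiline": re.MULTILINE,
    "dotall": re.DOTALL,
}


def extract(text, pattern, flags=(), extract_all=False,
            concat_matches=False, separator=" "):
    # Combine the named flags into a single bitmask.
    combined = 0
    for name in flags:
        combined |= FLAG_MAP[name]
    matches = [m.group(0) for m in re.finditer(pattern, text, combined)]
    if concat_matches:
        return separator.join(matches)      # one string per text
    if extract_all:
        return matches                      # list of matches per text
    return matches[0] if matches else None  # first match only (assumed None if absent)


text = "cc @Alice and @bob"
first = extract(text, r"@[a-z]+", flags=["ignorecase"])
all_matches = extract(text, r"@[a-z]+", flags=["ignorecase"], extract_all=True)
joined = extract(text, r"@[a-z]+", flags=["ignorecase"],
                 concat_matches=True, separator=", ")
print(first, all_matches, joined)
```

Without the "ignorecase" flag the same pattern skips "@Alice", since `[a-z]` then only matches lowercase letters.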
as_category: boolean = True
Whether to return a categorical rather than text column. Applies when the result would be text strings rather than lists ("extract_all": false or "concat_matches": true).