extract_regex
Extract parts of texts detected using regular expressions.
A regular expression (or regex, regex pattern) is a sequence of characters that forms a search pattern.
This pattern is compared against texts, and any matches are returned. The matches don’t have to be returned as found, but can be formatted using the output parameter. Check the references below to familiarize yourself with the regex language:
Also see the pattern parameter below for more details.
Usage
The following example shows how the step can be used in a recipe.
Extract all Twitter mentions with handles between 1 and 15 characters long into lists of mentions.
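The matching logic behind such an example can be sketched in plain Python. This is only an illustration using Python's standard re module, not the step's own implementation or recipe syntax; the pattern and the helper name extract_mentions are assumptions made for this sketch.

```python
import re

# Assumed pattern: "@" followed by 1 to 15 word characters. The trailing \b
# rejects handles longer than 15 characters instead of truncating them.
MENTION = re.compile(r"@\w{1,15}\b")

def extract_mentions(text):
    """Return every mention found in text as a list (hypothetical helper)."""
    return MENTION.findall(text)

texts = [
    "Thanks @alice and @bob_99!",
    "no mentions here",
    "@this_handle_is_far_too_long hi",
]
# One list of mentions per input text, mirroring "extract all matches as lists".
mentions = [extract_mentions(t) for t in texts]
```

Each input text yields its own list, and a text without any valid mention yields an empty list.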
Inputs & Outputs
The following are the inputs expected by the step and the outputs it produces. These are generally columns (ds.first_name), datasets (ds or ds[["first_name", "last_name"]]) or models (referenced by name e.g. "churn-clf").
Configuration
The following parameters can be used to configure the behaviour of the step by including them in a json object as the last “input” to the step, i.e. step(..., {"param": "value", ...}) -> (output).
A regular expression.
The pattern to be matched in input texts. May include (numbered) regex capturing groups, which allow this method to use parts of a match to format the way matches are represented in the output via the output parameter. The latter uses google-re2 string replacement with curly braces and numerical identifiers, e.g. "{1}" instead of the usual regex syntax using backslashes, like “\1”. Numerical identifiers refer to capturing groups in the regex pattern (named groups are not supported), where
- 0 is the whole match
- 1 is the 1st capturing group
- 2 is the 2nd capturing group
- etc…
The default is "{0}", i.e. simply returning the full match.
For example, if a column of texts includes Twitter mentions of the form “@abc”, the regular expression
"pattern": "(@)(\\w*)"
will match these mentions and save the “@” character and the actual name in two separate capturing groups. Using the output format
"output": "Match: {0}, Tag: {1}, Name: {2}"
will then return matches in the form “Match: @abc, Tag: @, Name: abc”.
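The effect of numbered placeholders can be approximated in plain Python. The sketch below is an emulation using Python's re module and str.format, not the step's actual google-re2 replacement; the helper name format_matches is hypothetical.

```python
import re

def format_matches(text, pattern, output="{0}"):
    """Format each match with {0} as the whole match and {1}, {2}, ...
    as the numbered capturing groups (hypothetical helper)."""
    return [
        output.format(m.group(0), *m.groups())
        for m in re.finditer(pattern, text)
    ]

# Same pattern and output format as in the example above. The JSON string
# "(@)(\\w*)" corresponds to the raw Python string r"(@)(\w*)".
formatted = format_matches(
    "hello @abc",
    r"(@)(\w*)",
    "Match: {0}, Tag: {1}, Name: {2}",
)
# formatted == ["Match: @abc, Tag: @, Name: abc"]
```

With the default output "{0}", the same helper simply returns the full matches unchanged.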
Output format string. Determines how matches will be represented in the output. Use numbers in curly braces to refer to captured groups.
Match criteria. Python-style regex flags determining how to match.
Whether to extract first match only or all matches (as lists).
Whether to concatenate all matches into a single text string.
The character (or string of characters) to use when concatenating multiple matches.
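How the three options above might interact can be sketched in Python. This is an assumption-laden illustration, not the step's implementation: the function name, the argument name separator, its default value, and the choice to return None when nothing matches are all hypothetical.

```python
import re

def extract(text, pattern, extract_all=True, concat_matches=False, separator=", "):
    """Illustrative combination of the extraction options (hypothetical)."""
    matches = [m.group(0) for m in re.finditer(pattern, text)]
    if not extract_all:
        # First match only; returning None for "no match" is an assumption.
        return matches[0] if matches else None
    if concat_matches:
        # All matches joined into a single text string.
        return separator.join(matches)
    # All matches as a list.
    return matches

as_list = extract("a1 b22 c333", r"\d+")
first_only = extract("a1 b22 c333", r"\d+", extract_all=False)
joined = extract("a1 b22 c333", r"\d+", concat_matches=True, separator="|")
```

In this sketch only the list case produces a list per text; the other two cases produce a single string per text.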
Whether to return a categorical rather than text column.
When the result would be text strings rather than lists ("extract_all": false or "concat_matches": false), whether to return a column of type category rather than text.