A regular expression (or regex, regex pattern) is a sequence of characters that forms a search pattern. This pattern is compared against the input texts, and any matches are returned. Matches don't have to be returned exactly as found; they can be reformatted using the output parameter. Check the references below to familiarize yourself with the regex language:

Also see the pattern parameter below for more details.

pattern
string
required

A regular expression. The pattern to be matched in the input texts. It may include (numbered) regex capturing groups, which allow this method to use parts of a match when formatting how matches are represented in the output via the output parameter. The output parameter uses google-re2 string replacement with curly braces and numerical identifiers, e.g. "{1}", instead of the usual regex syntax using backslashes, like “\1”. Numerical identifiers refer to capturing groups in the regex pattern (named groups are not supported), where

  • 0 is the whole match
  • 1 is the 1st capturing group
  • 2 is the 2nd capturing group
  • etc…

The default is "{0}", i.e. simply returning the full match.

For example, if a column of texts includes Twitter mentions of the form “@abc”, the regular expression

"pattern": "(@)(\\w*)"

will match these mentions and save the “@” character and the actual name in two separate capturing groups. Using the output format

"output": "Match: {0}, Tag: {1}, Name: {2}"

will then return matches in the form “Match: @abc, Tag: @, Name: abc”.
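The substitution described above can be approximated in plain Python. This is only a sketch: it uses Python's built-in re module rather than google-re2 (the two behave the same for this simple pattern), and format_match is a hypothetical helper written for illustration, not part of this step's API.

```python
import re

def format_match(match: re.Match, output: str = "{0}") -> str:
    """Replace each {n} placeholder in `output` with capturing group n of `match`.

    Hypothetical helper: mimics the documented "{0}", "{1}", ... placeholders.
    Groups that did not participate in the match are rendered as empty strings.
    """
    return re.sub(
        r"\{(\d+)\}",
        lambda placeholder: match.group(int(placeholder.group(1))) or "",
        output,
    )

# Same pattern as in the example above, written as a Python raw string.
pattern = re.compile(r"(@)(\w*)")
match = pattern.search("Thanks @abc for the help")

result = format_match(match, "Match: {0}, Tag: {1}, Name: {2}")
print(result)  # Match: @abc, Tag: @, Name: abc
```

With the default output of "{0}", the same helper returns just the full match, “@abc”.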

output
string
default: "{0}"

Output format string. Determines how matches will be represented in the output. Use numbers in curly braces to refer to captured groups.

flags
array[string]

Match criteria. Python-style regex flags determining how to match.
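To illustrate what Python-style flags do, the sketch below applies two of them with Python's re module. The exact flag names this step accepts are not listed here, so the strings passed to the hypothetical combine_flags helper are an assumption based on Python's own flag names.

```python
import re

def combine_flags(names):
    """Combine Python flag names (e.g. "IGNORECASE") into one flags value.

    Hypothetical helper; assumes the names match attributes of Python's re module.
    """
    flags = 0
    for name in names:
        flags |= getattr(re, name.upper())
    return flags

text = "Hello WORLD\nhello world"

# Without flags, "^" only anchors at the start of the text and matching is
# case-sensitive, so the capitalized "Hello" does not match.
default_matches = re.findall(r"^hello", text)

# IGNORECASE makes matching case-insensitive; MULTILINE lets "^" anchor at
# the start of every line.
flagged_matches = re.findall(r"^hello", text, flags=combine_flags(["IGNORECASE", "MULTILINE"]))

print(default_matches)  # []
print(flagged_matches)  # ['Hello', 'hello']
```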

extract_all
boolean

Whether to extract only the first match or all matches (as lists).

concat_matches
boolean

Whether to concatenate all matches into a single text string.

separator
string
default: " "

The character (or string of characters) to use when concatenating multiple matches.
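The way extract_all, concat_matches and separator interact can be sketched as follows. This is an illustrative Python approximation, not the step's actual implementation: the extract function is hypothetical, and its handling of texts without any match (returning None) and of concat_matches without extract_all (ignored) are assumptions.

```python
import re

def extract(text, pattern, extract_all=False, concat_matches=False, separator=" "):
    """Hypothetical sketch of the extraction options described above."""
    matches = [m.group(0) for m in re.finditer(pattern, text)]
    if not extract_all:
        # First match only; None for "no match" is an assumption.
        return matches[0] if matches else None
    if concat_matches:
        # All matches joined into a single text string.
        return separator.join(matches)
    # All matches as a list.
    return matches

text = "ping @abc and @xyz"
pattern = r"@\w+"

first_only = extract(text, pattern)
as_list = extract(text, pattern, extract_all=True)
as_text = extract(text, pattern, extract_all=True, concat_matches=True, separator=", ")

print(first_only)  # @abc
print(as_list)     # ['@abc', '@xyz']
print(as_text)     # @abc, @xyz
```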

as_category
boolean
default: "true"

Whether to return a categorical rather than text column. Only applies when the result would be text strings rather than lists ("extract_all": false or "concat_matches": true); in that case, determines whether the returned column is of type category rather than text.