Extract regex

text · NLP · regular expression

Extract parts of texts detected using regular expressions.

A regular expression (or regex, regex pattern) is a sequence of characters that forms a search pattern. This pattern is compared against each text, and any matches are returned. Matches don't have to be returned exactly as found; they can be reformatted using the output parameter. Check the references below to familiarize yourself with the regex language:

Also see the pattern parameter below for more details.

Example

Extract all Twitter mentions with handles between 1 and 15 characters long, returning a list of mentions for each text:

extract_regex(ds.text, {
  "pattern": "@\\w{1,15}",
  "extract_all": true
}) -> (ds.mentions)
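
For intuition, the example above behaves roughly like the following plain-Python sketch. It assumes the step is backed by Python's re module (the flags parameter below suggests this, but it is not confirmed here). The sample texts and the extract_mentions helper are made up for illustration, and returning an empty list for texts without matches is an assumption.

```python
import re

# Same pattern as in the example; a Python raw string avoids the doubled backslash
# that is needed inside the step's JSON-style parameters.
MENTION = re.compile(r"@\w{1,15}")

def extract_mentions(text):
    """Return all mentions found in a single text, in order of appearance."""
    return MENTION.findall(text)

texts = ["Thanks @alice and @bob_92!", "No mentions here."]
mentions = [extract_mentions(t) for t in texts]
print(mentions)  # [['@alice', '@bob_92'], []]
```

Note that {1,15} caps the length of the match rather than rejecting longer handles: a 20-character handle would yield its first 15 characters.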

Usage

The following are the step's expected inputs and outputs and their specific types.

extract_regex(text: text, {"param": value}) -> (text_extracted: column)

where the object {"param": value} is optional in most cases and if present may contain any of the parameters described in the corresponding section below.

Inputs


text: column:text

A text column to extract parts from.

Outputs


text_extracted: column

A column containing the extracted part (or parts) for each text. The column's data type will depend on the input and specified parameters:

  • Output column has type text when:
    "concat_matches": true or "extract_all": false (matches are strings), and "as_category": false
  • Output column has type category when:
    "concat_matches": true or "extract_all": false (matches are strings), and "as_category": true
  • Output column has type list[category] when:
    "extract_all": true and "concat_matches": false
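
The rules above can be summarised as a small decision function. This is only a sketch of the documented rules, not the step's actual implementation; output_type is a hypothetical helper, and its defaults mirror the parameter defaults listed further down.

```python
def output_type(extract_all=False, concat_matches=False, as_category=True):
    """Return the documented output column type for a parameter combination."""
    # Lists are only produced when all matches are kept and not concatenated.
    if extract_all and not concat_matches:
        return "list[category]"
    # Otherwise each row holds a single string; as_category picks its type.
    return "category" if as_category else "text"

print(output_type())                                       # category
print(output_type(as_category=False))                      # text
print(output_type(extract_all=True))                       # list[category]
print(output_type(extract_all=True, concat_matches=True))  # category
```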

Parameters


pattern: string

A regular expression: the pattern to be matched in the input texts. It may include (numbered) regex capturing groups, which let you use parts of a match to format how matches are represented in the output, via the output parameter. The output parameter uses Python-style string replacement with curly braces and numerical identifiers, e.g. "{1}", instead of the usual regex syntax using backslashes, like "\1". Numerical identifiers refer to capturing groups in the regex pattern (named groups are not supported), where

  • 0 is the whole match
  • 1 is the 1st capturing group
  • 2 is the 2nd capturing group
  • etc...

The default output format is "{0}", i.e. simply returning the full match.

For example, if a column of texts includes Twitter mentions of the form "@abc", the regular expression

"pattern": "(@)(\\w*)"

will match these mentions and save the "@" character and the actual name in two separate capturing groups. Using the output format

"output": "Match: {0}, Tag: {1}, Name: {2}"

will then return matches in the form "Match: @abc, Tag: @, Name: abc".
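
A minimal Python sketch of how such an output format could be applied, assuming the curly-brace identifiers behave like positional arguments to Python's str.format. The format_matches helper and the sample text are hypothetical, not part of the step itself.

```python
import re

def format_matches(text, pattern, output="{0}"):
    """Format every match of `pattern` in `text` using an output template."""
    results = []
    for m in re.finditer(pattern, text):
        # m.group(0) is the whole match; m.groups() holds groups 1, 2, ...
        results.append(output.format(m.group(0), *m.groups()))
    return results

formatted = format_matches(
    "ping @abc please",
    r"(@)(\w*)",
    "Match: {0}, Tag: {1}, Name: {2}",
)
print(formatted)  # ['Match: @abc, Tag: @, Name: abc']
```

With the default "{0}" template the function simply returns the full matches, which corresponds to the documented default.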

Example parameter values:

  • "@\\w{1,15}"

output: string = "{0}"

Output format string. Determines how matches will be represented in the output. Use numbers in curly braces to refer to captured groups.

Example parameter values:

  • "{0}"

flags: array[string] = []

Match criteria. Python regex flags determining how to match, e.g. ["ascii", "ignorecase"].

Items in flags

item: string

Must be one of: "ascii", "ignorecase", "locale", "multiline", "dotall"
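
Assuming the flag names map one-to-one onto the constants of Python's re module (an assumption based on the "Python regex flags" wording above), combining them could look like the following sketch. FLAGS and combine_flags are illustrative names only.

```python
import re

# Assumed mapping from the step's flag names to Python's `re` constants.
FLAGS = {
    "ascii": re.ASCII,
    "ignorecase": re.IGNORECASE,
    "locale": re.LOCALE,
    "multiline": re.MULTILINE,
    "dotall": re.DOTALL,
}

def combine_flags(names):
    """OR together the `re` constants for a list of flag names."""
    combined = 0
    for name in names:
        combined |= FLAGS[name]
    return combined

# "multiline" lets ^ match at the start of every line,
# "ignorecase" makes "hello" match "Hello" as well.
pattern = re.compile(r"^hello\w*", combine_flags(["ignorecase", "multiline"]))
found = pattern.findall("Hello\nhelloWorld")
print(found)  # ['Hello', 'helloWorld']
```

In plain Python 3, re.LOCALE is only accepted for bytes patterns, so "locale" may not be useful for ordinary text columns; how the step itself handles it is not documented here.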


extract_all: boolean = False

Whether to extract all matches (as a list per text) rather than only the first match.


concat_matches: boolean = False

Whether to concatenate all matches into a single text string.


separator: string = " "

The character (or string of characters) to use when concatenating multiple matches.
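
The interplay of extract_all, concat_matches and separator can be sketched in Python as follows. The extract helper is hypothetical; in particular, returning None when nothing matches, and collecting all matches whenever concat_matches is set, are assumptions rather than documented behaviour.

```python
import re

def extract(text, pattern, extract_all=False, concat_matches=False, separator=" "):
    """Mimic the three result shapes: first match, list of matches, joined string."""
    matches = [m.group(0) for m in re.finditer(pattern, text)]
    if concat_matches:
        # All matches joined into one string using the separator.
        return separator.join(matches)
    if extract_all:
        # One list of matches per text.
        return matches
    # First match only (None if there is no match: an assumption).
    return matches[0] if matches else None

text = "cc @alice @bob"
handle = r"@\w{1,15}"
print(extract(text, handle))                    # @alice
print(extract(text, handle, extract_all=True))  # ['@alice', '@bob']
print(extract(text, handle, extract_all=True,
              concat_matches=True, separator=", "))  # @alice, @bob
```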


as_category: boolean = True

Whether to return a categorical rather than a text column. Only applies when the result consists of text strings rather than lists ("extract_all": false or "concat_matches": true), in which case it determines whether the column has type category rather than text.