Replace regex

NLP · text

Replace parts of text detected with a regular expression.

A regular expression (or regex, regex pattern) is a sequence of characters that forms a search pattern. This pattern is compared against texts, and any matches are substituted by a desired replacement. The replacement can be a simple (constant) text string, or a formatting pattern referencing all or parts of the matched character sequence.

Replacing one fixed text string with another is straightforward. E.g., to replace all occurrences of "hi" with "hello", you'd simply use {"pattern": "hi", "replacement": "hello"}. Capturing groups in the pattern and replacement parameters allow for much greater flexibility. For example, if a column of texts includes Twitter mentions of the form "@abc", the regular expression "pattern": "@(\\w+)" will match these mentions and save the actual name without the "@" character in a capturing group. Using the replacement string "replacement": "{1}" will then replace each matched mention with only the name part of the Twitter handle, effectively removing the "@" tags from all mentions (without removing "@" characters that are not directly followed by a word character).
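To see the capturing-group idea in isolation, here is a minimal sketch in plain Python. It assumes the step behaves like Python's re.sub, with r"\1" playing the role of the step's "{1}"; the sample texts are made up for illustration. It uses \w+ (one or more word characters) so that a lone "@" is left untouched.

```python
import re

# Hypothetical sample texts, not taken from any real dataset.
texts = ["Thanks @abc for the tip!", "Meet @ noon with @xyz"]

# "@(\w+)" captures the handle without the "@" in group 1;
# r"\1" is Python's backslash syntax for what the step writes as "{1}".
cleaned = [re.sub(r"@(\w+)", r"\1", t) for t in texts]

for original, result in zip(texts, cleaned):
    print(f"{original!r} -> {result!r}")
```

Note that an "@" followed directly by word characters is treated as a mention wherever it occurs, so an email address such as "info@example.com" would also lose its "@".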

To further familiarize yourself with the regex language also see these references:

Example

To change the way dates are formatted in a column of texts from "2019-04-15" to "15.04.2019": the specified pattern matches three numbers separated by hyphens and replaces each such occurrence with the same three numbers in reverse order, separated by periods.

replace_regex(ds.text, {
    "pattern": "(\\d+)-(\\d+)-(\\d+)",
    "replacement": "{3}.{2}.{1}"
}) -> (ds.replaced)
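The pattern (\d+)-(\d+)-(\d+) matches any three hyphen-separated numbers, not only dates. The sketch below, written in plain Python under the assumption that the step matches the way Python's re module does, compares it with a stricter variant. The strict pattern and the sample text are illustrative choices, not part of the step's documentation.

```python
import re

loose = r"(\d+)-(\d+)-(\d+)"
# Stricter variant: exactly 4-2-2 digits, delimited by word boundaries.
strict = r"\b(\d{4})-(\d{2})-(\d{2})\b"

text = "Released 2019-04-15, call 555-12-3456"

# r"\3.\2.\1" is Python's equivalent of the step's "{3}.{2}.{1}".
loose_result = re.sub(loose, r"\3.\2.\1", text)
strict_result = re.sub(strict, r"\3.\2.\1", text)

print(loose_result)   # the phone-like number is reordered as well
print(strict_result)  # only the ISO-style date is reordered
```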

Usage

The following are the step's expected inputs and outputs and their specific types.

replace_regex(text: text|list, {"param": value}) -> (replaced: column)

where the object {"param": value} is optional in most cases and if present may contain any of the parameters described in the corresponding section below.

Inputs


text: column:text|list

A column containing text-like values.

Outputs


replaced: column

The output column's data type will depend on the input and specified parameters:

  • text: if input is text and parameter "as_category": false
  • category: if input is not a column of lists and "as_category": true
  • list: if input is a column of lists.

Parameters


pattern: string

Regular expression to be matched in your input texts. The regex pattern may include (numbered) regex capturing groups, which allows this method to use parts of a match to format the way matches are then replaced in the output via the replacement parameter (see below).


replacement: string = ""

A format string determining how matches will be replaced in the text. The replacement string uses Python-style format fields with curly braces and numerical identifiers, e.g. "{1}", instead of the usual regex syntax using backslashes ("\1"). Numerical identifiers refer to capturing groups in the regex pattern (named groups are not supported), where

  • 0 is the whole match
  • 1 is the 1st capturing group
  • 2 is the 2nd capturing group
  • etc...

The default is the empty string "", i.e. matched parts will be removed from the text.
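One way to picture the curly-brace syntax is as Python's str.format applied to each match. The following is only a sketch of that idea, not the step's actual implementation; the helper name replace_with_format is hypothetical.

```python
import re

def replace_with_format(text, pattern, replacement="", flags=0):
    """Hypothetical helper: emulate "{n}"-style replacements with str.format."""
    def expand(match):
        # Position 0 is the whole match, positions 1.. are the capturing
        # groups; groups that did not participate are treated as "".
        fields = [match.group(0)] + [g or "" for g in match.groups()]
        return replacement.format(*fields)
    return re.sub(pattern, expand, text, flags=flags)

print(replace_with_format("2019-04-15", r"(\d+)-(\d+)-(\d+)", "{3}.{2}.{1}"))
print(replace_with_format("hi there", "hi", "[{0}]"))
print(replace_with_format("a1b22", r"\d+"))  # default "" removes the matches
```

In this sketch, literal curly braces in the replacement would need to be doubled ("{{" and "}}"), as usual for str.format; whether the step itself requires this is not stated here.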


flags: array[string] = 0

Regex configuration flags. Uses Python regex flags determining how to perform matches, e.g. ["a", "ignorecase"].

Items in flags

item: string

Must be one of: "a", "ascii", "debug", "i", "ignorecase", "l", "locale", "m", "multiline", "s", "dotall", "x", "verbose"
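Each of the allowed names corresponds to a constant of the same (upper-cased) name in Python's re module. The sketch below shows how such a list could be combined into a single flags value. The helper combine_flags is hypothetical, and the step's internal handling may differ.

```python
import re

def combine_flags(names):
    """Hypothetical helper: OR together re flags given by name, e.g. ["i", "multiline"]."""
    combined = 0  # 0 means "no flags"
    for name in names:
        # "i" -> re.I, "ignorecase" -> re.IGNORECASE, "dotall" -> re.DOTALL, ...
        combined |= getattr(re, name.upper())
    return combined

flags = combine_flags(["i", "multiline"])

# With both flags set, "^hello" matches at the start of every line,
# regardless of case.
result = re.sub(r"^hello", "Hi", "Hello\nhello", flags=flags)
print(repr(result))
```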


as_category: boolean = True

Whether to cast the result to the category data type when it would otherwise be text.