Interprets and changes a column’s data to another (semantic) type.
This has two consequences:
First, it allows the resulting column to be used by steps that only accept the new type,
e.g. casting a column of concatenated texts to the "url" type so that it may be used
where URLs are expected (e.g. the step fetch_url_content).
Second, it changes any values that do not conform to the new type to the missing value (NaN). E.g.,
casting a column of mixed data containing numbers to the "number" type will replace all
values that cannot be read as numbers with NaN.
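The replacement of non-conformant values can be pictured with a minimal Python sketch. This only illustrates the behaviour described above and is not the step's actual implementation; the helper name to_number is hypothetical.

```python
import math


def to_number(value):
    """Return value as a float, or NaN when it cannot be read as a number.

    Hypothetical helper, for illustration only.
    """
    try:
        return float(value)
    except (TypeError, ValueError):
        # Anything that is not readable as a number becomes the missing value.
        return math.nan


mixed = [3, "4.5", "n/a", None, "12"]
numbers = [to_number(v) for v in mixed]
```

Here the entries "n/a" and None end up as NaN, while the remaining entries become floats.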
Note that for each possible type a column can be cast to (via the "type" parameter, e.g. "number",
"category" etc.), the step accepts different configuration parameters. See the subsections under
Parameters below for further details.
General syntax for using the step in a recipe. Shows the inputs the step expects to receive and the outputs it will produce. For further details see the sections below.
The following are the inputs expected by the step and the outputs it produces. These are generally
columns (ds.first_name), datasets (ds or ds[["first_name", "last_name"]]) or models (referenced
by name e.g. "churn-clf").
The following parameters can be used to configure the behaviour of the step by including them in
a json object as the last “input” to the step, i.e. step(..., {"param": "value", ...}) -> (output).
Separator to mark the decimal part.
Use "." or "," to indicate how decimal values are separated when parsing text strings
into numerical format. It is automatically assumed that the other character is used as
the thousands separator. E.g. "decimal": "." assumes that the period "." is used to
separate decimals and "," thousands, as in the number string “12,173.12”. Values must be one of "." or ",".
Separator to mark the thousands.
Use "." or "," to indicate how thousands are separated when parsing text strings
into numerical format. It is automatically assumed that the other character is used as
the decimal separator. E.g. "thousand": "." assumes that the period "." is used to
separate thousands and "," decimals, as in the number string “12.173,12”. Values must be one of "." or ",".
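As a rough illustration of how the two separators interact, the following Python sketch parses number strings given only the decimal separator and derives the thousands separator from it. The function parse_number is a hypothetical helper, not part of the step.

```python
import math


def parse_number(text, decimal="."):
    """Parse a number string given its decimal separator ("." or ",").

    Hypothetical helper: the other character is assumed to be the
    thousands separator, mirroring the description above.
    """
    if decimal not in (".", ","):
        raise ValueError('decimal must be "." or ","')
    thousand = "," if decimal == "." else "."
    # Drop thousands separators, then normalise the decimal mark to ".".
    cleaned = text.strip().replace(thousand, "").replace(decimal, ".")
    try:
        return float(cleaned)
    except ValueError:
        return math.nan


english_style = parse_number("12,173.12", decimal=".")
continental_style = parse_number("12.173,12", decimal=",")
```

Both calls yield the same number, since the two strings differ only in which character plays which role.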
A 2-character string identifying the opening and closing brackets used to identify list strings.
For example "[]" or "()". If null, any possible bracket characters at the beginning and end of a
string will be removed before parsing the elements.
Separation character for split strings.
Which separation character to use to split input string into list elements.
Note that spaces will always be stripped from individual elements.
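The combined effect of the bracket and separator parameters might look like the Python sketch below. It is an illustration under assumptions: parse_list is a hypothetical helper, and the set of characters stripped when brackets is None is a guess, since the text does not enumerate them.

```python
def parse_list(text, brackets="[]", separator=","):
    """Split a list string into stripped elements (hypothetical helper)."""
    text = text.strip()
    if brackets is None:
        # Assumption: these are the bracket characters that may be removed.
        text = text.lstrip("[({<").rstrip("])}>")
    else:
        opening, closing = brackets  # expects a 2-character string
        if text.startswith(opening) and text.endswith(closing):
            text = text[1:-1]
    if not text.strip():
        return []
    # Spaces are always stripped from individual elements.
    return [element.strip() for element in text.split(separator)]


default_case = parse_list("[a, b ,c]")
custom_case = parse_list("(1; 2; 3)", brackets="()", separator=";")
any_brackets = parse_list("{x|y}", brackets=None, separator="|")
```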
Separator to mark the decimal part.
Use "." or "," to indicate how decimal values are separated when parsing text strings
into numerical format. It is automatically assumed that the other character is used as
the thousands separator. E.g. "decimal": "." assumes that the period "." is used to
separate decimals and "," thousands, as in the number string “12,173.12”. Values must be one of "." or ",".
Separator to mark the thousands.
Use "." or "," to indicate how thousands are separated when parsing text strings
into numerical format. It is automatically assumed that the other character is used as
the decimal separator. E.g. "thousand": "." assumes that the period "." is used to
separate thousands and "," decimals, as in the number string “12.173,12”. Values must be one of "." or ",".
Desired semantic type of the converted data.
Convert data to the Date type with "type": "date". This will allow e.g. the extraction of particular
components of the date, like year, month, or day of week (with extract_date_components), the calculation of
elapsed time since a given date (time_interval), as well as enable the use of the Trends section in graphext’s
interface.
Format to parse date strings.
When input data contains strings (dates in text format), indicate how these strings are constructed.
E.g. if dates are in the format “21/07/2020”, use "format": “%d/%m/%Y” to indicate the day, month, year order and
the use of ”/” as the separator of date components. For more details on how to indicate the different
components of the date format see e.g. Python’s strftime.
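Since the format follows Python's strftime conventions, the example above can be reproduced directly with the standard library. This is only a sketch of the parsing idea; parse_date is a hypothetical helper, and None stands in for the missing value.

```python
from datetime import datetime


def parse_date(text, fmt):
    """Parse a date string with a strftime-style format (hypothetical helper)."""
    try:
        return datetime.strptime(text, fmt)
    except (TypeError, ValueError):
        # Strings that do not match the format become missing.
        return None


parsed = parse_date("21/07/2020", "%d/%m/%Y")
unparsed = parse_date("2020-07-21", "%d/%m/%Y")
```

The first call yields 21 July 2020; the second string does not match the day/month/year format and is therefore treated as missing.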
Unit of timestamp data.
When input data is numeric, indicates whether the numbers correspond to seconds, milliseconds, microseconds
or nanoseconds. Dates will be interpreted as so many elapsed units since the origin
(see origin parameter below). For example, with "unit": "ms" and "origin": "unix" (the default), this would calculate the date
corresponding to x milliseconds since 01/01/1970, where x denotes the input numbers. The value must name one of these four units.
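The interpretation of numeric input as elapsed units since the origin can be sketched in Python as follows. Only "ms" appears in the text above; the other unit codes ("s", "us", "ns") and the helper from_timestamp are assumptions made for this illustration.

```python
from datetime import datetime, timedelta, timezone

# The unix origin, 01/01/1970, expressed in UTC.
UNIX_ORIGIN = datetime(1970, 1, 1, tzinfo=timezone.utc)

# (multiplier, divisor) converting each unit to microseconds.
# Assumption: unit codes other than "ms" are illustrative guesses.
TO_MICROSECONDS = {
    "s": (1_000_000, 1),
    "ms": (1_000, 1),
    "us": (1, 1),
    "ns": (1, 1_000),
}


def from_timestamp(x, unit="ms"):
    """Return the date lying x units after the unix origin (hypothetical helper)."""
    multiplier, divisor = TO_MICROSECONDS[unit]
    return UNIX_ORIGIN + timedelta(microseconds=x * multiplier / divisor)


from_ms = from_timestamp(1_595_289_600_000, unit="ms")
from_s = from_timestamp(1_595_289_600, unit="s")
```

Both values describe the same instant, midnight UTC on 21 July 2020, expressed once in milliseconds and once in seconds.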
Desired semantic type of the converted data.
Convert data to the Date type with "type": "date". This will allow e.g. the extraction of particular
components of the date, like year, month, or day of week (with extract_date_components), the calculation of
elapsed time since a given date (time_interval), as well as enable the use of the Trends section in graphext’s
interface.
A 2-character string identifying the opening and closing brackets used to identify list strings.
For example "[]" or "()". If null, any possible bracket characters at the beginning and end of a
string will be removed before parsing the elements.
Separation character for split strings.
Which separation character to use to split input string into list elements.
Note that spaces will always be stripped from individual elements.
Format to parse date strings.
When input data contains strings (dates in text format), indicate how these strings are constructed.
E.g. if dates are in the format “21/07/2020”, use "format": “%d/%m/%Y” to indicate the day, month, year order and
the use of ”/” as the separator of date components. For more details on how to indicate the different
components of the date format see e.g. Python’s strftime.
Unit of timestamp data.
When input data is numeric, indicates whether the numbers correspond to seconds, milliseconds, microseconds
or nanoseconds. Dates will be interpreted as so many elapsed units since the origin
(see origin parameter below). For example, with "unit": "ms" and "origin": "unix" (the default), this would calculate the date
corresponding to x milliseconds since 01/01/1970, where x denotes the input numbers. The value must name one of these four units.
Desired semantic type of the converted data.
Convert data to the Text type with "type": "text". This allows the resulting column to
be used e.g. in steps involving natural language processing (NLP).
Desired semantic type of the converted data.
Convert data to the Category type with "type": "category". This will influence how the
column is presented in graphext’s interface, and enables the use of steps like trim_frequencies,
merge_categories etc. When converting from list[category], elements will be joined using
the specified separator.
Separation character used to split strings and to join elements.
Which separator to use to join elements when converting from list[category] to category.
Note that spaces will always be stripped from individual elements.
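Joining the elements of a list into a single category value could be pictured as in this short Python sketch. The helper join_categories is hypothetical and only mirrors the description above.

```python
def join_categories(elements, separator=","):
    """Join list elements into one category string (hypothetical helper)."""
    # Spaces are stripped from individual elements before joining.
    return separator.join(str(element).strip() for element in elements)


joined = join_categories(["red ", " green", "blue"], separator="|")
```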
Desired semantic type of the converted data.
Convert data to a list of Category values with "type": "list[category]". This will influence how the
column is presented in graphext’s interface, and enables the use of steps like trim_frequencies, merge_categories etc.
A 2-character string identifying the opening and closing brackets used to identify list strings.
For example "[]" or "()". If null, any possible bracket characters at the beginning and end of a
string will be removed before parsing the elements.
Separation character for split strings.
Which separation character to use to split input string into list elements.
Note that spaces will always be stripped from individual elements.
Desired semantic type of the converted data.
Convert data to the Url type with "type": "url". This will allow e.g. fetching of any textual
content found at the specified Url (with fetch_url_content), or linking of a network node
in the interface to the given website (configure_node_url).
A 2-character string identifying the opening and closing brackets used to identify list strings.
For example "[]" or "()". If null, any possible bracket characters at the beginning and end of a
string will be removed before parsing the elements.
Separation character for split strings.
Which separation character to use to split input string into list elements.
Note that spaces will always be stripped from individual elements.
Desired semantic type of the converted data.
Convert data to the Sex type with "type": "sex". This is essentially a categorical type
with two predefined values for male and female. Use parse_labels to configure how
raw input values should be interpreted as female or male.
Mapping of raw values to female and male categories.
An object of the form {"female": "female_value", "male": "male_value"} that tells the parser
which raw values in the input data should be interpreted as female and as male.
For example, {"female": "woman", "male": "man"} will parse woman as female and man as male.
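A minimal Python sketch of this mapping is shown below. It assumes that raw values not covered by parse_labels become missing (NaN), in line with the general behaviour of the step; the function parse_sex is a hypothetical helper.

```python
import math


def parse_sex(values, parse_labels):
    """Map raw values to "female"/"male" using parse_labels (hypothetical helper)."""
    # Invert {"female": "woman", "male": "man"} into {"woman": "female", ...}.
    lookup = {raw: label for label, raw in parse_labels.items()}
    # Assumption: raw values without a mapping become the missing value.
    return [lookup.get(value, math.nan) for value in values]


labels = parse_sex(["woman", "man", "unknown"], {"female": "woman", "male": "man"})
```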
Desired semantic type of the converted data.
Convert data to a list of Sex values with "type": "list[sex]".
Use parse_labels to configure how raw input values are interpreted.
A 2-character string identifying the opening and closing brackets used to identify list strings.
For example "[]" or "()". If null, any possible bracket characters at the beginning and end of a
string will be removed before parsing the elements.
Separation character for split strings.
Which separation character to use to split input string into list elements.
Note that spaces will always be stripped from individual elements.
Mapping of raw values to female and male categories.
An object of the form {"female": "female_value", "male": "male_value"} that tells the parser
which raw values in the input data should be interpreted as female and as male.
For example, {"female": "woman", "male": "man"} will parse woman as female and man as male.
Desired semantic type of the converted data.
Convert data to the Boolean (logical) type with "type": "boolean". If the input data is numeric,
0s will be treated as False and all other values as True. If the input data contains text strings,
certain recognized values, in lower- or uppercase, will be interpreted as True, and certain other
values as False. Any remaining values will be converted to NaN (missing).
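The rules above can be sketched in Python as follows. The text strings recognized as True or False are not listed here, so the defaults "true" and "false" below are placeholders chosen for illustration, and to_boolean is a hypothetical helper.

```python
import math


def to_boolean(value, true_labels=("true",), false_labels=("false",)):
    """Cast one value to a boolean, or NaN if it cannot be interpreted.

    Hypothetical helper; the default label sets are placeholders only.
    """
    if isinstance(value, bool):
        return value
    if isinstance(value, (int, float)):
        # Numeric input: 0 is False, every other number is True.
        return value != 0
    if isinstance(value, str):
        lowered = value.strip().lower()  # accept lower- or uppercase
        if lowered in true_labels:
            return True
        if lowered in false_labels:
            return False
    # Any remaining value becomes missing.
    return math.nan


results = [to_boolean(v) for v in [0, 2.5, "TRUE", "false", "maybe"]]
```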
Desired semantic type of the converted data.
Convert data to a list of Boolean (logical) values with "type": "list[boolean]". If the input data is numeric, 0s will be treated as False and all other values as True. If the input data contains text strings, certain recognized values, in lower- or uppercase, will be interpreted as True, and certain other values as False. Any remaining values will be converted to NaN (missing).
A 2-character string identifying the opening and closing brackets used to identify list strings.
For example "[]" or "()". If null, any possible bracket characters at the beginning and end of a
string will be removed before parsing the elements.
Separation character for split strings.
Which separation character to use to split input string into list elements.
Note that spaces will always be stripped from individual elements.