cast
Interprets and changes a column’s data to another (semantic) type.
This has two consequences:
- It allows the resulting column to be used by steps that only accept the new type, e.g. casting a column of concatenated texts to the "url" type so that it may be used where URLs are expected (e.g. the step fetch_url_content).
- It changes any values that do not conform to the new type to the missing value (NaN). E.g. casting a column of mixed data containing numbers to the "number" type will replace all values that cannot be read as numbers with NaN.
Note that for each possible type a column can be cast to (via the "type" parameter, e.g. "number", "category" etc.), the step accepts different configuration parameters. See the subsections under Parameters below for further details.
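The second consequence (non-conformant values become NaN) can be illustrated with a minimal Python sketch. This is not the step's implementation; the function name cast_to_number is hypothetical and only mimics the described behaviour for the "number" type.

```python
import math

def cast_to_number(values):
    """Return floats, replacing anything that cannot be read as a number with NaN."""
    result = []
    for value in values:
        try:
            result.append(float(value))
        except (TypeError, ValueError):
            # Non-conformant values become the missing value.
            result.append(math.nan)
    return result

mixed = ["12", 3.5, "n/a", None, "7e2"]
converted = cast_to_number(mixed)
```

Here "n/a" and None cannot be read as numbers, so they end up as NaN while the remaining entries are kept as floats.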
Usage
The following example shows how the step can be used in a recipe.
E.g. to simply convert a text column to a category column, pass {"type": "category"} as the step's configuration.
General syntax for using the step in a recipe, showing the inputs the step expects and the outputs it produces. For further details see the sections below.
Inputs & Outputs
The following are the inputs expected by the step and the outputs it produces. These are generally columns (ds.first_name), datasets (ds or ds[["first_name", "last_name"]]) or models (referenced by name, e.g. "churn-clf").
Configuration
The following parameters can be used to configure the behaviour of the step by including them in a JSON object as the last "input" to the step, i.e. step(..., {"param": "value", ...}) -> (output).
Desired semantic type of the converted data.
Make data numerical with "type": "number".
Separator to mark the decimal part.
Use "." or "," to indicate how decimal values are separated when parsing text strings into numerical format. It is automatically assumed that the other character is used as the thousands separator. E.g. "decimal": "." assumes that the period "." is used to separate decimals and "," thousands, as in the number string "12,173.12".
Values must be one of the following: "." or ",".
Separator to mark the thousands.
Use "." or "," to indicate how thousands are separated when parsing text strings into numerical format. It is automatically assumed that the other character is used as the decimal separator. E.g. "thousand": "." assumes that the period "." is used to separate thousands and "," decimals, as in the number string "12.173,12".
Values must be one of the following: "." or ",".
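The interplay of the decimal and thousands separators can be sketched in Python as follows. This is an illustration under stated assumptions, not the step's code: parse_number is a hypothetical helper, and only its decimal argument mirrors the "decimal" parameter described above.

```python
import math

def parse_number(text, decimal="."):
    """Parse a number string given its decimal separator ("." or ",")."""
    if decimal not in (".", ","):
        raise ValueError('decimal must be "." or ","')
    # The other character is assumed to be the thousands separator.
    thousand = "," if decimal == "." else "."
    cleaned = text.strip().replace(thousand, "").replace(decimal, ".")
    try:
        return float(cleaned)
    except ValueError:
        # Strings that still cannot be read as numbers become missing.
        return math.nan

english_style = parse_number("12,173.12", decimal=".")
continental_style = parse_number("12.173,12", decimal=",")
```

Both calls yield the same float, since each string encodes twelve thousand one hundred seventy-three and twelve hundredths in a different convention.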
Desired semantic type of the converted data.
Make data numerical with "type": "list[number]".
A 2-character string identifying the opening and closing brackets used to identify list strings.
For example "[]", "()", "" etc. If null, any possible bracket characters at the beginning and end of a string will be removed before parsing the elements.
Separation character for split strings. Which separation character to use to split the input string into list elements. Note that spaces will always be stripped from individual elements.
Separator to mark the decimal part.
Use "." or "," to indicate how decimal values are separated when parsing text strings into numerical format. It is automatically assumed that the other character is used as the thousands separator. E.g. "decimal": "." assumes that the period "." is used to separate decimals and "," thousands, as in the number string "12,173.12".
Values must be one of the following: "." or ",".
Separator to mark the thousands.
Use "." or "," to indicate how thousands are separated when parsing text strings into numerical format. It is automatically assumed that the other character is used as the decimal separator. E.g. "thousand": "." assumes that the period "." is used to separate thousands and "," decimals, as in the number string "12.173,12".
Values must be one of the following: "." or ",".
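How a list string might be split using a bracket pair and a separation character can be sketched in Python. The helper split_list_string and its argument names brackets and separator are hypothetical (the parameter names are not shown in this section), and the set of characters stripped when brackets is None is an assumption.

```python
def split_list_string(text, brackets="[]", separator=","):
    """Split a string such as "[1, 2, 3]" into a list of stripped elements."""
    text = text.strip()
    if brackets is None:
        # Assumed set of "possible bracket characters" to remove at both ends.
        text = text.strip("[](){}<>")
    else:
        opening, closing = brackets  # a 2-character string, e.g. "[]"
        if text.startswith(opening):
            text = text[len(opening):]
        if text.endswith(closing):
            text = text[:-len(closing)]
    # Spaces are always stripped from individual elements; empty ones are dropped.
    return [element.strip() for element in text.split(separator) if element.strip()]

default_split = split_list_string("[1, 2, 3]")
custom_split = split_list_string("(a; b)", brackets="()", separator=";")
any_brackets = split_list_string("{x|y}", brackets=None, separator="|")
```

Each resulting element would then be parsed according to the element type, e.g. as a number for "list[number]".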
Desired semantic type of the converted data.
Make data a currency with "type": "currency".
Separator to mark the decimal part.
Use "." or "," to indicate how decimal values are separated when parsing text strings into numerical format. It is automatically assumed that the other character is used as the thousands separator. E.g. "decimal": "." assumes that the period "." is used to separate decimals and "," thousands, as in the number string "12,173.12".
Values must be one of the following: "." or ",".
Separator to mark the thousands.
Use "." or "," to indicate how thousands are separated when parsing text strings into numerical format. It is automatically assumed that the other character is used as the decimal separator. E.g. "thousand": "." assumes that the period "." is used to separate thousands and "," decimals, as in the number string "12.173,12".
Values must be one of the following: "." or ",".
Desired semantic type of the converted data.
Make data a currency with "type": "list[currency]".
A 2-character string identifying the opening and closing brackets used to identify list strings.
For example "[]", "()", "" etc. If null, any possible bracket characters at the beginning and end of a string will be removed before parsing the elements.
Separation character for split strings. Which separation character to use to split the input string into list elements. Note that spaces will always be stripped from individual elements.
Separator to mark the decimal part.
Use "." or "," to indicate how decimal values are separated when parsing text strings into numerical format. It is automatically assumed that the other character is used as the thousands separator. E.g. "decimal": "." assumes that the period "." is used to separate decimals and "," thousands, as in the number string "12,173.12".
Values must be one of the following: "." or ",".
Separator to mark the thousands.
Use "." or "," to indicate how thousands are separated when parsing text strings into numerical format. It is automatically assumed that the other character is used as the decimal separator. E.g. "thousand": "." assumes that the period "." is used to separate thousands and "," decimals, as in the number string "12.173,12".
Values must be one of the following: "." or ",".
Desired semantic type of the converted data.
Convert data to the Date type with "type": "date". This will allow e.g. the extraction of particular components of the date, like year, month, or day of week (with extract_date_components), the calculation of elapsed time since a given date (time_interval), as well as enable the use of the Trends section in Graphext’s interface.
Format to parse date strings.
When input data contains strings (dates in text format), indicate how these strings are constructed. E.g. if dates are in the format "21/07/2020", use "format": "%d/%m/%Y" to indicate the day, month, year order and the use of "/" as the separator of date components. For more details on how to indicate the different components of the date format see e.g. Python’s strftime.
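Since the format codes follow Python's strftime conventions, the effect of a format string can be tried out directly with the standard library. The sketch below is only an illustration: parse_date is a hypothetical helper, and None stands in for the missing value.

```python
from datetime import datetime

def parse_date(text, fmt):
    """Parse a date string with a strftime-style format, or return None if it does not match."""
    try:
        return datetime.strptime(text, fmt)
    except (TypeError, ValueError):
        return None

parsed = parse_date("21/07/2020", "%d/%m/%Y")
mismatch = parse_date("2020-07-21", "%d/%m/%Y")
```

The first call yields 21 July 2020, while the second string does not follow the day/month/year pattern and is therefore treated as missing.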
Unit of timestamp data.
When input data is numeric, indicates whether the numbers correspond to days, seconds, milliseconds, microseconds or nanoseconds. Dates will be interpreted as so many elapsed units since the origin (see origin parameter below).
For example, with "unit": "ms" and "origin": "unix" (the default), this would calculate the date corresponding to x milliseconds since 01/01/1970, where x denotes the input numbers.
Values must be one of the following: D, s, ms, us, ns.
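The conversion of numeric timestamps can be sketched with Python's standard library. This is a sketch under assumptions, not the step's implementation: timestamp_to_date is a hypothetical helper, "D" is taken to mean days, and the origin is fixed to the Unix epoch in UTC.

```python
from datetime import datetime, timedelta, timezone

# Assumed origin: the Unix epoch, 01/01/1970, in UTC.
UNIX_ORIGIN = datetime(1970, 1, 1, tzinfo=timezone.utc)

# Length of each unit expressed in microseconds.
MICROSECONDS_PER_UNIT = {"D": 86_400_000_000, "s": 1_000_000, "ms": 1_000, "us": 1}

def timestamp_to_date(value, unit, origin=UNIX_ORIGIN):
    """Interpret an integer as `value` elapsed units since `origin`."""
    if unit == "ns":
        # timedelta only has microsecond resolution, so nanoseconds are truncated.
        microseconds = value // 1_000
    else:
        microseconds = value * MICROSECONDS_PER_UNIT[unit]
    return origin + timedelta(microseconds=microseconds)

from_ms = timestamp_to_date(1_595_289_600_000, "ms")
from_days = timestamp_to_date(18_464, "D")
```

Both calls describe the same instant, midnight UTC on 21 July 2020, once in milliseconds and once in days since the origin.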
Desired semantic type of the converted data.
Convert data to the Date type with "type": "list[date]". This will allow e.g. the extraction of particular components of the date, like year, month, or day of week (with extract_date_components), the calculation of elapsed time since a given date (time_interval), as well as enable the use of the Trends section in Graphext’s interface.
A 2-character string identifying the opening and closing brackets used to identify list strings.
For example "[]", "()", "" etc. If null, any possible bracket characters at the beginning and end of a string will be removed before parsing the elements.
Separation character for split strings. Which separation character to use to split the input string into list elements. Note that spaces will always be stripped from individual elements.
Format to parse date strings.
When input data contains strings (dates in text format), indicate how these strings are constructed. E.g. if dates are in the format "21/07/2020", use "format": "%d/%m/%Y" to indicate the day, month, year order and the use of "/" as the separator of date components. For more details on how to indicate the different components of the date format see e.g. Python’s strftime.
Unit of timestamp data.
When input data is numeric, indicates whether the numbers correspond to days, seconds, milliseconds, microseconds or nanoseconds. Dates will be interpreted as so many elapsed units since the origin (see origin parameter below).
For example, with "unit": "ms" and "origin": "unix" (the default), this would calculate the date corresponding to x milliseconds since 01/01/1970, where x denotes the input numbers.
Values must be one of the following: D, s, ms, us, ns.
Desired semantic type of the converted data.
Convert data to the Text type with "type": "text". This allows the resulting column to be used e.g. in steps involving natural language processing (NLP).
Desired semantic type of the converted data.
Convert data to the Category type with "type": "category". This will influence how the column is presented in Graphext’s interface, and enables the use of steps like trim_frequencies, merge_categories etc. When converting from list[category], elements will be joined using the specified separator.
Separation character used to split strings and join elements. Which separator to use to join elements when converting from list[category] to category. Note that spaces will always be stripped from individual elements.
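Joining the elements of a list[category] value into a single category can be pictured with the following Python sketch. The helper join_categories is hypothetical, and the default separator shown here is an assumption, since the step's default is not stated in this section.

```python
def join_categories(elements, separator=","):
    """Join list elements into one category label, stripping surrounding spaces."""
    return separator.join(element.strip() for element in elements)

label = join_categories([" red", "green ", "blue"], separator="|")
```

With "|" as the separator, the three elements collapse into the single label red|green|blue.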
Desired semantic type of the converted data.
Convert data to the Category type with "type": "list[category]". This will influence how the column is presented in Graphext’s interface, and enables the use of steps like trim_frequencies, merge_categories etc.
A 2-character string identifying the opening and closing brackets used to identify list strings.
For example "[]", "()", "" etc. If null, any possible bracket characters at the beginning and end of a string will be removed before parsing the elements.
Separation character for split strings. Which separation character to use to split the input string into list elements. Note that spaces will always be stripped from individual elements.
Desired semantic type of the converted data.
Convert data to the Url type with "type": "url". This will allow e.g. fetching of any textual content found at the specified Url (with fetch_url_content), or linking of a network node in the interface to the given website (configure_node_url).
Desired semantic type of the converted data.
Convert data to the Url type with "type": "list[url]".
A 2-character string identifying the opening and closing brackets used to identify list strings.
For example "[]", "()", "" etc. If null, any possible bracket characters at the beginning and end of a string will be removed before parsing the elements.
Separation character for split strings. Which separation character to use to split the input string into list elements. Note that spaces will always be stripped from individual elements.
Desired semantic type of the converted data.
Convert data to the Sex type with "type": "sex". This is essentially a categorical type with two predefined values for male and female. How the two categories are detected/parsed in raw data, and which label is used to represent them, can be configured with the parameters below.
The labels used to identify female and male categories.
An object of the form {"female": "female_label", "male": "male_label"}, indicating how to represent each sex in the data. E.g. as F/M or ♀️/♂️ etc.
Desired semantic type of the converted data.
Convert data to the Sex type with "type": "list[sex]".
A 2-character string identifying the opening and closing brackets used to identify list strings.
For example "[]", "()", "" etc. If null, any possible bracket characters at the beginning and end of a string will be removed before parsing the elements.
Separation character for split strings. Which separation character to use to split the input string into list elements. Note that spaces will always be stripped from individual elements.
The labels used to identify female and male categories.
An object of the form {"female": "female_label", "male": "male_label"}, indicating how to represent each sex in the data. E.g. as F/M or ♀️/♂️ etc.
Desired semantic type of the converted data.
Convert data to the Boolean (logical) type with "type": "boolean". If the input data is numeric, 0s will be treated as False and all other values as True. If the input data contains text strings, the values in lower- or uppercase will be interpreted as True, and the values as False. Any remaining values will be converted to NaN (missing).
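The rules for the Boolean cast can be sketched in Python as below. This is an illustration only: cast_to_boolean is a hypothetical helper, and the two sets of accepted strings are placeholders, because the exact strings recognised by the step are not listed in this section. Keeping a numeric NaN as missing is likewise an assumption.

```python
import math

# Hypothetical token sets: the strings actually accepted by the step are not listed here.
TRUE_STRINGS = {"true", "yes"}
FALSE_STRINGS = {"false", "no"}

def cast_to_boolean(value):
    """Map numbers and recognised strings to True/False, everything else to NaN."""
    if isinstance(value, bool):
        return value
    if isinstance(value, (int, float)):
        if isinstance(value, float) and math.isnan(value):
            return math.nan  # assumption: missing input stays missing
        return value != 0    # 0 is False, every other number is True
    if isinstance(value, str):
        token = value.strip().lower()  # lower- and uppercase are treated alike
        if token in TRUE_STRINGS:
            return True
        if token in FALSE_STRINGS:
            return False
    return math.nan

results = [cast_to_boolean(v) for v in [0, 2.5, "YES", "false", "maybe"]]
```

The last input matches neither set and is therefore converted to the missing value.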
Desired semantic type of the converted data.
Convert data to the Boolean (logical) type with "type": "list[boolean]". If the input data is numeric, 0s will be treated as False and all other values as True. If the input data contains text strings, the values in lower- or uppercase will be interpreted as True, and the values as False. Any remaining values will be converted to NaN (missing).
A 2-character string identifying the opening and closing brackets used to identify list strings.
For example "[]", "()", "" etc. If null, any possible bracket characters at the beginning and end of a string will be removed before parsing the elements.
Separation character for split strings. Which separation character to use to split the input string into list elements. Note that spaces will always be stripped from individual elements.