Creates a new network dataset in which nodes (rows) represent the original dataset’s variables (its columns) and links represent dependencies between those variables (i.e. associations or correlations). By default, the step measures the “mutual information” between variables.

In effect, the step calculates the pairwise “correlation” between every two columns of the original dataset. A threshold is then applied to keep only the strongest (most interesting) “correlations”, and each of these becomes a network link between the nodes representing the two variables. Each node/row in the new dataset also contains information about its correlation with all other nodes (variables).
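The procedure described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the step's actual implementation: the function names (mutual_information, build_variable_network), the threshold value and the toy data are all hypothetical, and the sketch handles categorical values only (quantitative variables would first need to be binned).

```python
import math
from collections import Counter
from itertools import combinations


def mutual_information(xs, ys):
    """Mutual information (in nats) between two equally long discrete sequences."""
    n = len(xs)
    px = Counter(xs)
    py = Counter(ys)
    pxy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), count in pxy.items():
        # p(x, y) * log(p(x, y) / (p(x) * p(y))), written with raw counts
        mi += (count / n) * math.log(count * n / (px[x] * py[y]))
    return mi


def build_variable_network(columns, threshold):
    """Hypothetical helper: one node per column, one link per pair scoring above threshold."""
    nodes = list(columns)
    links = []
    for a, b in combinations(nodes, 2):
        score = mutual_information(columns[a], columns[b])
        if score > threshold:
            links.append((a, b, score))
    return nodes, links


# Toy dataset (assumed for illustration): column name -> list of values
columns = {
    "plan": ["free", "free", "pro", "pro", "free", "pro"],
    "churned": ["yes", "yes", "no", "no", "yes", "no"],
    "region": ["eu", "us", "eu", "us", "us", "eu"],
}

nodes, links = build_variable_network(columns, threshold=0.1)
for source, target, weight in links:
    print(source, target, round(weight, 3))
```

In this toy data, "plan" and "churned" determine each other exactly, so they are the only pair linked; "region" is only weakly related to either and falls below the assumed threshold.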

Note: in this first version, only quantitative and categorical variables are analyzed (tags, lists, embeddings etc. are not). You can pass a dataset with arbitrary column types, but columns not supported by the selected correlation method will be ignored and will not appear in the result.
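As a rough illustration of this filtering, the sketch below keeps only columns whose values are all numbers or all plain strings and drops everything else. The helper name supported_columns and the type check are assumptions made for the example; the step's real rules for deciding which column types a correlation method supports may differ.

```python
from numbers import Number


def supported_columns(columns):
    """Hypothetical filter: keep quantitative (numeric) and categorical (string) columns."""
    kept = {}
    for name, values in columns.items():
        is_quantitative = all(isinstance(v, Number) for v in values)
        is_categorical = all(isinstance(v, str) for v in values)
        if is_quantitative or is_categorical:
            kept[name] = values
    return kept


# Toy dataset (assumed for illustration) mixing supported and unsupported types
columns = {
    "age": [34, 51, 27],
    "plan": ["free", "pro", "free"],
    "tags": [["a", "b"], [], ["c"]],
    "embedding": [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],
}

kept = supported_columns(columns)
print(list(kept))
```

Here the list-valued "tags" and "embedding" columns are silently skipped, mirroring the behaviour described in the note.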

Usage

The following examples show how the step can be used in a recipe.

Inputs & Outputs

The following are the inputs expected by the step and the outputs it produces. These are generally columns (ds.first_name), datasets (ds or ds[["first_name", "last_name"]]) or models (referenced by name, e.g. "churn-clf").

Configuration

The following parameters can be used to configure the behaviour of the step. Include them in a JSON object passed as the last “input” to the step, i.e. step(..., {"param": "value", ...}) -> (output).