link_similar_columns

Creates a new network dataset in which nodes (rows) represent the original dataset’s variables (its columns), and links represent dependencies between variables (i.e. associations/correlations). By default the step measures the “mutual information” between variables. In effect, all pairwise “correlations” between the original dataset’s columns are calculated, and a threshold is then applied to keep only the largest (most interesting) ones. These are translated into network links between the nodes representing the original variables. Each node/row in the new dataset also contains information about its correlation with all other nodes (variables). Note: in this first version, only quantitative and categorical variables are analyzed (not tags, lists, embeddings etc.). You can pass a dataset with arbitrary column types, but columns not supported by the selected correlation method will be ignored in the result.
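To make the idea concrete, the following Python sketch builds links between columns of a small, made-up table. It is only an illustration of the description above, not the step's implementation: the helper names (pearson, column_links), the toy data, and the choice of the absolute Pearson coefficient as the link weight are all assumptions made for the example.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def column_links(columns, min_similarity=0.0):
    """Return (source, target, weight) links between sufficiently similar columns."""
    links = []
    for a, b in combinations(columns, 2):
        # Assumption: the absolute correlation is used as the link weight.
        weight = abs(pearson(columns[a], columns[b]))
        if weight > min_similarity:
            links.append((a, b, round(weight, 3)))
    return links


# Hypothetical toy dataset: column name -> values.
columns = {
    "height": [150, 160, 170, 180, 190],
    "weight": [50, 58, 71, 79, 92],
    "shoe_size": [40, 36, 43, 38, 41],
}
links = column_links(columns, min_similarity=0.5)
```

With a threshold of 0.5, only the strongly related pair (height, weight) becomes a link in this toy data; the weaker pairs involving shoe_size are dropped.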

Usage

The following examples show how the step can be used in a recipe.

Examples

The following example calculates all correlations without filtering, leading to a fully connected correlation network unless some correlations are exactly 0.

link_similar_columns(ds) -> (corrs)

Inputs & Outputs

The following are the inputs expected by the step and the outputs it produces. These are generally columns (ds.first_name), datasets (ds or ds[["first_name", "last_name"]]) or models (referenced by name e.g. "churn-clf").

Inputs

Outputs

Configuration

The following parameters can be used to configure the behaviour of the step by including them in a JSON object as the last “input” to the step, i.e. step(..., {"param": "value", ...}) -> (output).

Parameters

method

string

default:"mutual_information"

Correlation method/statistic. Which statistic to use to measure variable association/correlation.

The default is "mutual_information", which applies scikit-learn's k-nearest neighbors implementations. It supports both categorical and quantitative variables and is relatively fast, but doesn’t have a natural upper bound (i.e. values are not in the range [0, 1]).

The "distance_correlation" method also supports both categorical and quantitative variables, and has a natural upper bound of 1. It’s relatively slow though, so make sure to select a reasonable value for n_samples.

"distance_correlation_fast" uses an optimized implementation of distance correlation, but only supports quantitative variables.

"pearson" is the standard Pearson correlation coefficient, which also only supports quantitative variables.

Lastly, "predictive_power" calculates a version of the predictive power score (PPS). This essentially fits a decision tree to predict variable y using only variable x as a predictor, and measures its performance relative to a dummy/baseline prediction. It supports both categorical and quantitative variables.

Values must be one of the following:

mutual_information
distance_correlation
distance_correlation_fast
pearson
predictive_power
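The remark that mutual information has no natural upper bound can be checked with a small Python sketch. Note that this uses the simple plug-in (count-based) estimate for discrete variables, which is not the k-nearest neighbors estimator mentioned above; the function name and the data are made up for illustration.

```python
from collections import Counter
from math import log


def mutual_information(xs, ys):
    """Plug-in mutual information (in nats) between two discrete sequences."""
    n = len(xs)
    count_x, count_y = Counter(xs), Counter(ys)
    count_xy = Counter(zip(xs, ys))
    # Sum p(x, y) * log(p(x, y) / (p(x) * p(y))) over observed pairs.
    return sum(
        (c / n) * log(c * n / (count_x[x] * count_y[y]))
        for (x, y), c in count_xy.items()
    )


colours = ["red", "green", "blue", "grey"] * 25
# A variable shares all of its information with itself: MI equals its entropy, ln(4).
mi_self = mutual_information(colours, colours)
# A constant column carries no information about any other column.
mi_const = mutual_information(colours, ["x"] * 100)
```

Here mi_self is ln(4), roughly 1.39, which already exceeds 1. This is why an absolute threshold such as min_similarity needs to be chosen with the selected method's scale in mind.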

min_similarity

number

default:"0.0"

Absolute similarity threshold. The minimum “correlation” for the creation of a link between two variables.

min_similarity_quantile

number

default:"0.5"

Similarity threshold expressed as a quantile. E.g. a value of 0.6 means the bottom 60% of “correlations” will be discarded. Both minima (absolute and quantile) must be exceeded for a link to be created.

Values must be in the following range:

0.0 ≤ min_similarity_quantile ≤ 1
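How the absolute and the quantile threshold combine can be sketched in Python as follows. This is an illustration under assumptions, not the step's code: the quantile is computed here with linear interpolation between sorted values, "exceeded" is read as a strict comparison, and the helper names and weights are hypothetical.

```python
def quantile(values, q):
    """Quantile with linear interpolation between sorted values."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (pos - lower) * (ordered[upper] - ordered[lower])


def passes_thresholds(weights, min_similarity=0.0, min_similarity_quantile=0.5):
    """Keep only pairs whose weight exceeds both the absolute and quantile cutoffs."""
    cutoff = quantile(weights.values(), min_similarity_quantile)
    return {
        pair: w
        for pair, w in weights.items()
        if w > min_similarity and w > cutoff
    }


# Hypothetical pairwise similarities between four columns.
weights = {
    ("a", "b"): 0.9,
    ("a", "c"): 0.1,
    ("a", "d"): 0.4,
    ("b", "c"): 0.7,
    ("b", "d"): 0.2,
}
quantile_only = passes_thresholds(weights, min_similarity=0.0, min_similarity_quantile=0.6)
both = passes_thresholds(weights, min_similarity=0.8, min_similarity_quantile=0.6)
```

With a quantile of 0.6, the three weakest of the five pairs are discarded. Adding an absolute minimum of 0.8 then removes the pair with weight 0.7 as well, leaving a single link.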

missing_weight

[number, null]

Replacement for discarded correlations. The weight of links whose correlations don’t pass the minimum threshold. Links with weights of null will be discarded (the default behavior). Can be set e.g. to 0, to generate all possible links.
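The effect of missing_weight can be pictured with the short Python sketch below. The function and variable names are hypothetical; the logic simply follows the description above, where None plays the role of null.

```python
def apply_missing_weight(weights, kept_pairs, missing_weight=None):
    """Keep passing links as they are and re-weight or drop the others."""
    links = {}
    for pair, w in weights.items():
        if pair in kept_pairs:
            links[pair] = w
        elif missing_weight is not None:
            # Discarded correlations get the replacement weight instead.
            links[pair] = missing_weight
        # With missing_weight=None the link is dropped entirely.
    return links


# Hypothetical similarities, of which only one pair passed the thresholds.
weights = {("a", "b"): 0.9, ("a", "c"): 0.1, ("b", "c"): 0.3}
kept_pairs = {("a", "b")}
default_links = apply_missing_weight(weights, kept_pairs)
all_links = apply_missing_weight(weights, kept_pairs, missing_weight=0)
```

In this sketch default_links contains only the passing pair, while all_links contains every pair, with the two discarded ones carrying a weight of 0.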

n_samples

integer

default:"2000"

Number of samples. The maximum number of samples to use when measuring “correlations”.

top_n_links

[integer, null]

Maximum number of links per node. Ranks the links by weight and keeps only the most similar targets.
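One possible reading of this ranking is sketched in Python below. It assumes that links are grouped by their source node and that None (null) means no limit; how the step treats the two endpoints of a link is not stated here, so this is an interpretation, and all names are hypothetical.

```python
from collections import defaultdict


def top_links_per_node(links, top_n_links=None):
    """Group (source, target, weight) links by source and keep the heaviest ones."""
    by_source = defaultdict(list)
    for source, target, weight in links:
        by_source[source].append((target, weight))
    return {
        # Slicing with None keeps every target, i.e. no limit is applied.
        source: sorted(targets, key=lambda t: t[1], reverse=True)[:top_n_links]
        for source, targets in by_source.items()
    }


# Hypothetical links between columns.
links = [
    ("age", "income", 0.6),
    ("age", "height", 0.2),
    ("age", "tenure", 0.8),
    ("income", "tenure", 0.5),
]
top = top_links_per_node(links, top_n_links=2)
```

For the node age, only its two most similar targets (tenure and income) remain; the weaker link to height is dropped.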

random_seed

integer

default:"0"

The random seed to use, if applicable to the selected method.
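As a rough picture of how a sample limit and a seed can work together, the Python sketch below draws at most n_samples rows reproducibly. Whether the step samples rows this way, and whether random_seed controls that sampling, is an assumption made for the illustration; sample_rows is a hypothetical helper.

```python
import random


def sample_rows(rows, n_samples=2000, random_seed=0):
    """Return at most n_samples rows, drawn reproducibly for a given seed."""
    if len(rows) <= n_samples:
        return list(rows)
    # A dedicated generator keeps the draw independent of global random state.
    return random.Random(random_seed).sample(rows, n_samples)


rows = list(range(10_000))
first = sample_rows(rows)
second = sample_rows(rows)
small = sample_rows(rows[:50])
```

Calling the function twice with the same seed yields the same sample, and inputs smaller than n_samples are returned unchanged.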
