# Link similar columns

network • correlation

Calculates all pair-wise column dependencies (by default mutual information).

Create a new network dataset where nodes (rows) represent the original dataset's variables (its columns), and links represent dependencies between variables (i.e. associations/correlations). By default the step measures "mutual information" between variables.

In effect, all pair-wise "correlations" between the original dataset's columns are calculated. A threshold is then applied to extract only the largest (most interesting) "correlations". These are then translated into network links between nodes representing the original variables. Each node/row in the new dataset will also contain information about its correlation with all other nodes (variables).

Note: in this first version, only quantitative and categorical variables will be analyzed (but not tags, lists, embeddings etc.). You can pass a dataset with arbitrary column types, but those not supported by the selected correlation method will be ignored in the result.

## Usage

The following are the step's expected inputs and outputs and their specific types.

```
link_similar_columns(ds_in: dataset, {
"param": value
}) -> (ds_out: dataset)
```

where the object `{"param": value}` is optional in most cases and, if present, may contain any of the parameters described in the corresponding section below.

#### Example

The following example calculates all correlations without filtering, leading to a fully connected correlation network unless some correlations are exactly 0.

```
link_similar_columns(ds) -> (corrs)
```

## More examples

The following example removes correlations below the 75th percentile, creating a network that connects only the most correlated variables.

```
link_similar_columns(ds, {
"min_similarity_quantile": 0.75,
}) -> (corrs)
```

## Inputs

ds_in: dataset

Input dataset containing arbitrary columns to calculate "correlations" for.

## Outputs

ds_out: dataset

A dataset containing M rows and M+2 columns (where M is the number of columns in the input dataset). Each row represents a variable in the original dataset, and the columns contain the "correlations" with the remaining variables. An additional 2 columns ("targets" and "weights") contain links connecting original variables to other variables they're correlated with.
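The output layout described above can be sketched in plain Python. This is only an illustration under assumptions, not the step's implementation: `build_network_rows` and `abs_pearson` are hypothetical helpers, absolute Pearson correlation stands in for the real statistic, the self-correlation cell is left empty, and storing `targets` and `weights` as parallel lists is a guess at the link format.

```python
from itertools import combinations
from math import sqrt


def abs_pearson(x, y):
    """Absolute Pearson correlation, used here only as a stand-in score."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return abs(cov) / sqrt(var_x * var_y)


def build_network_rows(columns, score=abs_pearson, min_score=0.0):
    """Return one row per input column: M score cells plus targets and weights."""
    names = list(columns)
    # Symmetric score table; the diagonal (a column with itself) stays None.
    scores = {a: {b: None for b in names} for a in names}
    for a, b in combinations(names, 2):
        scores[a][b] = scores[b][a] = score(columns[a], columns[b])

    rows = {}
    for a in names:
        # Links are the scores above the threshold, strongest first.
        links = sorted(
            ((b, s) for b, s in scores[a].items() if s is not None and s > min_score),
            key=lambda link: link[1],
            reverse=True,
        )
        rows[a] = {
            **scores[a],
            "targets": [b for b, _ in links],
            "weights": [s for _, s in links],
        }
    return rows


data = {"a": [1, 2, 3, 4, 5], "b": [2, 4, 6, 8, 11], "c": [5, 3, 4, 1, 2]}
rows = build_network_rows(data)
```

With three input columns, `rows` has three entries of five cells each (M = 3, so M+2 = 5).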

## Parameters

method: string = "mutual_information"

Correlation method/statistic. Which statistic to use to measure variable association/correlation.

The default is `"mutual_information"`, which applies scikit-learn's k-nearest-neighbors implementations. It supports both categorical and quantitative variables and is relatively fast, but it doesn't have a natural upper bound (i.e. values are not in the range [0, 1]).
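To see why mutual information has no upper bound of 1, here is a minimal sketch for discrete variables. It uses a simple plug-in (frequency-count) estimate, which is not the k-nearest-neighbors estimator the step relies on; the function name and the toy data are illustrative assumptions.

```python
from collections import Counter
from math import log


def mutual_information(x, y):
    """Plug-in mutual information (in nats) between two discrete sequences."""
    n = len(x)
    count_x, count_y, count_xy = Counter(x), Counter(y), Counter(zip(x, y))
    # Sum of p(a, b) * log(p(a, b) / (p(a) * p(b))) over observed pairs.
    return sum(
        (c / n) * log(c * n / (count_x[a] * count_y[b]))
        for (a, b), c in count_xy.items()
    )


# Four equally frequent categories, 96 observations.
colour = ["red", "green", "blue", "grey"] * 24
# Two categories arranged so that they are independent of colour.
size = (["small"] * 4 + ["large"] * 4) * 12

mi_self = mutual_information(colour, colour)   # equals ln(4), which is above 1
mi_indep = mutual_information(colour, size)    # independent variables give 0
```

The score of a variable with itself equals its entropy, so it grows with the number of categories, which is why absolute thresholds on mutual information are harder to interpret than on bounded statistics.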

The `"distance_correlation"` method also supports both categorical and quantitative variables, and has a natural upper bound of 1. It's relatively slow though, so make sure to select a reasonable value for `n_samples`.
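For intuition, the following sketch computes the textbook sample distance correlation for two quantitative variables in plain Python. It is an assumption-laden illustration, not the step's code: it handles only one-dimensional numeric input, and the helper names are hypothetical. It does show the two properties mentioned above: values stay within [0, 1], and the pairwise distance matrices make the cost grow quadratically with the number of samples.

```python
from math import sqrt


def _centered_distances(values):
    """Pairwise |a - b| matrix with row, column and grand means removed."""
    n = len(values)
    dist = [[abs(a - b) for b in values] for a in values]
    row_means = [sum(row) / n for row in dist]
    grand_mean = sum(row_means) / n
    # The matrix is symmetric, so column means equal row means.
    return [
        [dist[i][j] - row_means[i] - row_means[j] + grand_mean for j in range(n)]
        for i in range(n)
    ]


def distance_correlation(x, y):
    """Sample distance correlation of two equally long numeric sequences."""
    n = len(x)
    a, b = _centered_distances(x), _centered_distances(y)

    def mean_product(p, q):
        return sum(p[i][j] * q[i][j] for i in range(n) for j in range(n)) / (n * n)

    dcov2, dvar_x, dvar_y = mean_product(a, b), mean_product(a, a), mean_product(b, b)
    if dvar_x == 0 or dvar_y == 0:
        return 0.0  # a constant variable carries no dependence information
    return sqrt(max(dcov2, 0.0) / sqrt(dvar_x * dvar_y))


def pearson(x, y):
    """Plain Pearson correlation, for comparison."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((p - mx) * (q - my) for p, q in zip(x, y))
    return cov / sqrt(sum((p - mx) ** 2 for p in x) * sum((q - my) ** 2 for q in y))


xs = [-2, -1, 0, 1, 2]
ys = [v * v for v in xs]                   # purely non-linear dependence

dcor_nonlinear = distance_correlation(xs, ys)   # clearly above 0
pearson_nonlinear = pearson(xs, ys)             # exactly 0 for this symmetric data
dcor_identical = distance_correlation(xs, xs)   # 1 up to rounding
```

In this toy case Pearson reports no association at all, while distance correlation picks up the quadratic relationship.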

`"distance_correlation_fast"` uses an optimized implementation of distance correlation, but only supports quantitative variables.

`"pearson"` is the standard Pearson correlation coefficient, which also only supports quantitative variables.

Lastly, `"predictive_power"` calculates a version of the predictive power score (PPS). This essentially fits a decision tree to predict variable y using only variable x as a predictor, and measures its performance relative to a dummy/baseline prediction. It supports both categorical and quantitative variables.
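The idea of scoring a predictor relative to a baseline can be sketched as follows. This is a deliberately simplified stand-in and not the step's algorithm: it handles only categorical variables, replaces the decision tree with a per-category majority vote, evaluates on the same data it was fitted on, and uses plain accuracy as the metric. The function name is hypothetical.

```python
from collections import Counter, defaultdict


def predictive_power(x, y):
    """Simplified predictive power of categorical x for categorical y, in [0, 1]."""
    n = len(y)
    # Baseline: always predict the overall most frequent value of y.
    baseline_acc = Counter(y).most_common(1)[0][1] / n
    if baseline_acc == 1.0:
        return 0.0  # y is constant, so there is nothing to improve on

    # Model: predict the most frequent y within each category of x.
    groups = defaultdict(Counter)
    for a, b in zip(x, y):
        groups[a][b] += 1
    model_acc = sum(c.most_common(1)[0][1] for c in groups.values()) / n

    # Share of the gap between baseline and perfect prediction that x closes.
    return max(0.0, (model_acc - baseline_acc) / (1.0 - baseline_acc))


city = ["Paris", "Lyon", "Rome", "Milan"] * 2
country = ["FR", "FR", "IT", "IT"] * 2

city_to_country = predictive_power(city, country)   # city determines country
country_to_city = predictive_power(country, city)   # country only narrows it down
```

Unlike the other statistics, a score of this kind is generally asymmetric: knowing the city fully determines the country here, but not the other way round.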

Must be one of: `"mutual_information"`, `"distance_correlation"`, `"distance_correlation_fast"`, `"pearson"`, `"predictive_power"`

min_similarity: number = 0.0

Absolute similarity threshold. The minimum "correlation" for the creation of a link between two variables.

min_similarity_quantile: number = 0.5

Similarity threshold expressed as a quantile. E.g. a value of 0.6 means the bottom 60% of "correlations" will be discarded. Both minima (absolute and quantile) must be exceeded for a link to be created.

Range: `0.0 ≤ min_similarity_quantile ≤ 1`
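How the absolute and quantile thresholds might combine can be sketched like this. The helper names, the linearly interpolated quantile, and the strict "greater than" comparison are assumptions made for illustration; the step's exact quantile definition and tie handling are not documented here.

```python
def quantile(values, q):
    """Linearly interpolated q-quantile of a non-empty list (0 <= q <= 1)."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def keep_links(scores, min_similarity=0.0, min_similarity_quantile=0.5):
    """Keep pairs whose score exceeds both the absolute and the quantile cut-off."""
    quantile_cutoff = quantile(list(scores.values()), min_similarity_quantile)
    cutoff = max(min_similarity, quantile_cutoff)
    return {pair: s for pair, s in scores.items() if s > cutoff}


scores = {
    ("a", "b"): 0.9, ("a", "c"): 0.1, ("a", "d"): 0.4,
    ("b", "c"): 0.6, ("b", "d"): 0.2, ("c", "d"): 0.7,
}

default_links = keep_links(scores)                              # top half survives
strict_links = keep_links(scores, min_similarity_quantile=0.75)
absolute_links = keep_links(scores, min_similarity=0.8)         # absolute cut dominates
```

Because both minima must be exceeded, the effective cut-off is whichever of the two thresholds is larger for the data at hand.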

missing_weight: number | null

Replacement for discarded correlations. The weight of links whose correlations don't pass the minimum threshold. Links with weights of `null` will be discarded (the default behavior). Can be set e.g. to 0, to generate all possible links.
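The effect of `missing_weight` can be illustrated with a small sketch. The function name and the single `cutoff` argument (standing for whatever threshold results from the two minima) are assumptions; Python's `None` plays the role of `null`.

```python
def apply_missing_weight(scores, cutoff, missing_weight=None):
    """Replace scores at or below the cut-off; drop them when the replacement is None."""
    links = {}
    for pair, score in scores.items():
        weight = score if score > cutoff else missing_weight
        if weight is not None:
            links[pair] = weight
    return links


scores = {("a", "b"): 0.9, ("a", "c"): 0.1}

sparse = apply_missing_weight(scores, cutoff=0.5)                    # weak link dropped
dense = apply_missing_weight(scores, cutoff=0.5, missing_weight=0)   # weak link kept at 0
```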

n_samples: integer = 2000

Number of samples. The maximum number of samples to use when measuring "correlations".

top_n_links: integer | null

Maximum number of links per node. Ranks the links by weight and keeps only the most similar targets.
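A per-node ranking of this kind could look like the sketch below. It assumes links are stored as directed `(source, target)` entries (an undirected link would appear once in each direction); the function name and data layout are illustrative only.

```python
from collections import defaultdict


def top_n_links(links, n=None):
    """Keep, for every source node, its n heaviest links (all of them if n is None)."""
    by_source = defaultdict(list)
    for (source, target), weight in links.items():
        by_source[source].append((target, weight))
    return {
        source: sorted(targets, key=lambda t: t[1], reverse=True)[:n]
        for source, targets in by_source.items()
    }


links = {
    ("a", "b"): 0.9, ("a", "c"): 0.4, ("a", "d"): 0.7,
    ("b", "a"): 0.9, ("b", "c"): 0.2,
}

top_two = top_n_links(links, 2)
everything = top_n_links(links)   # n=None mirrors the null default
```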

random_seed: integer = 0

The random seed used, if applicable, for the selected `method`.
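One plausible way `n_samples` and `random_seed` interact is reproducible row subsampling before the statistic is computed. The sketch below is an assumption about that interaction, not documented behavior: the seed may equally be used inside the chosen statistic itself, and `sample_rows` is a hypothetical helper.

```python
import random


def sample_rows(rows, n_samples=2000, random_seed=0):
    """Return at most n_samples rows, chosen reproducibly for a given seed."""
    if len(rows) <= n_samples:
        return list(rows)
    rng = random.Random(random_seed)   # local generator, global state untouched
    return rng.sample(rows, n_samples)


rows = list(range(10_000))
first = sample_rows(rows)
second = sample_rows(rows)              # same seed, same subset
other = sample_rows(rows, random_seed=1)
```

Fixing the seed makes repeated runs on the same data produce the same network, which matters when thresholds sit close to the measured scores.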