Link similar columns

network · correlation

Calculates all pair-wise column dependencies (by default mutual information).

Creates a network in which nodes represent the dataset's variables (its columns rather than its rows), and links represent dependencies between variables (i.e. associations/correlations, measured by default as "mutual information").

In effect, all pair-wise "correlations" between the dataset's columns are calculated. A threshold is then applied to extract only the largest (most interesting) "correlations". These are then translated into network links between nodes representing the original variables. Each node in the new dataset will also contain information about its correlation with all other nodes (variables).
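As a rough illustration of the pair-wise calculation described above, the following Python sketch computes mutual information between every pair of categorical columns. It is not the step's actual implementation: the function names `mutual_information` and `pairwise_dependencies`, the toy data, and the restriction to categorical values are all assumptions made for the example, and how the step estimates dependencies for quantitative variables is not covered here.

```python
import math
from collections import Counter
from itertools import combinations


def mutual_information(xs, ys):
    """Mutual information (in nats) between two equal-length categorical sequences."""
    n = len(xs)
    px = Counter(xs)
    py = Counter(ys)
    pxy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), count in pxy.items():
        # p(x, y) * log(p(x, y) / (p(x) * p(y))), written in terms of raw counts
        mi += (count / n) * math.log(count * n / (px[x] * py[y]))
    return mi


def pairwise_dependencies(columns):
    """Return {(col_a, col_b): score} for every unordered pair of columns."""
    return {
        (a, b): mutual_information(columns[a], columns[b])
        for a, b in combinations(columns, 2)
    }


# Hypothetical toy dataset: "colour" determines "shape"; "size" is independent of both.
columns = {
    "colour": ["red", "red", "blue", "blue"],
    "shape": ["circle", "circle", "square", "square"],
    "size": ["S", "L", "S", "L"],
}
scores = pairwise_dependencies(columns)
```

In this toy case the dependent pair ("colour", "shape") gets a positive score, while pairs involving "size" score zero, so only the former would survive a threshold.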

Note: in this first version, only quantitative and categorical variables will be analyzed (but not tags, lists, embeddings etc.).


The step accepts a dataset as input and produces a new "nodes" dataset (containing one row for each original column), as well as a "links" dataset, containing the "correlations" between the nodes above a given threshold.

link_similar_columns(ds) -> (nodes, links)
More examples

The following example will create a network in which all original variables are present and each variable is connected to all others. However, for pairs of variables whose correlation falls in the bottom 75%, the weight of the corresponding link will be set to 0.

link_similar_columns(ds, {
  "min_similarity_quantile": 0.75,
  "missing_weight": 0
}) -> (nodes, links)
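To make the effect of these two parameters concrete, here is a minimal Python sketch of how a quantile threshold and a replacement weight could interact. The helper `apply_thresholds`, the exact quantile rule (value at position floor(q * n) of the sorted scores) and the use of "greater than or equal" comparisons are assumptions for illustration only; the step may compute its cutoff differently.

```python
def apply_thresholds(scores, min_similarity=0.0, min_similarity_quantile=0.6,
                     missing_weight=None):
    """Keep pairs passing both thresholds; others get missing_weight,
    or are dropped entirely when missing_weight is None."""
    ranked = sorted(scores.values())
    # Assumed quantile rule: the value at position floor(q * n) in the sorted scores.
    idx = min(int(min_similarity_quantile * len(ranked)), len(ranked) - 1)
    quantile_cutoff = ranked[idx]
    links = []
    for (source, target), score in scores.items():
        if score >= min_similarity and score >= quantile_cutoff:
            links.append((source, target, score))
        elif missing_weight is not None:
            links.append((source, target, missing_weight))
    return links


# Hypothetical pair-wise scores between four variables.
scores = {("a", "b"): 0.1, ("a", "c"): 0.2, ("a", "d"): 0.3, ("b", "c"): 0.9}

# With a replacement weight of 0, every pair stays in the network.
all_links = apply_thresholds(scores, min_similarity_quantile=0.75, missing_weight=0)

# Without a replacement weight, pairs below the cutoff are dropped.
kept_only = apply_thresholds(scores, min_similarity_quantile=0.75)
```

Here `all_links` contains all four pairs, three of them with weight 0, whereas `kept_only` contains just the strongest pair.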


The following are the step's expected inputs and outputs and their specific types.

link_similar_columns(ds: dataset, {"param": value}) -> (nodes: dataset, links: dataset)

where the object {"param": value} is optional in most cases and if present may contain any of the parameters described in the corresponding section below.


ds: dataset

Input dataset containing arbitrary columns.


nodes: dataset

A dataset containing M rows and M columns (where M is the number of columns in the input dataset). Each row represents a variable in the original dataset, and the columns contain the "correlations" with the remaining variables.

links: dataset

A dataset of links (containing source, target and weight columns), where each link connects two nodes representing variables in the original dataset, and its weight corresponds to the "correlation" between the two variables.


min_similarity: number = 0.0

Absolute similarity threshold. The minimum "correlation" for the creation of a link between two variables.

Range: -1 ≤ min_similarity ≤ 1

min_similarity_quantile: number = 0.6

Similarity threshold expressed as a quantile. E.g. a value of 0.6 means the bottom 60% of "correlations" will be discarded. Both minima (absolute and quantile) must be exceeded for a link to be created.

Range: 0.0 ≤ min_similarity_quantile ≤ 1

missing_weight: number | null

Replacement for discarded correlations. The weight assigned to links whose correlations don't pass the minimum thresholds. Links with a weight of null are discarded (the default behavior). Can be set e.g. to 0 to generate all possible links.

n_samples: integer = 2000

Number of samples. The maximum number of samples to use when measuring "correlations".

top_n_links: integer | null

Maximum number of links per node. Ranks the links by weight and keeps only the most similar targets.
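The per-node limit can be pictured with the following Python sketch. The helper `keep_top_n_links` is hypothetical, and it assumes that links are grouped by their source node and that a null (`None`) limit means no limit at all; the step's actual ranking and tie-breaking behavior may differ.

```python
from collections import defaultdict


def keep_top_n_links(links, top_n_links):
    """Keep, for each source node, at most top_n_links links with the largest weights."""
    by_source = defaultdict(list)
    for source, target, weight in links:
        by_source[source].append((source, target, weight))
    kept = []
    for source_links in by_source.values():
        # Rank this node's links from most to least similar target.
        source_links.sort(key=lambda link: link[2], reverse=True)
        # Slicing with None keeps everything, i.e. no limit.
        kept.extend(source_links[:top_n_links])
    return kept


# Hypothetical links as (source, target, weight) tuples.
links = [("a", "b", 0.9), ("a", "c", 0.5), ("a", "d", 0.7), ("b", "c", 0.4)]
top_two = keep_top_n_links(links, 2)
unlimited = keep_top_n_links(links, None)
```

With a limit of 2, node "a" loses its weakest link (to "c"), while node "b" keeps its single link.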