link_similar_columns
Calculates all pair-wise column dependencies (by default mutual information).
Creates a new network dataset in which nodes (rows) represent the original dataset’s variables (its columns) and links represent dependencies between variables (i.e. associations/correlations). By default, the step measures “mutual information” between variables.
In effect, all pair-wise “correlations” between the original dataset’s columns are calculated. A threshold is then applied to keep only the strongest (most interesting) “correlations”, which are translated into network links between the nodes representing the original variables. Each node/row in the new dataset also contains information about its correlation with all other nodes (variables).
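The procedure described above can be illustrated with a minimal Python sketch. It is not the step's actual implementation: the helper names `mutual_information` and `link_columns`, the `threshold` argument and the toy data are all assumptions made for illustration, and it handles categorical values only (quantitative columns would first need to be binned).

```python
from collections import Counter
from itertools import combinations
from math import log


def mutual_information(xs, ys):
    """Mutual information (in nats) between two equally long discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    # Sum p(x, y) * log(p(x, y) / (p(x) * p(y))) over observed value pairs.
    return sum(
        (c / n) * log(c * n / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )


def link_columns(table, threshold):
    """Hypothetical helper: return (col_a, col_b, weight) for pairs above threshold."""
    links = []
    for a, b in combinations(table, 2):
        weight = mutual_information(table[a], table[b])
        if weight > threshold:
            links.append((a, b, weight))
    return links


# Toy dataset: "colour" and "shape" are fully dependent, "size" is independent of both.
table = {
    "colour": ["red", "red", "blue", "blue"],
    "shape": ["round", "round", "square", "square"],
    "size": ["S", "L", "S", "L"],
}
links = link_columns(table, threshold=0.1)
```

In this toy example only the pair `colour`/`shape` survives the threshold, so the resulting network would have three nodes (one per column) and a single link.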
Note: in this first version, only quantitative and categorical variables are analyzed (not tags, lists, embeddings, etc.). You can pass a dataset with arbitrary column types, but columns not supported by the selected correlation method will be ignored in the result.