Filter duplicate nodes

network

Remove duplicate nodes in a network.

For each pair of nodes connected by a link whose weight indicates a similarity greater than a specified threshold, the step keeps only one of the two nodes and rewires the deleted node's incoming and outgoing links to point to the "surviving" node.
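The behaviour described above can be sketched in plain Python. This is only an illustration under stated assumptions, not the step's actual implementation: the function name `filter_duplicate_nodes_sketch`, the `(source, target, weight)` link representation, the choice to keep the source side of each duplicate pair, and the removal of self-loops and repeated links after rewiring are all hypothetical.

```python
from typing import Hashable

# Assumed representation: a link is a (source, target, weight) tuple.
Link = tuple[Hashable, Hashable, float]


def filter_duplicate_nodes_sketch(
    nodes: list[Hashable],
    links: list[Link],
    duplicate_threshold: float,
) -> tuple[list[Hashable], list[Link]]:
    """Keep one node of every linked pair whose weight exceeds the threshold."""
    survivor: dict[Hashable, Hashable] = {}  # deleted node -> node replacing it

    def resolve(node: Hashable) -> Hashable:
        # Follow the chain of replacements until reaching a kept node.
        while node in survivor:
            node = survivor[node]
        return node

    for source, target, weight in links:
        a, b = resolve(source), resolve(target)
        if weight > duplicate_threshold and a != b:
            survivor[b] = a  # assumption: the source side survives

    kept_nodes = [n for n in nodes if n not in survivor]

    rewired: list[Link] = []
    seen: set[tuple[Hashable, Hashable]] = set()
    for source, target, weight in links:
        a, b = resolve(source), resolve(target)
        # Assumption: drop self-loops and repeated links created by rewiring.
        if a == b or (a, b) in seen:
            continue
        seen.add((a, b))
        rewired.append((a, b, weight))

    return kept_nodes, rewired


nodes = ["a", "b", "c"]
links = [("a", "b", 0.95), ("b", "c", 0.4), ("c", "a", 0.2)]
kept, rewired = filter_duplicate_nodes_sketch(nodes, links, 0.9)
```

In this sketch node "b" is treated as a duplicate of "a", so the link that originally left "b" now leaves "a". Which of the two nodes the real step keeps is not specified in this page.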

Example

To de-duplicate pairs of nodes with a link weight (similarity) greater than 0.9:

filter_duplicate_nodes(ds, links, {
  "duplicate_threshold": 0.9
}) -> (ds_filtered, links_filtered)

Usage

The following are the step's expected inputs and outputs and their specific types.

filter_duplicate_nodes(
    data: dataset,
    links: dataset,
    {
        "param": value
    }
) -> (data_flt: dataset, links_flt: dataset)

where the object {"param": value} is optional in most cases and, if present, may contain any of the parameters described in the Parameters section below.

Inputs


data: dataset

A dataset containing the nodes (rows) to de-duplicate.


links: dataset

A dataset containing the links between nodes of the input dataset.

Outputs


data_flt: dataset

A new dataset containing the same columns as the input data, but without duplicate nodes.


links_flt: dataset

A new dataset containing the same columns as the input links, but having connections rewired such that none points to a deleted node.

Parameters


remove_duplicates: boolean = True

Whether to de-duplicate nodes linked with a similarity greater than duplicate_threshold.


duplicate_threshold: number | null

Similarity threshold above which linked nodes are treated as duplicates. For each pair of nodes linked with a weight (usually similarity) greater than this value, one of the two nodes will be eliminated. The default (null) corresponds to positive infinity, i.e. no de-duplication.
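A minimal sketch of how the two parameters might combine, assuming that a null threshold is mapped to positive infinity and that remove_duplicates acts as a simple on/off switch. The helper name `is_duplicate_link` is hypothetical, and the exact interaction between the two parameters is an assumption not confirmed by this page.

```python
import math
from typing import Optional


def is_duplicate_link(
    weight: float,
    duplicate_threshold: Optional[float] = None,
    remove_duplicates: bool = True,
) -> bool:
    """Return True if a link with this weight would mark a duplicate pair."""
    if not remove_duplicates:
        # Assumption: switching the flag off disables de-duplication entirely.
        return False
    # A null (None) threshold behaves like positive infinity: nothing exceeds it.
    threshold = math.inf if duplicate_threshold is None else duplicate_threshold
    # Strictly greater than, matching "greater than this value" in the text.
    return weight > threshold
```

With the defaults no link qualifies, so a threshold has to be set explicitly for any node to be removed.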