Filter duplicate nodes


Remove duplicate nodes in a network.

For each pair of nodes connected by a link whose weight (similarity) is greater than a specified threshold, the step keeps only one of the two nodes and rewires the deleted node's incoming and outgoing links to point to the "surviving" node.
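The keep-one-and-rewire behaviour described above can be illustrated with a minimal Python sketch. This is not the step's actual implementation: the function name `filter_duplicate_nodes_sketch`, the `(source, target, weight)` tuple layout, the choice of the link's source as the surviving node, and the dropping of links that become self-loops after rewiring are all assumptions made for illustration.

```python
from typing import Hashable, List, Tuple

# Assumed link layout for this sketch: (source, target, weight)
Link = Tuple[Hashable, Hashable, float]


def filter_duplicate_nodes_sketch(
    nodes: List[Hashable],
    links: List[Link],
    duplicate_threshold: float,
) -> Tuple[List[Hashable], List[Link]]:
    # Maps each deleted node to the node that replaces it.
    survivor = {}

    def resolve(node: Hashable) -> Hashable:
        # Follow replacements until reaching a node that was not deleted.
        while node in survivor:
            node = survivor[node]
        return node

    # Pass 1: for every link above the threshold, keep one endpoint.
    # Assumption: the (resolved) source survives, the target is deleted.
    for source, target, weight in links:
        if weight > duplicate_threshold:
            keep, drop = resolve(source), resolve(target)
            if keep != drop:
                survivor[drop] = keep

    kept_nodes = [n for n in nodes if n not in survivor]

    # Pass 2: rewire every link so that no endpoint is a deleted node.
    # Assumption: links that collapse into self-loops are dropped.
    rewired = []
    for source, target, weight in links:
        s, t = resolve(source), resolve(target)
        if s != t:
            rewired.append((s, t, weight))

    return kept_nodes, rewired


nodes = ["a", "b", "c"]
links = [("a", "b", 0.95), ("b", "c", 0.4), ("c", "a", 0.2)]
kept, rewired = filter_duplicate_nodes_sketch(nodes, links, 0.9)
# kept    -> ["a", "c"]
# rewired -> [("a", "c", 0.4), ("c", "a", 0.2)]
```

In this toy network, "b" is deleted because its link to "a" has weight 0.95, and the former link from "b" to "c" now starts at "a".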


To de-duplicate pairs of nodes with a link weight (similarity) greater than 0.9:

filter_duplicate_nodes(ds, links, {
  "duplicate_threshold": 0.9
}) -> (ds_filtered, links_filtered)


The following are the step's expected inputs and outputs and their specific types.

filter_duplicate_nodes(
    data: dataset,
    links: dataset,
    {
        "param": value
    }
) -> (data_flt: dataset, links_flt: dataset)

where the object {"param": value} is optional in most cases and, if present, may contain any of the parameters described in the corresponding section below.


data: dataset

A dataset containing the nodes (rows) to de-duplicate.

links: dataset

A dataset containing the links between nodes of the input dataset.


data_flt: dataset

A new dataset containing the same columns as the input data, but without duplicate nodes.

links_flt: dataset

A new dataset containing the same columns as the input links, but having connections rewired such that none points to a deleted node.


remove_duplicates: boolean = True

Whether to de-duplicate nodes linked with a similarity greater than duplicate_threshold.

duplicate_threshold: number | null

Similarity threshold above which linked nodes are treated as duplicates. For any pair of nodes linked with a weight (usually a similarity) greater than this value, one of the two nodes is eliminated. The default (null) corresponds to positive infinity, i.e. no de-duplication.
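As a rough illustration of how the two parameters might combine, the following Python sketch maps a null threshold to positive infinity and uses a strict "greater than" comparison, as described above. The helper names `effective_threshold` and `marks_duplicate` are hypothetical, and treating `remove_duplicates = False` as equivalent to an infinite threshold is an assumption, not documented behaviour.

```python
import math
from typing import Optional


def effective_threshold(
    duplicate_threshold: Optional[float] = None,
    remove_duplicates: bool = True,
) -> float:
    # A null (None) threshold corresponds to positive infinity.
    # Assumption: disabling remove_duplicates has the same effect.
    if not remove_duplicates or duplicate_threshold is None:
        return math.inf
    return duplicate_threshold


def marks_duplicate(
    weight: float,
    duplicate_threshold: Optional[float] = None,
    remove_duplicates: bool = True,
) -> bool:
    # Strictly greater than: a weight equal to the threshold is not a duplicate.
    return weight > effective_threshold(duplicate_threshold, remove_duplicates)


print(marks_duplicate(0.95, 0.9))   # True
print(marks_duplicate(0.9, 0.9))    # False
print(marks_duplicate(1.0))         # False: default threshold is infinite
```

With the default null threshold no finite weight can exceed it, so under these assumptions nothing is removed unless duplicate_threshold is set explicitly.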