filter_duplicate_nodes
Remove duplicate nodes in a network.
For each pair of nodes connected by a link that indicates a similarity greater than a specified threshold, keeps only one of the two nodes and rewires the deleted node’s incoming and outgoing links to point to the “surviving” node.
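The behaviour described above can be illustrated with a minimal Python sketch. This is not the step's actual implementation: the dict-of-dicts link representation, the choice to keep the link's source node as the survivor, and keeping the larger weight when two rewired links collide are all assumptions made for illustration.

```python
import math


def filter_duplicate_nodes(links, threshold=None):
    """Illustrative sketch only (not the step's real implementation).

    links: {node: {target: weight}}, with every node present as a key (assumed layout).
    threshold: links with weight strictly greater than this mark duplicates;
               None is treated as positive infinity (no de-duplication).
    """
    limit = math.inf if threshold is None else threshold
    survivor = {}  # deleted node -> node that replaces it

    def resolve(node):
        # Follow replacements until reaching a node that was not deleted.
        while node in survivor:
            node = survivor[node]
        return node

    # Pass 1: for each link above the threshold, keep one node of the pair.
    # Assumption: the link's source survives and its target is deleted.
    for source, targets in links.items():
        for target, weight in targets.items():
            keep, drop = resolve(source), resolve(target)
            if weight > limit and keep != drop:
                survivor[drop] = keep

    # Pass 2: rewire incoming and outgoing links of deleted nodes to survivors.
    rewired = {node: {} for node in links if node not in survivor}
    for source, targets in links.items():
        new_source = resolve(source)
        for target, weight in targets.items():
            new_target = resolve(target)
            if new_source == new_target:
                continue  # link between merged nodes would be a self-loop
            # Assumption: on collision keep the larger weight.
            previous = rewired[new_source].get(new_target, weight)
            rewired[new_source][new_target] = max(weight, previous)
    return rewired


example_links = {
    "a": {"b": 0.95, "c": 0.3},
    "b": {"c": 0.5},
    "c": {"a": 0.2},
}
deduplicated = filter_duplicate_nodes(example_links, threshold=0.9)
```

In this toy network, node b is removed because its link from a exceeds 0.9, and b's outgoing link to c is reattached to a.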
Usage
The following example shows how the step can be used in a recipe.
Examples
To de-duplicate pairs of nodes with a link weight (similarity) greater than 0.9
General syntax for using the step in a recipe. Shows the inputs the step expects and the outputs it produces. For further details see the sections below.
Inputs & Outputs
The following are the inputs expected by the step and the outputs it produces. These are generally columns (ds.first_name), datasets (ds or ds[["first_name", "last_name"]]) or models (referenced by name e.g. "churn-clf").
Inputs
A dataset containing the nodes (rows) to de-duplicate and the links between those nodes.
Outputs
A new dataset containing the same columns as the input data, but without duplicate rows and with connections rewired such that none points to a deleted node.
Configuration
The following parameters can be used to configure the behaviour of the step by including them in a JSON object as the last “input” to the step, i.e. step(..., {"param": "value", ...}) -> (output).
Parameters
Similarity threshold for candidate nodes to be eliminated. Any node linked to another node with a weight (usually similarity) greater than this value will be eliminated. Default (null) corresponds to positive infinity (no de-duplication).
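The threshold semantics can be sketched in Python as follows. The key name "threshold" and the helper is_duplicate_link are hypothetical, used only for illustration; the sketch assumes the comparison is strict ("greater than"), as the description states, and that a JSON null is read as "no threshold".

```python
import json
import math


def is_duplicate_link(weight, threshold=None):
    """Return True if a link with this weight marks its nodes as duplicates."""
    # A missing (null) threshold behaves like positive infinity,
    # so no finite weight can ever exceed it.
    limit = math.inf if threshold is None else threshold
    return weight > limit


# Hypothetical configuration object; JSON null is parsed to Python None.
config = json.loads('{"threshold": null}')

above = is_duplicate_link(0.95, 0.9)                  # weight above the threshold
equal = is_duplicate_link(0.9, 0.9)                   # equal weight is not "greater than"
unset = is_duplicate_link(1.0, config["threshold"])   # null disables de-duplication
```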