aggregate_neighbours
For each node in a network, group and aggregate over its neighbours.
Using the link columns in the provided dataset (including at least a targets column containing lists of the target row numbers that each row connects to), calculate the requested aggregations for each row over all its direct (first-degree) neighbours.
The step uses the first set of link columns encountered in the dataset's metadata.
Usage
The following example shows how the step can be used in a recipe.
Assuming a dataset products where each row represents a supermarket product (having at least a price and an aisle column), and which contains a targets column representing connections between similar products, the following example calculates for each product:
- the average price of similar products
- the percentage of similar products assigned to the aisles “produce”, “deli” and “drinks”
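The two aggregations listed above can be sketched in plain Python. This illustrates the semantics only; it is not the step's recipe syntax or its implementation. The sample rows, the helper neighbour_summary and the output names (avg_price, pct_produce, ...) are hypothetical, and both the 0-100 percentage scale and the handling of rows without neighbours are assumptions.

```python
from statistics import mean

# Hypothetical sample data: each row is a product, and `targets` lists the
# row numbers of similar products (its direct neighbours).
products = [
    {"price": 1.0, "aisle": "produce", "targets": [1, 2]},
    {"price": 2.0, "aisle": "deli", "targets": [0]},
    {"price": 4.0, "aisle": "produce", "targets": [0, 1]},
    {"price": 3.0, "aisle": "drinks", "targets": []},
]


def neighbour_summary(rows, aisles=("produce", "deli", "drinks")):
    """For each row, aggregate over the rows listed in its `targets`."""
    summaries = []
    for row in rows:
        neighbours = [rows[i] for i in row["targets"]]
        summary = {}
        # Assumption: rows without neighbours get missing values (None).
        summary["avg_price"] = (
            mean(n["price"] for n in neighbours) if neighbours else None
        )
        for aisle in aisles:
            if neighbours:
                matches = sum(n["aisle"] == aisle for n in neighbours)
                # Assumption: percentages are expressed on a 0-100 scale.
                summary[f"pct_{aisle}"] = 100.0 * matches / len(neighbours)
            else:
                summary[f"pct_{aisle}"] = None
        summaries.append(summary)
    return summaries


summaries = neighbour_summary(products)
```

Here the first product has two neighbours (rows 1 and 2), so its average neighbour price is the mean of 2.0 and 4.0, and half of its neighbours are in the “produce” aisle.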
Inputs & Outputs
The following are the inputs expected by the step and the outputs it produces. These are generally columns (ds.first_name), datasets (ds or ds[["first_name", "last_name"]]) or models (referenced by name, e.g. "churn-clf").
Configuration
The following parameters can be used to configure the behaviour of the step by including them in a JSON object as the last “input” to the step, i.e. step(..., {"param": "value", ...}) -> (output).
Pre-aggregation row sorting.
Sort the dataset rows before aggregating, e.g. when the encountered order matters for a particular aggregation function (such as list).
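To see why row order can matter, the following plain-Python sketch applies list-, first- and last-style aggregations to the same neighbours before and after sorting. The sample data and the helper collect are hypothetical; the step's own sorting parameter is not shown here.

```python
# Hypothetical neighbours of one node, in the order they are encountered.
neighbours = [
    {"name": "cola", "price": 2.5},
    {"name": "apple", "price": 0.5},
    {"name": "ham", "price": 4.0},
]


def collect(rows, column):
    """A list-style aggregation: keep every value in encounter order."""
    return [row[column] for row in rows]


# Without sorting, the result reflects the original row order.
unsorted_names = collect(neighbours, "name")

# Sorting by price first changes what list, first and last return.
by_price = sorted(neighbours, key=lambda row: row["price"])
sorted_names = collect(by_price, "name")
cheapest = sorted_names[0]   # first-style aggregation after sorting
priciest = sorted_names[-1]  # last-style aggregation after sorting
```

Order-insensitive aggregations such as sum or n_unique would give the same result either way.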
Definition of desired aggregations.
A dictionary mapping original columns to new aggregated columns, specifying an aggregation function for each.
Aggregations are functions that reduce all the values in a particular column of a single group to a single summary value of that group.
E.g. a sum aggregation of column A calculates a single total by adding up all the values in A belonging to each group.
Possible aggregation functions accepted as func parameters are:
- n, size or count: calculate the number of rows in the group
- sum: sum total of values
- mean: take the mean of values
- max: take the max of values
- min: take the min of values
- first: take the first item found
- last: take the last item found
- unique: collect a list of unique values
- n_unique: count the number of unique values
- list: collect a list of all values
- concatenate: convert all values to text and concatenate them into one long text
- concat_lists: concatenate the lists in all rows into a single larger list
- count_where: number of rows in which the column matches a value; needs the parameter value with the value that you want to count
- percent_where: percentage of rows in which the column matches a value; needs the parameter value with the value that you want to count
Note that in the case of count_where and percent_where an additional value parameter is required.
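As a rough guide to what these functions compute, each can be approximated by a small Python callable applied to the values of one group. This is a sketch of the semantics, not the step's implementation: the AGGREGATIONS mapping and the sample values are hypothetical, and the details marked as assumptions (no separator in concatenate, first-seen order for unique, a 0-100 scale for percent_where) are not confirmed by the description above.

```python
from statistics import mean


def count_where(values, value):
    """Number of entries equal to `value`."""
    return sum(v == value for v in values)


def percent_where(values, value):
    """Share of entries equal to `value` (assumption: 0-100 scale)."""
    return 100.0 * count_where(values, value) / len(values)


# Hypothetical mapping from func names to plain-Python equivalents.
AGGREGATIONS = {
    "count": len,
    "sum": sum,
    "mean": mean,
    "max": max,
    "min": min,
    "first": lambda v: v[0],
    "last": lambda v: v[-1],
    # Assumption: unique values are kept in first-seen order.
    "unique": lambda v: list(dict.fromkeys(v)),
    "n_unique": lambda v: len(set(v)),
    "list": list,
    # Assumption: values are joined without a separator.
    "concatenate": lambda v: "".join(str(x) for x in v),
    "concat_lists": lambda v: [x for sub in v for x in sub],
}

# Values of two columns for the rows of a single group.
aisles = ["produce", "deli", "produce", "drinks"]
prices = [1.0, 2.0, 4.0, 3.0]

total_price = AGGREGATIONS["sum"](prices)
mean_price = AGGREGATIONS["mean"](prices)
unique_aisles = AGGREGATIONS["unique"](aisles)
n_aisles = AGGREGATIONS["n_unique"](aisles)
n_produce = count_where(aisles, "produce")
pct_produce = percent_where(aisles, "produce")
```

count_where and percent_where are kept as separate functions because they take the extra value argument in addition to the column values.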
Whether the links provided should be interpreted as being directed.
Directed here means that the link A→B (from node A to B) may be different from the link B→A (e.g. they may have different weight attributes). When "directed": false, in contrast, links are undirected: it is assumed that the link A→B is always identical to B→A (i.e. A↔B always). This is usually the case when links represent a similarity between nodes.
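One plausible reading of this setting, sketched in plain Python: with directed links a node's neighbours are exactly the rows listed in its targets, while with undirected links every link also counts in the reverse direction. The helper neighbour_lists and the sample targets are hypothetical, and whether the step resolves neighbours exactly this way is an assumption.

```python
def neighbour_lists(targets, directed):
    """Return the sorted neighbour row numbers for each row.

    `targets[i]` holds the row numbers that row i links to.
    """
    result = [set(row_targets) for row_targets in targets]
    if not directed:
        # Undirected: a link A->B implies B->A, so add the reverse links.
        for source, row_targets in enumerate(targets):
            for target in row_targets:
                result[target].add(source)
    return [sorted(neighbours) for neighbours in result]


# Hypothetical chain of links: 0 -> 1 -> 2.
targets = [[1], [2], []]

directed_neighbours = neighbour_lists(targets, directed=True)
undirected_neighbours = neighbour_lists(targets, directed=False)
```

In the directed case row 2 has no neighbours to aggregate over; in the undirected case it gains row 1 as a neighbour.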