
Aggregate neighbours

group by

For each node in a network, group and aggregate over its neighbours.

Given a primary dataset whose rows represent nodes, and an associated dataset representing links between those rows, this step calculates the requested aggregations for each node over all its direct (first-degree) neighbours.

Example

Assuming a primary dataset products, where each row represents a supermarket product (with at least a price and an aisle column), and a links dataset representing connections between similar products, the following example calculates for each product:

  • the average price of similar products
  • the percentage of similar products assigned to aisles "produce", "deli" and "drinks"

aggregate_neighbours(products, links, {
  "aggregations": {
    "price": {
      "similar_price_avg": {"func": "mean"}
    },
    "aisle": {
      "similar_pct_produce": {"func": "percent_where", "value": "produce"},
      "similar_pct_deli": {"func": "percent_where", "value": "deli"},
      "similar_pct_drinks": {"func": "percent_where", "value": "drinks"}
    }
  }
}) -> (products_agg)
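
To make the result of the example more concrete, the following pandas sketch reproduces the same kind of calculation on toy data. It is an illustration only, not the step's implementation. It assumes that source and target hold row indices of the products dataset, that links are undirected, and that percent_where returns values on a 0 to 100 scale; the toy values themselves are made up.

```python
import pandas as pd

# Hypothetical toy data. Assumption: source/target refer to row indices of `products`.
products = pd.DataFrame({
    "price": [1.0, 2.0, 4.0, 8.0],
    "aisle": ["produce", "deli", "produce", "drinks"],
})
links = pd.DataFrame({
    "source": [0, 0, 1],
    "target": [1, 2, 3],
    "weight": [0.9, 0.8, 0.7],
})

# Undirected links: mirror each link so that both endpoints see each other.
forward = links[["source", "target"]]
backward = forward.rename(columns={"source": "target", "target": "source"})
pairs = pd.concat([forward, backward[["source", "target"]]], ignore_index=True)

# Attach each neighbour's attributes, then group by the node they belong to.
neighbours = pairs.join(products, on="target")
grouped = neighbours.groupby("source")

products_agg = products.assign(
    similar_price_avg=grouped["price"].mean(),
    # Assumption: percentages are expressed on a 0-100 scale.
    similar_pct_produce=grouped["aisle"].agg(lambda s: 100 * (s == "produce").mean()),
)
print(products_agg)
```

In this toy network, product 0 is linked to products 1 and 2, so its similar_price_avg is the mean of 2.0 and 4.0.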

Usage

The following are the step's expected inputs and outputs and their specific types.

aggregate_neighbours(
    ds_in: dataset,
    links: dataset,
    {
        "param": value
    }
) -> (ds_out: dataset)

where the object {"param": value} is optional in most cases and if present may contain any of the parameters described in the corresponding section below.

Inputs


ds_in: dataset

A dataset containing the nodes (rows) to group and aggregate.


links: dataset

A dataset of links containing at least the columns source, target and weight. To create one, refer to the steps whose names start with the "link_" prefix.

Outputs


ds_out: dataset

The original dataset plus the newly aggregated columns. It will have one new column per specified aggregation (more than one aggregation can be specified for each original input column).

Parameters


presort: object

Pre-aggregation row sorting. Sorts the dataset rows before aggregating, which matters when the result of an aggregation function (such as list, first or last) depends on the order in which values are encountered.

Items in presort

columns: null | string | array[string]

The sort column name(s). These column(s) will be used to sort the dataset before aggregating (if multiple, in the specified order). E.g. to sort links by their weight first, and if the weight column is called "gx_weight", use "gx_weight".

Example parameter values:

  • "date_added"
  • ["lastname", "firstname"]

ascending: boolean = True

Whether to sort in ascending order (or in descending order if false).

Example parameter values:

  • To sort first by price, then by dimension, in descending order:

    {
      "columns": ["price", "dimension"],
      "ascending": false
    }
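
Why sorting before aggregating matters can be shown with a small pandas sketch. This only illustrates the general idea of sorting rows before an order-sensitive aggregation such as list; it is not the step's implementation, and the neighbours table and its node column are hypothetical.

```python
import pandas as pd

# Hypothetical table with one row per (node, neighbour) pair and the neighbour's price.
neighbours = pd.DataFrame({
    "node": [0, 0, 0, 1, 1],
    "price": [4.0, 1.0, 2.0, 8.0, 3.0],
})

# Without sorting, lists keep the order in which rows were encountered.
unsorted_lists = neighbours.groupby("node")["price"].agg(list)

# Sorting first (here descending, like "ascending": false) changes the list order.
sorted_lists = (
    neighbours.sort_values("price", ascending=False)
    .groupby("node")["price"]
    .agg(list)
)

print(unsorted_lists.loc[0])  # encountered order
print(sorted_lists.loc[0])    # descending by price
```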
    

aggregations: object

Definition of desired aggregations. A dictionary mapping original columns to new aggregated columns, specifying an aggregation function for each. Aggregations are functions that reduce all the values in a particular column of a single group to a single summary value of that group. In this step, each group consists of the direct neighbours of one node. E.g. a sum aggregation of column A calculates a single total by adding up all the values in A belonging to each group.

The aggregation functions accepted as the func parameter are:

  • n, size or count: calculate number of rows in group
  • sum: sum total of values
  • mean: take mean of values
  • max: take max of values
  • min: take min of values
  • first: take first item found
  • last: take last item found
  • unique: collect a list of unique values
  • n_unique: count the number of unique values
  • list: collect a list of all values
  • concatenate: convert all values to text and concatenate them into one long text
  • concat_lists: concatenate lists in all rows into a single larger list
  • count_where: number of rows in which the column matches a given value (specified with the value parameter)
  • percent_where: percentage of rows in which the column matches a given value (specified with the value parameter)

Note that in the case of count_where and percent_where an additional value parameter is required.
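
The two value-based functions can be sketched in pandas as follows. The helper functions below are hypothetical stand-ins written for illustration, not the step's own code, and the 0 to 100 scale for percent_where is an assumption.

```python
import pandas as pd

def count_where(values: pd.Series, value) -> int:
    """Number of rows whose entry equals `value`."""
    return int((values == value).sum())

def percent_where(values: pd.Series, value) -> float:
    """Share of rows whose entry equals `value` (assumed 0-100 scale)."""
    return 100.0 * float((values == value).mean())

# Hypothetical values of the aisle column for one node's neighbours.
aisles = pd.Series(["produce", "deli", "produce", "drinks"])

n_produce = count_where(aisles, "produce")
pct_produce = percent_where(aisles, "produce")
n_distinct = aisles.nunique()  # comparable to the n_unique function
print(n_produce, pct_produce, n_distinct)
```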

Items in aggregations

input_aggregations: object

One item per input column. Each key should be the name of an input column, and each value an object defining one or more aggregations for that column. An individual aggregation consists of the name of a desired output column, mapped to a specific aggregation function. For example:

{
  "input_col": {
    "output_col": {"func": "sum"}
  }
}
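
The nesting can be read as "input column, then output column, then function". As a rough sketch of that reading (the flatten_aggregations helper is hypothetical and not part of the step), the following Python walks such an object and lists every requested output column:

```python
# Example configuration, taken from the products example of this step.
aggregations = {
    "price": {
        "similar_price_avg": {"func": "mean"},
    },
    "aisle": {
        "similar_pct_produce": {"func": "percent_where", "value": "produce"},
        "similar_pct_deli": {"func": "percent_where", "value": "deli"},
    },
}

def flatten_aggregations(aggregations: dict) -> list:
    """Return one (input_col, output_col, func, extra_args) tuple per output column."""
    specs = []
    for input_col, outputs in aggregations.items():
        for output_col, spec in outputs.items():
            # Everything except "func" is treated as an extra argument, e.g. "value".
            extra = {key: val for key, val in spec.items() if key != "func"}
            specs.append((input_col, output_col, spec["func"], extra))
    return specs

for spec in flatten_aggregations(aggregations):
    print(spec)
```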

Items in input_aggregations

aggregation_func: object

Object defining how to aggregate a single output column. Needs at least the "func" parameter. If the aggregation function accepts further arguments, like the "value" parameter in case of count_where and percent_where, these need to be provided also. For example:

{
  "output_col": {"func": "count_where", "value": 2}
}

Items in aggregation_func

func: string

Aggregation function.

Must be one of: "n", "size", "count", "sum", "mean", "n_unique", "count_where", "percent_where", "concatenate", "max", "min", "first", "last", "concat_lists", "unique", "list"

Example parameter values:

  • Including an aggregation function with additional parameters:

    {
      "product_id": {
        "products": {"func": "list"},
        "size": {"func": "count"}
      },
      "item_total": {
        "total": {"func": "sum"}
      },
      "item_category": {
        "num_food_items": {"func": "count_where", "value": "food"}
      }
    }
    

directed: boolean = False

Whether the links provided should be interpreted as directed. In a directed network, the link A→B (from node A to node B) may differ from the link B→A, e.g. the two may have different weight attributes. With "directed": false, in contrast, links are undirected, and the link A→B is assumed to be identical to B→A (i.e. A↔B always). This is usually the case when links represent a similarity between nodes.
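
The difference can be sketched in pandas as follows. This is an illustration under stated assumptions, not the step's implementation: the neighbour_pairs helper is hypothetical, and the choice to treat only the targets of a node's outgoing links as its neighbours in the directed case is an assumption made for this sketch.

```python
import pandas as pd

def neighbour_pairs(links: pd.DataFrame, directed: bool = False) -> pd.DataFrame:
    """Return one row per (node, neighbour) pair implied by the links."""
    pairs = links[["source", "target"]]
    if not directed:
        # Undirected: A->B implies B->A, so add every link in reverse as well.
        mirrored = pairs.rename(columns={"source": "target", "target": "source"})
        pairs = pd.concat(
            [pairs, mirrored[["source", "target"]]], ignore_index=True
        ).drop_duplicates()
    # Assumption for the directed case: neighbours are targets of outgoing links.
    return pairs.rename(
        columns={"source": "node", "target": "neighbour"}
    ).reset_index(drop=True)

# Hypothetical links: 0 -> 1 and 1 -> 2.
links = pd.DataFrame({"source": [0, 1], "target": [1, 2], "weight": [0.5, 0.9]})

undirected = neighbour_pairs(links, directed=False)
directed = neighbour_pairs(links, directed=True)
print(undirected)
print(directed)
```

Under these assumptions, node 1 has neighbours 0 and 2 when links are undirected, but only neighbour 2 when they are directed.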