aggregate_neighbours

Using the link columns in the provided dataset (including at least a targets column containing, for each row, the list of target row numbers it connects to), calculates the requested aggregations for each row over all of its direct (first-degree) neighbours. The step uses the first set of link columns encountered in the dataset's metadata.

Usage

The following example shows how the step can be used in a recipe.

Examples

Assuming a dataset products in which each row represents a supermarket product (with at least a price and an aisle column), and which contains a targets column representing connections between similar products, the following example calculates for each product:

the average price of similar products
the percentage of similar products assigned to aisles “produce”, “deli” and “drinks”

aggregate_neighbours(products, {
  "aggregations": {
    "price": {
      "similar_price_avg": {"func": "mean"}
    },
    "aisle": {
      "similar_pct_produce": {"func": "percent_where", "value": "produce"},
      "similar_pct_deli": {"func": "percent_where", "value": "deli"},
      "similar_pct_drinks": {"func": "percent_where", "value": "drinks"}
    }
  }
}) -> (products_agg)
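To make the result of the recipe above concrete, here is a minimal Python sketch of the same calculation on a made-up three-row dataset. It is not the step's implementation: the data, the function name aggregate_example, the 0-based row numbers in targets, and the 0 to 100 percentage scale are all assumptions made for illustration.

```python
from statistics import mean

# Toy stand-in for the products dataset (values invented for illustration).
# "targets" holds the row numbers of similar products, assumed 0-based here.
products = [
    {"price": 2.0, "aisle": "produce", "targets": [1, 2]},
    {"price": 4.0, "aisle": "deli", "targets": [0]},
    {"price": 6.0, "aisle": "produce", "targets": [0]},
]

def aggregate_example(rows):
    """For each row, aggregate price and aisle over its direct neighbours."""
    result = []
    for row in rows:
        neighbours = [rows[i] for i in row["targets"]]
        prices = [n["price"] for n in neighbours]
        aisles = [n["aisle"] for n in neighbours]

        def pct(value):
            # Share of neighbours in the given aisle, on an assumed 0-100 scale.
            return 100.0 * sum(a == value for a in aisles) / len(aisles)

        result.append({
            "similar_price_avg": mean(prices),
            "similar_pct_produce": pct("produce"),
            "similar_pct_deli": pct("deli"),
            "similar_pct_drinks": pct("drinks"),
        })
    return result

products_agg = aggregate_example(products)
```

In this toy data the first product has two neighbours priced 4.0 and 6.0, one in "deli" and one in "produce", so its similar_price_avg is 5.0 and both similar_pct_produce and similar_pct_deli are 50.0. The sketch does not handle rows without neighbours.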

Inputs & Outputs

The following are the inputs expected by the step and the outputs it produces. These are generally columns (ds.first_name), datasets (ds or ds[["first_name", "last_name"]]) or models (referenced by name e.g. "churn-clf").

Inputs

Outputs

Configuration

The following parameters can be used to configure the behaviour of the step by including them in a JSON object as the last “input” to the step, i.e. step(..., {"param": "value", ...}) -> (output).

Parameters

presort

object

Pre-aggregation row sorting. Sorts the dataset rows before aggregating, which matters when the order in which values are encountered affects the result of an aggregation function (such as list).
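The effect of sorting before aggregating can be illustrated with a small Python sketch. It only demonstrates why order matters for a list-style aggregation and does not reproduce the presort parameter itself; the data, the helper names collect and collect_presorted, and the sort_by argument are hypothetical.

```python
# Made-up rows and neighbour row numbers, for illustration only.
rows = [
    {"name": "c", "price": 3.0},
    {"name": "a", "price": 1.0},
    {"name": "b", "price": 2.0},
]
neighbour_ids = [0, 1, 2]

def collect(rows, ids, column):
    """List-style aggregation in the order the rows are encountered."""
    return [rows[i][column] for i in ids]

def collect_presorted(rows, ids, column, sort_by):
    """Same aggregation, but visiting rows in ascending order of `sort_by`."""
    ordered = sorted(ids, key=lambda i: rows[i][sort_by])
    return [rows[i][column] for i in ordered]

unsorted_names = collect(rows, neighbour_ids, "name")
sorted_names = collect_presorted(rows, neighbour_ids, "name", sort_by="price")
```

Without sorting the names come out as encountered ("c", "a", "b"); after sorting by price they come out as "a", "b", "c". Order-sensitive functions such as first and last would change in the same way.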

Properties

Examples

aggregations

object

required

Definition of desired aggregations. A dictionary mapping original columns to new aggregated columns, specifying an aggregation function for each. Aggregations are functions that reduce all the values in a particular column of a single group to a single summary value of that group. E.g. a sum aggregation of column A calculates a single total by adding up all the values in A belonging to each group.

Possible aggregation functions accepted as func parameters are:

n, size or count: calculate number of rows in group
sum: sum total of values
mean: take mean of values
max: take max of values
min: take min of values
mode: find most frequent value (returns first mode if multiple exist)
first: take first item found
last: take last item found
unique: collect a list of unique values
n_unique: count the number of unique values
list: collect a list of all values
concatenate: convert all values to text and concatenate them into one long text
concat_lists: concatenate lists in all rows into a single larger list
count_where: number of rows in which the column matches a value; requires the parameter value, set to the value to count
percent_where: percentage of rows in which the column matches a value; requires the parameter value, set to the value to match

Note that in the case of count_where and percent_where an additional value parameter is required.
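As a rough guide to what these function names mean, the following Python sketch re-implements most of them over a plain list of values. It is an approximation rather than the step's code: the names AGG_FUNCS and apply_agg are hypothetical, percent_where is assumed to return a value between 0 and 100, empty groups and missing values are not handled, and concatenate is left out because the separator is not stated here.

```python
from collections import Counter
from statistics import mean

# Illustrative re-implementations; the step's real behaviour may differ.
AGG_FUNCS = {
    "count": lambda vals: len(vals),
    "sum": lambda vals: sum(vals),
    "mean": lambda vals: mean(vals),
    "max": lambda vals: max(vals),
    "min": lambda vals: min(vals),
    # most_common orders ties by first occurrence, so the first mode wins.
    "mode": lambda vals: Counter(vals).most_common(1)[0][0],
    "first": lambda vals: vals[0],
    "last": lambda vals: vals[-1],
    # dict.fromkeys drops duplicates while keeping encounter order.
    "unique": lambda vals: list(dict.fromkeys(vals)),
    "n_unique": lambda vals: len(set(vals)),
    "list": lambda vals: list(vals),
    "concat_lists": lambda vals: [item for sub in vals for item in sub],
    "count_where": lambda vals, value: sum(v == value for v in vals),
    "percent_where": lambda vals, value: 100.0 * sum(v == value for v in vals) / len(vals),
}
# n and size are aliases of count.
AGG_FUNCS["n"] = AGG_FUNCS["size"] = AGG_FUNCS["count"]

def apply_agg(values, spec):
    """Apply one aggregation spec such as {"func": "count_where", "value": 2}."""
    params = {key: val for key, val in spec.items() if key != "func"}
    return AGG_FUNCS[spec["func"]](list(values), **params)
```

For example, apply_agg(["food", "drink", "food"], {"func": "count_where", "value": "food"}) counts the two matching entries, while omitting "value" from that spec raises a TypeError in this sketch because the required argument is missing.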

Item properties

input_aggregations

object

One item per input column. Each key should be the name of an input column, and each value an object defining one or more aggregations for that column. An individual aggregation consists of the name of a desired output column, mapped to a specific aggregation function. For example:

{
  "input_col": {
    "output_col": {"func": "sum"}
  }
}

Item properties

aggregation_func

object

Object defining how to aggregate a single output column. Needs at least the "func" parameter. If the aggregation function accepts further arguments, like the "value" parameter in the case of count_where and percent_where, these must be provided as well. For example:

{
  "output_col": {"func": "count_where", "value": 2}
}

Properties

Examples

Including an aggregation function with additional parameters:

{
  "product_id": {
    "products": {"func": "list"},
    "size": {"func": "count"}
  },
  "item_total": {
    "total": {"func": "sum"}
  },
  "item_category": {
    "num_food_items": {"func": "count_where", "value": "food"}
  }
}
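The nesting of input columns, output columns and aggregation functions can be easier to follow when flattened. The Python sketch below walks a configuration of this shape and checks that count_where and percent_where come with a "value". The helper flatten_aggregations and the constant REQUIRES_VALUE are hypothetical names, and the validation shown is an assumption about what a reasonable check looks like, not the step's actual behaviour.

```python
import json

# Functions that need an extra "value" parameter, per the list of aggregations.
REQUIRES_VALUE = {"count_where", "percent_where"}

def flatten_aggregations(aggregations):
    """Turn the nested config into (input_col, output_col, func, params) tuples."""
    plan = []
    for input_col, outputs in aggregations.items():
        for output_col, spec in outputs.items():
            if "func" not in spec:
                raise ValueError(f"{output_col}: missing 'func'")
            func = spec["func"]
            if func in REQUIRES_VALUE and "value" not in spec:
                raise ValueError(f"{output_col}: '{func}' needs a 'value'")
            params = {key: val for key, val in spec.items() if key != "func"}
            plan.append((input_col, output_col, func, params))
    return plan

config = json.loads("""
{
  "product_id": {
    "products": {"func": "list"},
    "size": {"func": "count"}
  },
  "item_total": {
    "total": {"func": "sum"}
  },
  "item_category": {
    "num_food_items": {"func": "count_where", "value": "food"}
  }
}
""")

plan = flatten_aggregations(config)
```

Each tuple in plan names one output column to compute, so a single input column such as product_id can feed several output columns.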

directed

boolean

default: false

Whether the links provided should be interpreted as directed. Directed means that the link A→B (from node A to B) may differ from the link B→A (they may, for example, have different weight attributes). With "directed": false, i.e. undirected links, the link A→B is assumed to be always identical to B→A (A↔B). This is usually the case when links represent a similarity between nodes.
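One way to picture the difference is to look at which rows end up counting as neighbours. The Python sketch below assumes that with directed links only a row's own targets are its neighbours, whereas with undirected links every link also counts in the reverse direction. That reading, the function name neighbour_sets, and the 0-based row numbers are assumptions for illustration; the step may resolve direction differently.

```python
def neighbour_sets(targets, directed=False):
    """Return the sorted neighbour row numbers for each row.

    targets[i] lists the rows that row i links to (assumed 0-based).
    """
    neighbours = [set(row_targets) for row_targets in targets]
    if not directed:
        # Undirected: a link A->B also makes A a neighbour of B.
        for source, row_targets in enumerate(targets):
            for target in row_targets:
                neighbours[target].add(source)
    return [sorted(n) for n in neighbours]

# Row 0 links to row 1, row 1 links to row 2, row 2 links to nothing.
targets = [[1], [2], []]

directed_neighbours = neighbour_sets(targets, directed=True)
undirected_neighbours = neighbour_sets(targets, directed=False)
```

With these links, the directed reading gives row 2 no neighbours at all, while the undirected reading gives it row 1 as a neighbour and gives row 1 both rows 0 and 2.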
