Works like the generic aggregate step, but with a predefined set of aggregation functions. See the ds_out argument below for the columns generated in the resulting dataset.

Usage

The following shows how the step can be used in a recipe.

Examples

  • Signature
General syntax for using the step in a recipe. It shows the inputs the step expects to receive and the outputs it will produce. For further details, see the sections below.
aggregate_tweets_by_author(ds_in: dataset, {
    "param": value,
    ...
}) -> (ds_out: dataset)

Inputs & Outputs

The following are the inputs expected by the step and the outputs it produces. These are generally columns (ds.first_name), datasets (ds or ds[["first_name", "last_name"]]) or models (referenced by name e.g. "churn-clf").
ds_in
dataset
required
A dataset where each row is a tweet.
ds_out
dataset
required
Result of the aggregation, where each row is a Twitter account. For each author it includes up to the following columns, depending on the information present in the original dataset:
  • author_id: Official Twitter ID
  • tweet_count: Number of tweets by this author
  • handler: Official Twitter handle
  • name: User name
  • pic: Link to user’s profile picture
  • links: A list of links mentioned by the user
  • dates: A list of dates of published tweets by this author
  • tweet_ids: The official Twitter IDs of the tweets published by the author
  • retweets: The number of retweets received
  • favorites: The number of favorites received
  • mention_ids: List of other accounts (IDs) the author has mentioned
  • mention_names: List of other accounts (names) the author has mentioned
  • rp_user_ids: List of other accounts (IDs) the author has replied to
  • rp_user_names: List of other accounts (names) the author has replied to
  • mentions: The count of mentions received
  • replies: The count of replies received
  • tweet_text: The text of the author’s tweets, concatenated.
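To make the shape of ds_out more concrete, the following is a minimal Python sketch of a per-author aggregation. It is not the step's actual implementation: the sample rows, the function name aggregate_by_author, the choice to sum retweets and favorites across an author's tweets, and the single-space separator for tweet_text are all assumptions made for illustration.

```python
# Hypothetical sample rows; input column names follow the expected tweet columns.
tweets = [
    {"id": "t1", "author_id": "a1", "author_handler": "ada",
     "text": "hello", "retweets": 2, "favorites": 5},
    {"id": "t2", "author_id": "a1", "author_handler": "ada",
     "text": "world", "retweets": 1, "favorites": 0},
    {"id": "t3", "author_id": "b2", "author_handler": "bob",
     "text": "hi", "retweets": 0, "favorites": 3},
]


def aggregate_by_author(rows):
    """Group tweet rows by author_id and build one summary row per author."""
    authors = {}
    for row in rows:
        # Create the author's summary row the first time the author is seen.
        acc = authors.setdefault(row["author_id"], {
            "author_id": row["author_id"],
            "handler": row.get("author_handler"),
            "tweet_count": 0,
            "tweet_ids": [],
            "retweets": 0,
            "favorites": 0,
            "tweet_text": [],
        })
        acc["tweet_count"] += 1
        acc["tweet_ids"].append(row["id"])
        # Assumption: per-author totals are sums over the author's tweets.
        acc["retweets"] += row.get("retweets", 0)
        acc["favorites"] += row.get("favorites", 0)
        acc["tweet_text"].append(row.get("text", ""))
    for acc in authors.values():
        # Assumption: texts are concatenated with a single space.
        acc["tweet_text"] = " ".join(acc["tweet_text"])
    return list(authors.values())


authors = aggregate_by_author(tweets)
```

Each element of authors corresponds to one row of ds_out, restricted here to a handful of the columns listed above.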

Configuration

The following parameters can be used to configure the behaviour of the step by including them in a JSON object as the last “input” to the step, i.e. step(..., {"param": "value", ...}) -> (output).

Parameters

add_referenced_accounts
boolean
default:"true"
Whether to add rows for accounts that are only “mentioned” in the original tweets. If mentions or replies are recorded in the dataset (in columns mention_ids, mention_names and/or rp_user_id, rp_user_name), the step will add the corresponding accounts as rows in the result, even if they have no tweets in the original dataset. It will also add mentions and replies columns recording how many times each account was mentioned or replied to.
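The effect of add_referenced_accounts can be sketched in Python as follows. This is an illustration under stated assumptions rather than the step's real logic: the function name referenced_account_rows and the sample rows are hypothetical, and the sketch assumes that the mentions and replies columns are only produced when the option is enabled.

```python
from collections import Counter

# Hypothetical sample rows: each tweet lists the account IDs it mentions
# and, optionally, the account ID it replies to.
tweets = [
    {"author_id": "a1", "mention_ids": ["b2", "c3"], "rp_user_id": None},
    {"author_id": "b2", "mention_ids": ["c3"], "rp_user_id": "a1"},
    {"author_id": "a1", "mention_ids": [], "rp_user_id": "c3"},
]


def referenced_account_rows(rows, add_referenced_accounts=True):
    """Return one row per account, optionally including accounts that were
    only mentioned or replied to and never authored a tweet."""
    authors = sorted({row["author_id"] for row in rows})
    if not add_referenced_accounts:
        # Assumption: without the option, only authors appear in the result.
        return [{"author_id": a} for a in authors]

    mentions, replies = Counter(), Counter()
    for row in rows:
        mentions.update(row.get("mention_ids") or [])
        if row.get("rp_user_id"):
            replies[row["rp_user_id"]] += 1

    # Accounts that were referenced but have no tweet of their own.
    referenced = sorted((set(mentions) | set(replies)) - set(authors))
    return [
        {"author_id": a, "mentions": mentions[a], "replies": replies[a]}
        for a in authors + referenced
    ]


rows = referenced_account_rows(tweets)
```

In this sample, account c3 never tweets but is mentioned twice and replied to once, so it appears as an extra row when the option is enabled.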
column_map
object
Column Map. If the names of any of your dataset’s columns don’t correspond to those we expect to find in a tweet dataset (e.g. one originating in Twitter’s own API), you can provide a mapping of the sort {"your_column": "author_id"}. The expected column names are [author_id, author_handler, author_name, author_avatar, links, date, id, retweets, favorites, mention_ids, mention_names, rp_user_id, rp_user_name, text].
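As a rough illustration of what a column map does, the Python sketch below renames a row's columns to the expected names before any aggregation would take place. The source column names (user_id, screen_name, body) and the helper apply_column_map are hypothetical, and the sketch assumes that unmapped columns are left untouched.

```python
# Hypothetical mapping from a dataset's own column names to the expected ones.
column_map = {
    "user_id": "author_id",
    "screen_name": "author_handler",
    "body": "text",
}


def apply_column_map(row, mapping):
    """Return a copy of row with mapped columns renamed; others are unchanged."""
    return {mapping.get(name, name): value for name, value in row.items()}


row = {"user_id": "a1", "screen_name": "ada", "body": "hello", "retweets": 2}
renamed = apply_column_map(row, column_map)
```

In a recipe, the same mapping would be passed to the step as {"column_map": {"user_id": "author_id", "screen_name": "author_handler", "body": "text"}}.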