Aggregate tweets by author

group by

Group a dataset of tweets by author and calculate relevant author statistics.

Works like the generic aggregate step, but with a predefined set of aggregation functions. See the ds_out output below for the columns generated in the resulting dataset.

Usage


The following are the step's expected inputs and outputs and their specific types.

Step signature
aggregate_tweets_by_author(ds_in: dataset, {
    "param": value
}) -> (ds_out: dataset)

where the object {"param": value} is optional in most cases and if present may contain any of the parameters described in the corresponding section below.

Inputs


ds_in: dataset

A dataset where each row is a tweet.

Outputs


ds_out: dataset

Result of the aggregation, where each row is a Twitter account. For each author it includes up to the following columns, depending on the information present in the original dataset:

  • author_id: Official Twitter ID
  • tweet_count: Number of tweets by this author
  • handler: Official Twitter handle
  • name: User name
  • pic: Link to user's profile picture
  • links: A list of links mentioned by the user
  • dates: A list of dates of published tweets by this author
  • tweet_ids: The official Twitter IDs of the tweets published by the author
  • retweets: The number of retweets received
  • favorites: The number of favorites received
  • mention_ids: List of other accounts (IDs) the author has mentioned
  • mention_names: List of other accounts (names) the author has mentioned
  • rp_user_ids: List of other accounts (IDs) the author has replied to
  • rp_user_names: List of other accounts (names) the author has replied to
  • mentions: The count of mentions received
  • replies: The count of replies received
  • tweet_text: The text of the author's tweets, concatenated.
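To make the shape of ds_out more concrete, the following is a minimal sketch in plain Python of how a few of these columns could be derived from tweet rows. It is not the step's actual implementation: the function name aggregate_by_author is hypothetical, retweets and favorites are assumed to be summed per author, and tweet_text is assumed to be joined with a single space (the separator is not specified here). Input column names follow the expected tweet columns (author_id, id, date, retweets, favorites, text).

```python
def aggregate_by_author(tweets):
    """Group tweet rows (dicts) by author_id and compute some per-author columns.

    Illustrative only: summing the counts and joining text with a space
    are assumptions, not documented behaviour of the step.
    """
    authors = {}
    for tweet in tweets:
        row = authors.setdefault(tweet["author_id"], {
            "author_id": tweet["author_id"],
            "tweet_count": 0,
            "tweet_ids": [],
            "dates": [],
            "retweets": 0,
            "favorites": 0,
            "tweet_text": [],
        })
        row["tweet_count"] += 1
        row["tweet_ids"].append(tweet["id"])
        row["dates"].append(tweet["date"])
        # Treat missing or empty counts as zero.
        row["retweets"] += tweet.get("retweets") or 0
        row["favorites"] += tweet.get("favorites") or 0
        row["tweet_text"].append(tweet.get("text") or "")

    # Concatenate each author's texts once all tweets have been seen.
    for row in authors.values():
        row["tweet_text"] = " ".join(row["tweet_text"])
    return list(authors.values())


tweets = [
    {"id": "t1", "author_id": "a1", "date": "2021-01-01",
     "retweets": 2, "favorites": 5, "text": "hello"},
    {"id": "t2", "author_id": "a1", "date": "2021-01-02",
     "retweets": 1, "favorites": 0, "text": "world"},
    {"id": "t3", "author_id": "a2", "date": "2021-01-02",
     "retweets": 0, "favorites": 3, "text": "hi"},
]
authors = aggregate_by_author(tweets)
```

In this sketch the three tweets collapse into two rows, one per author_id, with list-valued columns (tweet_ids, dates) preserving the order in which tweets were encountered.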

Parameters


add_referenced_accounts: boolean = True

Whether to add rows for accounts that are only referenced (mentioned or replied to) in the original tweets. If mentions or replies are recorded in the dataset (in columns mention_ids, mention_names and/or rp_user_id, rp_user_name), the step adds the corresponding accounts as rows in the result, even if they didn't author any tweet in the original dataset.

It also adds the mentions and replies columns, recording how many times each account was mentioned or replied to.
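The effect of this parameter can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the step's real code: the function names count_references and add_referenced_accounts are hypothetical, mention_ids is assumed to hold a list of account IDs per tweet, rp_user_id a single ID or nothing, and accounts that are only referenced are assumed to get a tweet_count of 0.

```python
from collections import Counter


def count_references(tweets):
    """Count how often each account ID is mentioned or replied to."""
    mentions, replies = Counter(), Counter()
    for tweet in tweets:
        mentions.update(tweet.get("mention_ids") or [])
        if tweet.get("rp_user_id") is not None:
            replies[tweet["rp_user_id"]] += 1
    return mentions, replies


def add_referenced_accounts(author_rows, tweets):
    """Add rows for accounts that were only referenced, plus received counts."""
    mentions, replies = count_references(tweets)
    rows = {row["author_id"]: dict(row) for row in author_rows}

    # Accounts that never tweeted in the dataset still get a row (assumed
    # here to carry tweet_count 0).
    for account_id in sorted(set(mentions) | set(replies)):
        rows.setdefault(account_id, {"author_id": account_id, "tweet_count": 0})

    # Every row records how many mentions and replies it received.
    for account_id, row in rows.items():
        row["mentions"] = mentions[account_id]
        row["replies"] = replies[account_id]
    return list(rows.values())


tweets = [
    {"author_id": "1", "mention_ids": ["2", "3"], "rp_user_id": None},
    {"author_id": "2", "mention_ids": ["3"], "rp_user_id": "1"},
]
author_rows = [
    {"author_id": "1", "tweet_count": 1},
    {"author_id": "2", "tweet_count": 1},
]
result = add_referenced_accounts(author_rows, tweets)
```

Here account "3" never authored a tweet but is mentioned twice, so it appears as an extra row in the result.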


column_map: object

Column Map. If the names of any of your dataset's columns don't correspond to those we expect to find in a tweet dataset (e.g. one originating from Twitter's own API), you can provide a mapping of the form {"your_column": "author_id"}.

The expected column names are [author_id, author_handler, author_name, author_avatar, links, date, id, retweets, favorites, mention_ids, mention_names, rp_user_id, rp_user_name, text].
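Conceptually, a column map behaves like a rename applied before aggregation. The sketch below shows that idea in plain Python; the helper name apply_column_map and the sample column names user and body are hypothetical, and rejecting unknown target names is an assumption about sensible behaviour rather than something documented here.

```python
EXPECTED_COLUMNS = [
    "author_id", "author_handler", "author_name", "author_avatar", "links",
    "date", "id", "retweets", "favorites", "mention_ids", "mention_names",
    "rp_user_id", "rp_user_name", "text",
]


def apply_column_map(row, column_map):
    """Rename the keys of one tweet row according to column_map.

    column_map has the form {"your_column": "expected_name"}.
    Columns not listed in the map keep their original name.
    """
    unknown = [name for name in column_map.values() if name not in EXPECTED_COLUMNS]
    if unknown:
        raise ValueError(f"Not expected column names: {unknown}")
    return {column_map.get(name, name): value for name, value in row.items()}


# A row whose columns use non-standard names (hypothetical example data).
raw_row = {"user": "42", "body": "hello", "date": "2021-01-01"}
column_map = {"user": "author_id", "body": "text"}
mapped_row = apply_column_map(raw_row, column_map)
```

After mapping, the row exposes author_id and text under the expected names, while date is left untouched because it already matches.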

Items in column_map