Steps

Browse all available data transformation steps by category in the left menu or using the sortable table below.

Steps marked "Fast" can be executed directly and immediately in your browser, and usually complete quickly. Other steps, typically those requiring more complex processing, more memory, or a specific execution environment, are executed in our cloud and may take somewhat longer.

Name Fast Description
add_noise Add noise to a column with numbers or lists of numbers.
aggregate Group and aggregate a dataset using any of a number of predefined functions.
aggregate_list_items Group a dataset by elements in a column of lists and aggregate remaining columns using one or more pre…
aggregate_neighbours For each node in a network, group and aggregate over its neighbours.
aggregate_tweets_by_author Group a dataset of tweets by author and calculate relevant author statistics.
append_rows Add rows from one dataset to another.
association_rules Calculate association rules for items/products in a dataset of transactions.
calculate Evaluates a formula containing basic arithmetic over a dataset's columns.
caption_images Predict image captions using pretrained DL models
cast Fast Interprets and changes a column's data to another (semantic) type.
classify_text Classify texts using any model from the Hugging Face hub
clean_categories Clean a given column of categories or lists of categories using OpenAI
cluster_dataset Identify clusters in the dataset.
cluster_embeddings Identify clusters using the distance between provided embeddings.
cluster_network Fast Identify clusters in the network
cluster_subnetwork Fast Identify clusters in the network by filtering the input dataset
concatenate Fast Concatenate columns as text or lists with optional separator as well as pre- and postfix.
configure_category_colors Configures the color of the categories of a categorical or text column.
configure_category_labels Configures the labels generated for each category.
configure_category_order Configures the order of categories in a categorical or list of categories column.
configure_color_palette Configures the base global color palette to use when coloring categorical columns.
configure_column_metadata Configures the label and/or description of a column.
configure_column_view_modes Configures the order of categories in a categorical or list of categories column.
configure_column_visibility Configures the visibility of a column in different Graphext sections.
configure_columns_order Configures the order of columns (filters) in the Graph and Details sections.
configure_dataset_metadata Configures the info_source and/or description of a dataset.
configure_detail_view Select the preferred columns to customize a row detail view.
configure_discarded_categories Configures a minimum number of rows in a category below which the category will be hidden from the var…
configure_graph_layout Configures the x & y columns used to map node positions in the graph
configure_graph_regions Configures the column that is displayed as the label of the graph region
configure_node_color Configures the column that is used for coloring the nodes by default
configure_node_connections Configures how the connections between the nodes are visualized.
configure_node_picture Configures the pictures associated with the nodes of the network
configure_node_size Configures the column, the minimum and the maximum that are used for sizing the nodes by default.
configure_node_title Configures the column that is displayed as the title of the node
configure_node_url Configures the URLs associated with the nodes of the network.
configure_rows_order Configures the order of the rows in the Data section.
configure_sections Configures pinned Graphext sections.
configure_tagged_columns Tag the provided column(s) with the specified tag.
count_unique Fast Counts the number of unique elements in each list/array of the input column.
create_compare_insight Create a new insight from the Compare section.
create_correlations_insight Create a new insight from the Correlations section.
create_filter_insight Create a new insight from a selection of nodes.
create_graph_insight Create a new insight from the Graph section.
create_plot_insight Create a new insight from the Plot section.
create_project Prepare project using the final dataset.
create_text_insight Create a new insight using only plain text.
derive_column Fast Derive a new column with a custom JS script
discretize_on_quantiles Discretize column by binning its values using specified [quantiles](https://en.wikipedia.org/wiki/Quan…
discretize_on_values Discretize column by binning its values using explicitly specified cut points.
divide Fast Divide two or more numeric columns in given order.
embed_dataset Reduce the dataset to an n-dimensional numeric vector embedding.
embed_images Embed images using pretrained DL models
embed_items Trains an item2vec model on provided lists of items (or sentences of words, etc.).
embed_sessions Trains an item2vec model on provided lists of items.
embed_text Parse and calculate a (word-averaged) embedding vector for each text.
embed_text_with_model Use language models to calculate an embedding for each text in the provided column.
equal Fast Check the row-wise equality of all input columns.
explode Explode (extract) items from column(s) of lists into separate rows.
export_to_amazonredshift Export data to Amazon Redshift
export_to_amazons3 Export data to an AmazonS3 bucket
export_to_azureblob Export data to an Azure Storage Blob
export_to_azuresql Export data to Azure SQL
export_to_bigquery Export data to a BigQuery Table
export_to_databricks Export data to Databricks
export_to_gdrive Export data to a Google Drive file
export_to_gsheet Export data to a Google Sheets sheet
export_to_notion Export data to Notion
export_to_snowflake Export data to Snowflake
export_to_sql Export a given dataset to a specified SQL database.
export_to_tinybird Export data to Tinybird
extract_date_component Fast Extract a component such as day, week, weekday etc. from a date column.
extract_emoji Parse texts and extract their emoji.
extract_entities Parse texts and extract the entities mentioned (persons, organizations etc.).
extract_hashtags Parse texts and extract any hashtags mentioned.
extract_json_values Fast Extract values from JSON columns using JsonPath.
extract_keywords Parse and extract keywords from texts.
extract_mentions Parse texts and extract any mentions detected.
extract_ngrams Parse texts and extract their n-grams.
extract_node_betweenness Calculate network node betweenness
extract_node_closeness Calculate network node closeness
extract_node_degree Calculate network node degrees.
extract_node_pagerank Calculate network node pagerank
extract_range Fast Create a copy of a column nullifying values outside a specified range.
extract_regex Fast Extract parts of texts detected using regular expressions.
extract_text_features Parse and process texts to extract multiple features at once.
extract_url_components Extract components from a URL.
fetch_demographics_es Fetch Spanish demographic census data given a geographical location in each row.
fetch_full_contact_domains Enrich a dataset containing links (URLs) to companies' online presence using the FullContact service.
fetch_full_contact_emails Enrich a dataset containing email addresses with personal information using the FullContact service.
fetch_google_places Fetch information about the most relevant places surrounding a location.
fetch_google_vision Analyze images given their URL using the Google Vision API.
fetch_location Extract formatted address, locality, area, state, country and geographical coordinates from one or mor…
fetch_openreview Fetch publications submitted to one or more conferences via OpenReview
fetch_social_shares Fetch the number of times a URL was shared on Facebook.
fetch_twitter Enriches a dataset containing tweets with information about their authors
fetch_url_content Fetch the main text from a web URL, and return its title, author, content, excerpt and domain.
filter_containing Filter rows containing any or all of a number of specified values.
filter_duplicate_nodes Remove duplicate nodes in a network
filter_duplicates Filter duplicate rows, keeping only the first or last of each set of duplicates found.
filter_missing Filter rows based on missing values in one or more columns.
filter_range Filter rows based on the numeric values in a given column.
filter_row_numbers Filter rows by row number
filter_rows Filter rows using Graphext's advanced query syntax (similar to Elasticsearch).
filter_sample Randomly sample the dataset, optionally within groups (can be used to balance a dataset).
filter_topn Sort a dataset by selected columns and pick the first N rows (or exclude them).
filter_values Filter rows where column matches specified values exactly.
filter_with_formula Filter rows using a (pandas-compatible) formula.
infer_gender Try to infer a person's gender given a first name.
infer_language Detect the language used for each text in the input column.
infer_missing Train and use a machine learning model to predict (impute) the missing values in a column
infer_missing_with_probs Train and use a machine learning model to predict (impute) the missing values in a column.
infer_sentiment Parse text and calculate the overall positive or negative sentiment polarity.
is_missing Fast Check for missing values in a given column.
join Join two datasets on their row indexes or on values in specified columns.
label_bios Categorize people into fields of occupation using their bios (biographies)
label_categories Fast Relabel categories based on the top terms in each category
label_encode Encode categories with values between 0 and N-1, where N is the number of unique categories.
label_holidays Indicate if there are any holidays for given date, location pairs.
label_political_subtopics Categorize the political sub-topics of texts in Spanish
label_political_topics Categorize the political topics of texts in Spanish
label_texts_containing Categorize texts containing specific keywords with custom labels.
label_texts_containing_from_q… Label texts given an elastic-like query string
layout_coordinates Fast Create x, y positions for nodes from their geographical coordinates.
layout_dataset Reduce the dataset to 2 dimensions that can be mapped to x/y node positions
layout_igraph Calculate layout, i.e. node positions, for a network.
layout_network Fast Compute a force-directed graph layout with a fast forceAtlas2 implementation.
layout_treemap Place nodes on the screen using a treemap layout.
length Fast Calculates the length of lists (number of elements) or texts/categories (number of characters)
link_embeddings Create network links between rows/nodes calculating the similarity of embeddings (vectors).
link_grouped_embeddings Create network links calculating the similarity of embeddings (vectors) within groups
link_rows Create network links using explicit lists of target IDs, weights and other link attributes.
link_rows_by_id Create network links using one or more lists of target IDs.
link_rows_by_rownum Create network links using explicit lists of target row numbers and optional weights.
link_sequence_items Create network links between consecutive pairs in a column of sequences
link_session_items Link items (e.g. products) in sessions (baskets) if one item makes the presence of the other in the sa…
link_similar_columns Calculates all pair-wise column dependencies (by default mutual information)
link_similar_rows Create network links calculating similarity between multidimensional and multitype documents.
make_constant Fast Creates a new constant column (with a single unique value) of the same length as the input column.
math_func Applies a mathematical function to the values of a (single) numeric column.
merge_similar_semantics Group categories with similar meanings.
merge_similar_spellings Group categories with similar spellings.
multiply Fast Multiply two or more numeric columns.
normalize Fast Normalizes a numerical column by subtracting the mean and dividing by its standard deviation.
order_categories Fast (Re-)order the categories of a categorical column
pandas_func Applies an arbitrary pandas supported function to the values of an input column.
pct_change Calculate percentage change between consecutive numbers in a numeric column.
predict_classification Use a pretrained classification model to predict new categorical data
predict_clustering Use a pretrained clustering model to predict new data
predict_dimensionality_reduct… Use a pretrained model to predict embeddings
predict_regression Use a pretrained model to predict new numerical data
prompt_ai Call OpenAI's models on each row of the dataset for a given prompt.
replace_missing Fast Replace missing values (NaNs) with either a specified constant value or the result of a given function.
replace_regex Fast Replace parts of text detected with a regular expression
replace_values Fast Replace specified values in a column with new ones
scale Fast Scales the values of a numerical column to lie between a specified minimum and maximum.
segment_rows Fast Create a segmentation using Graphext's advanced query syntax (similar to Elasticsearch).
slice Fast Extract a range/slice of elements from a column of texts or lists
split_string Fast Split a single column containing texts into two.
subtract Fast Subtract two or more numeric columns.
sum Fast Calculate the row-wise sum of numeric columns.
test_classification Evaluate a pretrained classification model on custom test data
test_classification_gpu Evaluate a pretrained classification model on custom test data
test_regression Evaluate a pretrained regression model on custom test data
time_interval Calculates the duration of a time interval between two dates (datetimes/timestamps).
tokenize Parse texts and separate them into lists of tokens (words, lemmas, etc.)
train_classification Train and store a classification model to be loaded at a later point for prediction.
train_classification_gpu Train and store a classification model to be loaded at a later point for prediction.
train_clustering Train and store a machine learning model to be loaded at a later point for prediction.
train_dimensionality_reduction Train and store a machine learning model to be loaded at a later point for prediction.
train_regression Train and store a regression model to be loaded at a later point for prediction.
trim_frequencies Remove values whose frequencies (counts) are above/below a given threshold.
unique Fast Extracts the unique elements in each list/array.
unpack_list Unpack (extract) items from a column of lists into separate columns.
upsample Upsample a dataset given a weight column.
vectorize_dataset Create a vectorized (numeric) dataset, (optionally) of reduced dimensionality
zeroshot_classify_text Classify texts using custom labels/categories
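
To make a few of the one-line descriptions above more concrete, here is a minimal Python sketch of the arithmetic that the normalize, scale and label_encode descriptions imply. It is not Graphext's implementation and does not use Graphext's step syntax. The function names simply mirror the step names, and several details are assumptions not stated in the table: the use of the population standard deviation, the default target range of 0 to 1, the handling of constant columns, and the assignment of codes in sorted category order.

```python
from statistics import fmean, pstdev


def normalize(values):
    """Subtract the mean and divide by the standard deviation (z-score)."""
    mean = fmean(values)
    std = pstdev(values)  # assumption: population standard deviation
    if std == 0:
        # assumption: a constant column maps to all zeros
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


def scale(values, new_min=0.0, new_max=1.0):
    """Rescale values linearly so they lie between new_min and new_max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # assumption: a constant column maps to new_min
        return [new_min for _ in values]
    span = new_max - new_min
    return [new_min + (v - lo) * span / (hi - lo) for v in values]


def label_encode(categories):
    """Encode categories as integers between 0 and N-1."""
    # assumption: codes follow the sorted order of the unique categories
    codes = {c: i for i, c in enumerate(sorted(set(categories)))}
    return [codes[c] for c in categories]


if __name__ == "__main__":
    print(normalize([1, 2, 3]))
    print(scale([2, 4, 6]))
    print(label_encode(["b", "a", "b", "c"]))
```

The actual steps may differ in how they treat missing values, ties, or code ordering, so treat this only as an illustration of the underlying calculations.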