Steps
Browse all available data transformation steps by category in the left menu or using the sortable table below.
Steps marked as "Fast" can be executed directly and immediately in your browser. Other steps, normally those requiring more complex processing, more memory or a specific execution environment, are executed in our cloud and may take somewhat longer.
Name | Fast | Description |
---|---|---|
add_noise | | Add noise to a column with numbers or lists of numbers. |
aggregate | | Group and aggregate a dataset using any of a number of predefined functions. |
aggregate_list_items | | Group a dataset by elements in a column of lists and aggregate remaining columns using one or more pre… |
aggregate_neighbours | | For each node in a network, group and aggregate over its neighbours. |
aggregate_tweets_by_author | | Group a dataset of tweets by author and calculate relevant author statistics. |
append_rows | | Add rows from one dataset to another. |
association_rules | | Calculate association rules for items/products in a dataset of transactions. |
calculate | | Evaluates a formula containing basic arithmetic over a dataset's columns. |
caption_images | | Predict image captions using pretrained DL models. |
cast | | Interprets and changes a column's data to another (semantic) type. |
classify_text | | Classify texts using any model from the Hugging Face Hub. |
cluster_dataset | | Identify clusters in the dataset. |
cluster_embeddings | | Identify clusters using the distance between provided embeddings. |
cluster_network | Fast | Identify clusters in the network. |
concatenate | Fast | Concatenate columns as text or lists with optional separator as well as pre- and postfix. |
configure_category_colors | | Configures the color of the categories of a categorical or text column. |
configure_category_labels | | Configures the labels generated for each category. |
configure_category_order | | Configures the order of categories in a categorical or list of categories column. |
configure_color_palette | | Configures the base global color palette to use when coloring categorical columns. |
configure_column_metadata | | Configures the label and/or description of a column. |
configure_column_view_modes | | Configures the order of categories in a categorical or list of categories column. |
configure_column_visibility | | Configures the visibility of a column in different Graphext sections. |
configure_columns_order | | Configures the order of columns (filters) in the Graph and Details sections. |
configure_dataset_metadata | | Configures the info_source and/or description of a dataset. |
configure_discarded_categories | | Configures a minimum number of rows in a category below which the category will be hidden from the var… |
configure_graph_layout | | Configures the x & y columns used to map node positions in the graph. |
configure_graph_regions | | Configures the column that is displayed as the label of the graph region. |
configure_node_color | | Configures the column that is used for coloring the nodes by default. |
configure_node_connections | | Configures how the connections between the nodes are visualized. |
configure_node_picture | | Configures the pictures associated with the nodes of the network. |
configure_node_size | | Configures the column, the minimum and the maximum that are used for sizing the nodes by default. |
configure_node_title | | Configures the column that is displayed as the title of the node. |
configure_node_url | | Configures the URLs associated with the nodes of the network. |
configure_rows_order | | Configures the order of the rows in the Data section. |
configure_sections | | Configures pinned Graphext sections. |
count_unique | Fast | Counts the number of unique elements in each list/array of the input column. |
create_compare_insight | | Create a new insight from the Compare section. |
create_correlations_insight | | Create a new insight from the Correlations section. |
create_filter_insight | | Create a new insight from a selection of nodes. |
create_graph_insight | | Create a new insight from the Graph section. |
create_plot_insight | | Create a new insight from the Plot section. |
create_project | | Prepare a project using the final dataset. |
create_text_insight | | Create a new insight using only plain text. |
derive_column | Fast | Derive a new column with a custom JS script. |
discretize_on_quantiles | | Discretize column by binning its values using specified [quantiles](https://en.wikipedia.org/wiki/Quan… |
discretize_on_values | | Discretize column by binning its values using explicitly specified cut points. |
divide | Fast | Divide two or more numeric columns in given order. |
embed_dataset | | Reduce the dataset to an n-dimensional numeric vector embedding. |
embed_images | | Embed images using pretrained DL models. |
embed_items | | Trains an item2vec model on provided lists of items (or sentences of words, etc.). |
embed_sessions | | Trains an item2vec model on provided lists of items. |
embed_text | | Parse and calculate a (word-averaged) embedding vector for each text. |
embed_text_with_model | | Use language models to calculate an embedding for each text in the provided column. |
equal | Fast | Check the row-wise equality of all input columns. |
explode | | Explode (extract) items from column(s) of lists into separate rows. |
export_to_amazonredshift | | Export data to Amazon Redshift. |
export_to_amazons3 | | Export data to an Amazon S3 bucket. |
export_to_azureblob | | Export data to an Azure Storage Blob. |
export_to_azuresql | | Export data to Azure SQL. |
export_to_bigquery | | Export data to a BigQuery table. |
export_to_databricks | | Export data to Databricks. |
export_to_gdrive | | Export data to a Google Drive file. |
export_to_gsheet | | Export data to a Google Sheets sheet. |
export_to_notion | | Export data to Notion. |
export_to_snowflake | | Export data to Snowflake. |
export_to_sql | | Export a given dataset to a specified SQL database. |
export_to_tinybird | | Export data to Tinybird. |
extract_date_component | Fast | Extract a component such as day, week, weekday etc. from a date column. |
extract_emoji | | Parse texts and extract their emoji. |
extract_entities | | Parse texts and extract the entities mentioned (persons, organizations etc.). |
extract_hashtags | | Parse texts and extract any hashtags mentioned. |
extract_json_values | Fast | Extract values from JSON columns using JsonPath. |
extract_keywords | | Parse and extract keywords from texts. |
extract_mentions | | Parse texts and extract any mentions detected. |
extract_ngrams | | Parse texts and extract their n-grams. |
extract_node_betweenness | | Calculate network node betweenness. |
extract_node_closeness | | Calculate network node closeness. |
extract_node_degree | | Calculate network node degrees. |
extract_node_pagerank | | Calculate network node PageRank. |
extract_range | Fast | Create a copy of a column nullifying values outside a specified range. |
extract_regex | Fast | Extract parts of texts detected using regular expressions. |
extract_text_features | | Parse and process texts to extract multiple features at once. |
extract_url_components | | Extract components from a URL. |
fetch_demographics_es | | Fetch Spanish demographic census data given a geographical location in each row. |
fetch_full_contact_domains | | Enrich a dataset containing links (URLs) to companies' online presence using the FullContact service. |
fetch_full_contact_emails | | Enrich a dataset containing email addresses with personal information using the FullContact service. |
fetch_google_nlp_classify_text | | Detect topics in text content using Google Cloud's text classification API. |
fetch_google_nlp_entities_sen… | | Analyze sentiment about entities in a text using Google Cloud's NLP endpoint. |
fetch_google_nlp_text_sentiment | | Analyze the overall sentiment of texts using Google Cloud's NLP endpoint. |
fetch_google_places | | Fetch information about the most relevant places surrounding a location. |
fetch_google_vision | | Analyze images given their URL using the Google Vision API. |
fetch_location | | Extract formatted address, locality, area, state, country and geographical coordinates from one or mor… |
fetch_meaningCloud_sentence_s… | | Analyze sentence and entity sentiments in a text with MeaningCloud. |
fetch_meaningCloud_text_senti… | | Analyze sentiment of a text and its entities with MeaningCloud. |
fetch_openreview | | Fetch publications submitted to one or more conferences via OpenReview. |
fetch_social_shares | | Fetch the number of times a URL was shared on Facebook. |
fetch_twitter | | Enriches a dataset containing tweets with information about their authors. |
fetch_url_content | | Fetch the main text from a web URL, and return its title, author, content, excerpt and domain. |
fetch_weather_daily | | Fetch daily weather data for given times and locations. |
fetch_weather_hourly | | Fetch hourly weather data for given times and locations. |
filter_containing | | Filter rows containing any or all of a number of specified values. |
filter_duplicate_nodes | | Remove duplicate nodes in a network. |
filter_duplicates | | Filter duplicate rows, keeping only the first or last of each set of duplicates found. |
filter_missing | | Filter rows based on missing values in one or more columns. |
filter_range | | Filter rows based on the numeric values in a given column. |
filter_row_numbers | | Filter rows by row number. |
filter_sample | | Randomly sample the dataset, optionally within groups (can be used to balance a dataset). |
filter_topn | | Sort a dataset by selected columns and pick the first N rows (or exclude them). |
filter_values | | Filter rows where a column matches specified values exactly. |
filter_with_formula | | Filter rows using a (pandas-compatible) formula. |
infer_gender | | Try to infer a person's gender given a first name. |
infer_language | | Detect the language used for each text in the input column. |
infer_missing | | Train and use a machine learning model to predict (impute) the missing values in a column. |
infer_missing_with_probs | | Train and use a machine learning model to predict (impute) the missing values in a column. |
infer_sentiment | | Parse text and calculate the overall positive or negative sentiment polarity. |
join | | Join two datasets on their row indexes or on values in specified columns. |
label_bios | | Categorize people into fields of occupation using their bios (biographies). |
label_categories | Fast | Relabel categories based on the top terms in each category. |
label_encode | | Encode categories with values between 0 and N-1, where N is the number of unique categories. |
label_holidays | | Indicate if there are any holidays for given (date, location) pairs. |
label_political_subtopics | | Categorize the political sub-topics of texts in Spanish. |
label_political_topics | | Categorize the political topics of texts in Spanish. |
label_texts_containing | | Categorize texts containing specific keywords with custom labels. |
label_texts_containing_from_q… | | Label texts given a query of the form "word1 ; word2 OR word3". |
layout_coordinates | Fast | Create x, y positions for nodes from their geographical coordinates. |
layout_dataset | | Reduce the dataset to 2 dimensions that can be mapped to x/y node positions. |
layout_igraph | | Calculate layout, i.e. node positions, for a network. |
layout_network | Fast | Compute a force-directed graph layout with a fast forceAtlas2 implementation. |
layout_treemap | | Place nodes on the screen using a treemap layout. |
length | Fast | Calculates the length of lists (number of elements) or texts/categories (number of characters). |
link_embeddings | | Create network links between rows/nodes by calculating the similarity of embeddings (vectors). |
link_grouped_embeddings | | Create network links by calculating the similarity of embeddings (vectors) within groups. |
link_rows | | Create network links using explicit lists of target IDs, weights and other link attributes. |
link_rows_by_id | | Create network links using one or more lists of target IDs. |
link_rows_by_rownum | | Create network links using explicit lists of target row numbers and optional weights. |
link_sequence_items | | Create network links between consecutive pairs in a column of sequences. |
link_session_items | | Link items (e.g. products) in sessions (baskets) if one item makes the presence of the other in the sam… |
link_similar_columns | | Calculates all pair-wise column dependencies (by default mutual information). |
link_similar_rows | | Create network links by calculating similarity between multidimensional and multitype documents. |
make_constant | Fast | Creates a new constant column (with a single unique value) of the same length as the input column. |
math_func | | Applies a mathematical function to the values of a (single) numeric column. |
merge_similar_semantics | | Group categories with similar meanings. |
merge_similar_spellings | | Group categories with similar spellings. |
multiply | Fast | Multiply two or more numeric columns. |
normalize | Fast | Normalizes a numerical column by subtracting the mean and dividing by its standard deviation. |
order_categories | Fast | (Re-)order the categories of a categorical column. |
pandas_func | | Applies an arbitrary pandas-supported function to the values of an input column. |
pct_change | | Calculate percentage change between consecutive numbers in a numeric column. |
predict_classification | | Use a pretrained classification model to predict new categorical data. |
predict_clustering | | Use a pretrained clustering model to predict new data. |
predict_dimensionality_reduct… | | Use a pretrained model to predict embeddings. |
predict_regression | | Use a pretrained model to predict new numerical data. |
replace_missing | Fast | Replace missing values (NaNs) with either a specified constant value or the result of a given function. |
replace_regex | Fast | Replace parts of text detected with a regular expression. |
replace_values | Fast | Replace specified values in a column with new ones. |
scale | Fast | Scales the values of a numerical column to lie between a specified minimum and maximum. |
segment_rows | Fast | Create a segmentation using Graphext's advanced query syntax (similar to Elasticsearch). |
slice | Fast | Extract a range/slice of elements from a column of texts or lists. |
split_string | Fast | Split a single column containing texts into two. |
subtract | Fast | Subtract two or more numeric columns. |
sum | Fast | Calculate the row-wise sum of numeric columns. |
test_classification | | Evaluate a pretrained classification model on custom test data. |
test_regression | | Evaluate a pretrained regression model on custom test data. |
time_interval | | Calculates the duration of a time interval between two dates (datetimes/timestamps). |
tokenize | | Parse texts and separate them into lists of tokens (words, lemmas, etc.). |
train_classification | | Train and store a classification model to be loaded at a later point for prediction. |
train_clustering | | Train and store a machine learning model to be loaded at a later point for prediction. |
train_dimensionality_reduction | | Train and store a machine learning model to be loaded at a later point for prediction. |
train_regression | | Train and store a regression model to be loaded at a later point for prediction. |
trim_frequencies | | Remove values whose frequencies (counts) are above/below a given threshold. |
unique | Fast | Extracts the unique elements in each list/array. |
unpack_list | | Unpack (extract) items from a column of lists into separate columns. |
upsample | | Upsample a dataset given a weight column. |
vectorize_dataset | | Create a vectorized (numeric) dataset, optionally of reduced dimensionality. |
zeroshot_classify_text | | Classify texts using custom labels/categories. |
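To give a rough sense of the kind of transformation these steps perform, the sketch below mimics the behaviour of two of the "Fast" steps listed above, `replace_missing` and `normalize`, using pandas. This is a conceptual analogue only, not Graphext's own recipe syntax or implementation, and the example data and column names are made up for illustration.

```python
import pandas as pd

# Hypothetical example data; in Graphext this would be a column of your dataset.
df = pd.DataFrame({"age": [23, 31, None, 54, 47]})

# replace_missing: fill missing values (NaNs) with a constant or the result of a function,
# here the column mean.
df["age_filled"] = df["age"].fillna(df["age"].mean())

# normalize: subtract the mean and divide by the standard deviation (z-score).
df["age_normalized"] = (df["age_filled"] - df["age_filled"].mean()) / df["age_filled"].std()

print(df)
```

See each step's individual page for its exact parameters and behaviour.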