embed_dataset
Reduce the dataset to an n-dimensional numeric vector embedding.
Works like vectorize_dataset
, but instead of converting the input dataset to a new dataset of
N numeric columns, it creates a single column in the original dataset containing vectors (lists) of N components. In other words,
the result of vectorize_dataset
is converted to a column of embeddings, where each embedding is a numerical representation
of the corresponding row in the original dataset.
Many machine learning and AI algorithms expect their input data to be in pure numerical form, i.e. not containing categorical variables, missing values etc. This step converts arbitrary datasets, potentially containing non-numerical variables and NaNs, into this expected form. It does this by defining for each possible type of input column a transformation from non-numeric to numeric values. As an example, ordered categorical variables (ordinals) such as the day of week, may be converted into a series of numbers (0..7). Non-ordered categorical variables of low-cardinality (containing few different categories) may be expanded into multiple new columns of 0s and 1s, indicating whether each row belongs to a specific category or not. Similar transformations are applied to dates, multivalued categoricals etc.
NaNs are imputed (replaced) with an appropriate value from the corresponding column (e.g. the median in a quantitative column). In addition, a new component of 0s and 1s is added, indicating whether the original column had a missing value or not.
The resulting embeddings will almost certainly not contain the same number of components as the original dataset’s columns (as the example of categorical variables shows).
The n_components
parameter controls how many components the embeddings should have, and if this is smaller than would result
normally, a dimensionality reduction will be applied (UMAP by default).
The resulting numerical representation of the original data points aims to preserve the structure of similarities. I.e. if two original rows are similar to each other, than their (potentially reduced) numerical representations should also be similar. Equally, two very different rows should have representations that are also very different.
Was this page helpful?