link_embeddings
Create network links between rows/nodes by calculating the similarity of their embeddings (vectors).
Uses Spotify’s Annoy to perform approximate nearest neighbour search.
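To make the idea concrete, the sketch below links each row to its most similar rows using exact, brute-force cosine similarity in plain Python. It is an illustration only: the step itself relies on Annoy's approximate search rather than this exhaustive comparison, and the names `cosine_similarity` and `knn_links` are hypothetical helpers, not part of the step.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def knn_links(embeddings, k):
    """Return (source, target, similarity) links to each row's k most similar rows."""
    links = []
    for i, u in enumerate(embeddings):
        # Score every other row against row i (exact search, O(n^2) overall).
        scored = [
            (cosine_similarity(u, v), j)
            for j, v in enumerate(embeddings)
            if j != i
        ]
        # Keep the k highest-scoring neighbours.
        scored.sort(reverse=True)
        links.extend((i, j, sim) for sim, j in scored[:k])
    return links


# Hypothetical toy data: two pairs of nearly parallel vectors.
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
links = knn_links(embeddings, k=1)
```

With this toy data each row ends up linked to the other member of its pair. An approximate index trades a small amount of accuracy for much faster lookups on large datasets.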
Usage
The following examples show how the step can be used in a recipe.
To link similar embeddings with default configuration
Inputs & Outputs
The following are the inputs expected by the step and the outputs it produces. These are generally columns (ds.first_name), datasets (ds or ds[["first_name", "last_name"]]) or models (referenced by name e.g. "churn-clf").
Configuration
The following parameters can be used to configure the behaviour of the step by including them in a JSON object as the last “input” to the step, i.e. step(..., {"param": "value", ...}) -> (output).
Number of nearest neighbours to connect to.
Values must be in the following range:
Minimum similarity for connecting two nodes.
Values must be in the following range:
Minimum similarity for connecting two nodes, expressed as a quantile of the similarity distribution.
Values must be in the following range:
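A quantile threshold adapts to the data: rather than a fixed cut-off, it keeps only links whose similarity is at or above a given quantile of all candidate similarities. The sketch below shows one way this could work. The helper names `quantile` and `filter_by_quantile` and the choice of linear interpolation are assumptions made for illustration, not the step's confirmed behaviour.

```python
def quantile(values, q):
    """Quantile of `values` for q in [0, 1], using linear interpolation."""
    ordered = sorted(values)
    position = q * (len(ordered) - 1)
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])


def filter_by_quantile(links, q):
    """Keep (source, target, similarity) links at or above the q-th quantile."""
    threshold = quantile([sim for _, _, sim in links], q)
    return [link for link in links if link[2] >= threshold]


# Hypothetical candidate links as (source, target, similarity) tuples.
links = [(0, 1, 0.95), (0, 2, 0.40), (1, 2, 0.70), (2, 3, 0.10), (1, 3, 0.55)]
strong_links = filter_by_quantile(links, q=0.5)
```

Here the median similarity is 0.55, so the three links at or above that value are kept and the two weakest are dropped.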
Number of trees. Affects the build time and the index size. A larger value gives more accurate results, but the index takes longer to build and is larger.
Accuracy multiplier. A larger value gives more accurate results, but queries take longer to return.
Metric to use; only angular is supported for now. Annoy’s angular metric is equivalent to sqrt(2*(1-cos(u,v))), whose maximum is sqrt(2*2) = 2. That is, the distance between (1,0) and (-1,0), at maximum angular separation, is exactly 2. Note that, for the weights of the resulting network links, Annoy’s distances are converted to similarities in the interval [0,1].
Values must be one of the following:
angular
euclidean
manhattan
hamming
dot
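The angular metric can be checked by hand. The sketch below computes sqrt(2*(1-cos(u,v))) in plain Python and then rescales the distance to a similarity in [0,1]. The exact conversion used by the step is not stated here, so the linear rescaling in `distance_to_similarity` is an assumption, and both function names are hypothetical.

```python
import math


def angular_distance(u, v):
    """Annoy-style angular distance: sqrt(2 * (1 - cos(u, v)))."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    cosine = dot / (norm_u * norm_v)
    # Clamp to guard against tiny negative values from floating-point rounding.
    return math.sqrt(max(0.0, 2.0 * (1.0 - cosine)))


def distance_to_similarity(distance):
    """Assumed linear rescaling of a distance in [0, 2] to a similarity in [0, 1]."""
    return 1.0 - distance / 2.0


opposite = angular_distance([1.0, 0.0], [-1.0, 0.0])   # maximum separation
identical = angular_distance([1.0, 0.0], [2.0, 0.0])   # same direction
orthogonal = angular_distance([1.0, 0.0], [0.0, 1.0])  # right angle
```

Opposite vectors give the maximum distance of 2 (similarity 0 under the assumed rescaling), while vectors pointing in the same direction give distance 0 (similarity 1), regardless of their lengths.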
Used to seed the random number generator, creating deterministic results.