link_embeddings
Create network links between rows/nodes by calculating the similarity of their embeddings (vectors).
Uses Spotify’s Annoy to perform approximate nearest neighbour search.
A categorical column containing embeddings (numerical vectors/lists). Usually the result of a previously executed embed_[entity] step.
A column containing, for each row, a list of IDs (row numbers) identifying the other rows it will be linked to.
A column containing, for each row, a list of weights indicating the “importance” of each link to the rows identified in the targets column.
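To make the shape of the two output columns concrete, here is a minimal sketch in plain Python. It uses an exact brute-force search rather than Annoy, and the function names (`link_rows`, `angular_similarity`) as well as the linear mapping from angular distance to a weight in [0,1] are assumptions made for illustration, not the step's actual implementation.

```python
import math

def angular_similarity(u, v):
    """Similarity in [0, 1] derived from the angular distance sqrt(2*(1-cos)).

    The linear rescaling 1 - d/2 is an assumed conversion for illustration.
    """
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    cos = max(-1.0, min(1.0, dot / (norm_u * norm_v)))  # guard against rounding
    distance = math.sqrt(2.0 * (1.0 - cos))
    return 1.0 - distance / 2.0

def link_rows(embeddings, n_nearest=2):
    """Return one list of target row numbers and one list of weights per row."""
    targets, weights = [], []
    for i, u in enumerate(embeddings):
        # Score every other row, then keep the n_nearest most similar ones.
        scored = [(angular_similarity(u, v), j)
                  for j, v in enumerate(embeddings) if j != i]
        scored.sort(key=lambda pair: -pair[0])
        top = scored[:n_nearest]
        targets.append([j for _, j in top])
        weights.append([s for s, _ in top])
    return targets, weights

embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
targets, weights = link_rows(embeddings, n_nearest=2)
```

Each row ends up with two parallel lists of equal length: `targets[i]` holds the row numbers that row `i` links to, and `weights[i]` holds the matching link weights.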
Number of nearest neighbours to connect to.
Values must be in the following range:
1 ≤ n_nearest < inf
Minimum similarity for connecting two nodes.
Values must be in the following range:
0 ≤ similarity_min ≤ 1
Minimum similarity for connecting two nodes, expressed as a quantile of the similarity distribution.
Values must be in the following range:
0 ≤ similarity_min_q ≤ 1
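The difference between an absolute threshold and a quantile threshold can be sketched as follows. The helper names (`quantile`, `filter_links`), the linear interpolation used for the quantile, and the choice to apply the stricter of the two cutoffs when both are given are all assumptions for illustration; the source does not state how the step combines them.

```python
def quantile(values, q):
    """Linearly interpolated quantile of a non-empty list, with 0 <= q <= 1."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)

def filter_links(links, similarity_min=0.0, similarity_min_q=0.0):
    """Keep (source, target, similarity) links at or above the cutoff.

    Assumption: the effective cutoff is the larger of the absolute
    threshold and the quantile of the observed similarities.
    """
    similarities = [s for _, _, s in links]
    cutoff = max(similarity_min, quantile(similarities, similarity_min_q))
    return [link for link in links if link[2] >= cutoff]

links = [(0, 1, 0.95), (0, 2, 0.50), (1, 2, 0.55), (2, 3, 0.20), (3, 1, 0.10)]

# Keep only links at or above the median similarity of this link set.
kept_by_quantile = filter_links(links, similarity_min_q=0.5)

# Keep only links with an absolute similarity of at least 0.9.
kept_by_value = filter_links(links, similarity_min=0.9)
```

An absolute threshold keeps the same cutoff whatever the data looks like, whereas a quantile adapts to the similarity distribution of the dataset at hand.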
Number of trees. Affects the build time and the index size. A larger value will give more accurate results, but the index will be larger and take longer to build.
Accuracy multiplier. A larger value will give more accurate results, but queries will take longer to return.
Metric to use. Only angular is supported for now. Annoy’s angular metric is equivalent to sqrt(2*(1-cos(u,v))), whose maximum is sqrt(2*2) = 2. For example, the distance between (1,0) and (-1,0), at maximum angular separation, is exactly 2. Note that for the weights of the resulting network links, Annoy’s distances are converted to similarities in the interval [0,1].
Values must be one of the following:
angular
euclidean
manhattan
hamming
dot
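The angular distance formula quoted above can be checked with a few lines of plain Python. The formula itself comes from the description; the function `distance_to_similarity` and its linear rescaling of [0,2] onto [1,0] are an assumed conversion for illustration, since the exact mapping used for link weights is not specified here.

```python
import math

def angular_distance(u, v):
    """Angular distance as described above: sqrt(2 * (1 - cos(u, v)))."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    cos = max(-1.0, min(1.0, dot / (norm_u * norm_v)))  # guard against rounding
    return math.sqrt(2.0 * (1.0 - cos))

def distance_to_similarity(distance):
    """Assumed conversion: distance 0 maps to 1, the maximum distance 2 maps to 0."""
    return 1.0 - distance / 2.0

opposite = angular_distance((1.0, 0.0), (-1.0, 0.0))   # maximum angular separation
identical = angular_distance((1.0, 0.0), (1.0, 0.0))   # same direction
orthogonal = angular_distance((1.0, 0.0), (0.0, 1.0))  # right angle, sqrt(2)
```

Under this assumed conversion, opposite vectors get a similarity of 0 and identical directions get 1, which keeps every weight inside [0,1].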
Used to seed the random number generator, creating deterministic results.