Link embeddings¶
network • similarity
Create network links between rows/nodes by calculating the similarity of their embeddings (vectors).
Uses Spotify's Annoy to perform approximate nearest neighbour search.
Usage¶
The following are the step's expected inputs and outputs and their specific types.
link_embeddings(embedding: list[number], {"param": value}) -> (targets: column, weights: column)
where the object {"param": value} is optional in most cases and, if present, may contain any of the parameters described in the corresponding section below.
Example¶
To link similar embeddings with the default configuration:
link_embeddings(ds.embedding) -> (ds.targets, ds.weights)
More examples
To use a similarity cutoff below which embeddings won't be connected:
link_embeddings(ds.embedding, {"similarity_min": 0.7}) -> (ds.targets, ds.weights)
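As a rough illustration of what n_nearest and similarity_min control, the following Python sketch links each row to its most similar rows using an exact brute-force search. This is not how the step is implemented: the step uses Annoy's approximate search and converts Annoy distances to weights, whereas this sketch uses raw cosine similarity as the weight. The function names and the zero-based row IDs are assumptions made for the example.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def link_bruteforce(embeddings, n_nearest=15, similarity_min=0.0):
    """Hypothetical stand-in for the step: exact search, cosine weights."""
    targets, weights = [], []
    for i, u in enumerate(embeddings):
        # Score every other row against row i, most similar first.
        scored = [
            (cosine_similarity(u, v), j)
            for j, v in enumerate(embeddings)
            if j != i
        ]
        scored.sort(reverse=True)
        # Keep at most n_nearest neighbours, then apply the cutoff.
        kept = [(s, j) for s, j in scored[:n_nearest] if s >= similarity_min]
        targets.append([j for _, j in kept])
        weights.append([s for s, _ in kept])
    return targets, weights


embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
targets, weights = link_bruteforce(embeddings, n_nearest=2, similarity_min=0.7)
print(targets)
```

With the cutoff at 0.7, the first two rows link to each other and the third row, which is nearly orthogonal to both, ends up with no links.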
Inputs¶
embedding: column:list[number]
A categorical column containing embeddings (numerical vectors/lists). Usually the output of a previously executed embed_[entity] step.
Outputs¶
targets: column
A column containing, for each row, a list of IDs (row numbers) identifying the other rows it will be linked to.
weights: column
A column containing, for each row, a list of weights indicating the "importance" of each link to the rows listed in the targets column.
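To make the shape of the two output columns concrete, here is a minimal Python sketch that flattens them into a plain edge list. It assumes the columns are available as lists of lists, that the two lists in each row are aligned position by position, and that row numbers are zero-based; none of this is confirmed by the step's documentation, and to_edge_list is a hypothetical helper.

```python
def to_edge_list(targets, weights):
    """Flatten per-row target/weight lists into (source, target, weight) tuples."""
    edges = []
    for source, (row_targets, row_weights) in enumerate(zip(targets, weights)):
        # The i-th weight in a row belongs to the i-th target in that row.
        for target, weight in zip(row_targets, row_weights):
            edges.append((source, target, weight))
    return edges


# Illustrative data: row 0 links to rows 1 and 2, row 1 links back to row 0,
# and row 2 has no links.
targets = [[1, 2], [0], []]
weights = [[0.9, 0.4], [0.9], []]
edges = to_edge_list(targets, weights)
print(edges)
```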
Parameters¶
Also see Spotify Annoy's page for details on parameter use.
n_nearest: integer = 15
Number of nearest neighbours to connect to.
Range: 1 ≤ n_nearest < inf
similarity_min: number = 0
Minimum similarity for connecting two nodes.
Range: 0 ≤ similarity_min ≤ 1
similarity_min_q: number = 0
Minimum similarity for connecting two nodes, expressed as a quantile of the similarity distribution.
Range: 0 ≤ similarity_min_q ≤ 1
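The quantile form of the cutoff can be pictured as first turning a quantile into an absolute similarity threshold and then filtering with it. The sketch below assumes the quantile is taken over the similarities of all candidate links, uses linear interpolation between order statistics, and keeps links at or above the threshold. These details are assumptions for illustration, and quantile_threshold is a hypothetical helper, not part of the step.

```python
def quantile_threshold(similarities, q):
    """Similarity value at quantile q (0..1), using linear interpolation."""
    if not similarities:
        raise ValueError("need at least one similarity value")
    if not 0 <= q <= 1:
        raise ValueError("q must lie in [0, 1]")
    ordered = sorted(similarities)
    position = q * (len(ordered) - 1)
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])


# Illustrative similarities of candidate links.
sims = [0.1, 0.2, 0.4, 0.6, 0.8]
cutoff = quantile_threshold(sims, 0.5)  # the median of the candidates
kept = [s for s in sims if s >= cutoff]
print(cutoff, kept)
```

Under these assumptions a value of 0 keeps every candidate link, while values close to 1 keep only the strongest ones, regardless of the absolute scale of the similarities.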
n_trees: integer = 30
Number of trees. Affects the build time and the index size. A larger value gives more accurate results, but the index takes longer to build and is larger.
search_k_mult: integer = 2
Accuracy multiplier. A larger value gives more accurate results, but queries take longer to return.
metric: string = "angular"
Metric to use; only angular is supported for now. Annoy's angular metric is equivalent to sqrt(2(1-cos(u,v))), whose maximum is sqrt(2*2) = 2. That is, the distance between (1,0) and (-1,0), at maximum angular separation, should be exactly 2. Note that for the weights of the resulting network links, Annoy's distances are converted to similarities in the interval [0,1].
Must be one of: "angular", "euclidean", "manhattan", "hamming", "dot"
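The angular distance formula given for the metric parameter can be checked with a few lines of Python. The distance function follows the formula sqrt(2(1-cos(u,v))) stated above. The conversion to a similarity in [0,1] is only an assumed linear mapping (1 - d/2); the documentation does not say which conversion the step actually applies, so distance_to_similarity is a hypothetical helper.

```python
import math


def angular_distance(u, v):
    """Angular distance sqrt(2 * (1 - cos(u, v))) between non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Clamp to [-1, 1] to guard against floating-point rounding.
    cos_uv = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.sqrt(2.0 * (1.0 - cos_uv))


def distance_to_similarity(distance):
    """Assumed mapping from a distance in [0, 2] to a similarity in [0, 1]."""
    return 1.0 - distance / 2.0


opposite = angular_distance((1, 0), (-1, 0))    # maximum angular separation
identical = angular_distance((1, 0), (1, 0))    # same direction
orthogonal = angular_distance((1, 0), (0, 1))   # right angle
print(opposite, identical, orthogonal)
```

Opposite vectors give the maximum distance of 2, identical directions give 0, and orthogonal vectors give sqrt(2). Under the assumed mapping these correspond to similarities of 0, 1, and roughly 0.29.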
seed: number
Used to seed the random number generator, creating deterministic results.