Link embeddings

network · similarity

Creates network links between rows (nodes) by calculating the similarity of their embeddings (vectors).

Uses Spotify's Annoy to perform approximate nearest neighbour search.

Example

To link similar embeddings with default configuration

link_embeddings(ds.embedding) -> (links)

More examples

To set a similarity cutoff below which embeddings won't be connected

link_embeddings(ds.embedding, {"similarity_min": 0.7}) -> (links)

Usage

The following are the step's expected inputs and outputs and their specific types.

link_embeddings(embedding: list[number], {"param": value}) -> (links: dataset)

where the object {"param": value} is optional in most cases and if present may contain any of the parameters described in the corresponding section below.

Inputs


embedding: column:list[number]

A categorical column containing embeddings (numerical vectors/lists). Usually the output of a previously executed embed_[entity] step.

Outputs


links: dataset

A new dataset containing links (source, target and weight columns) connecting rows with similar embeddings.
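To make the shape of the output concrete, here is a minimal Python sketch that builds (source, target, weight) rows from a small list of embeddings. It is an illustration only: it uses an exact brute-force search rather than Annoy's approximate one, it uses raw cosine similarity as the weight, and the function name brute_force_links is hypothetical. Whether the actual step also emits both directions of a mutual link is not stated here.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def brute_force_links(embeddings, n_nearest=15, similarity_min=0.0):
    """Hypothetical helper: link each row to its n_nearest most similar rows.

    Assumption: weights are plain cosine similarities; the real step
    converts Annoy distances to similarities in some other way.
    """
    links = []
    for source, u in enumerate(embeddings):
        # Score every other row against the current one.
        scored = [
            (cosine_similarity(u, v), target)
            for target, v in enumerate(embeddings)
            if target != source
        ]
        scored.sort(reverse=True)  # most similar first
        for weight, target in scored[:n_nearest]:
            if weight >= similarity_min:
                links.append((source, target, weight))
    return links

embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
links = brute_force_links(embeddings, n_nearest=1, similarity_min=0.7)
for source, target, weight in links:
    print(source, target, round(weight, 4))
```

In this toy input the third row has no neighbour above the cutoff, so it ends up without any link.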

Parameters


Also see Spotify Annoy's page for details on parameter use.


n_nearest: integer = 15

Number of nearest neighbours to connect to.

Range: 1 ≤ n_nearest < inf


similarity_min: number = 0

Minimum similarity for connecting two nodes.

Range: 0 ≤ similarity_min ≤ 1


similarity_min_q: number = 0

Minimum similarity for connecting two nodes, expressed as a quantile of the similarity distribution.

Range: 0 ≤ similarity_min_q ≤ 1
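As a rough illustration of how a quantile-based cutoff differs from an absolute one, the Python sketch below turns a quantile q into a similarity threshold and filters link weights with it. The interpolation method and the choice to keep weights at or above the threshold are assumptions, and quantile_threshold is a hypothetical helper, not part of the step.

```python
def quantile_threshold(similarities, q):
    """Linearly interpolated q-quantile of a list of similarities (0 <= q <= 1)."""
    ordered = sorted(similarities)
    position = q * (len(ordered) - 1)
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])

# Assumed candidate link weights, already expressed as similarities in [0, 1].
weights = [0.2, 0.4, 0.6, 0.8, 1.0]

# q = 0.5 means: keep only links at or above the median similarity.
cutoff = quantile_threshold(weights, 0.5)
kept = [w for w in weights if w >= cutoff]
print(cutoff, kept)
```

A quantile cutoff adapts to the data: the same q keeps roughly the same share of candidate links whether the similarities are mostly high or mostly low.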


n_trees: integer = 30

Number of trees. Affects the build time and the index size. A larger value gives more accurate results, but produces a larger index that takes longer to build.


search_k_mult: integer = 2

Accuracy multiplier. A larger value gives more accurate results, but searches take longer to return.


metric: string = "angular"

Metric to use; only angular is supported for now. Annoy's angular metric is equivalent to sqrt(2(1-cos(u,v))), whose maximum is sqrt(2*2) = 2. That is, the distance between (1,0) and (-1,0), at maximum angular separation, is exactly 2. Note that for the weights of the resulting network links, Annoy's distances are converted to similarities in the interval [0,1].

Must be one of: "angular", "euclidean", "manhattan", "hamming", "dot"
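The angular distance formula above can be checked with a few lines of Python. The distance function follows the formula as given; the conversion to a similarity is only one plausible mapping onto [0,1] (a linear rescaling, 1 - d/2), since the exact conversion used by the step is not documented here. Both function names are hypothetical.

```python
import math

def angular_distance(u, v):
    """sqrt(2 * (1 - cos(u, v))) for two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    cos = dot / (norm_u * norm_v)
    cos = max(-1.0, min(1.0, cos))  # guard against rounding outside [-1, 1]
    return math.sqrt(2.0 * (1.0 - cos))

def distance_to_similarity(distance):
    """Assumed mapping from [0, 2] to [1, 0]; the step may use another."""
    return 1.0 - distance / 2.0

opposite = angular_distance((1.0, 0.0), (-1.0, 0.0))    # maximum separation
orthogonal = angular_distance((1.0, 0.0), (0.0, 1.0))   # 90 degrees apart
identical = angular_distance((1.0, 0.0), (1.0, 0.0))    # same direction

for d in (opposite, orthogonal, identical):
    print(round(d, 4), round(distance_to_similarity(d), 4))
```

Under this assumed mapping, vectors pointing in opposite directions get similarity 0 and identical directions get similarity 1.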


seed: number

Used to seed the random number generator, creating deterministic results.