Link grouped embeddings

network · similarity

Create network links calculating the similarity of embeddings (vectors) within groups.

Links are created only between rows whose embeddings belong to the same group.

E.g., text embeddings calculated for different languages are not necessarily compatible (even if they have the same dimension). Use this step if embeddings in different groups cannot be compared directly.
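The grouping behaviour can be illustrated with a minimal brute-force sketch. It is not the step's implementation: the step builds an approximate nearest-neighbour index, whereas this sketch compares every pair within a group. The function names `cosine` and `link_within_groups` are hypothetical, and raw cosine similarity is used as a stand-in for the step's link weight.

```python
import math
from collections import defaultdict
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def link_within_groups(embeddings, groups, similarity_min=0.0):
    """Return (source, target, weight) links between rows of the same group.

    Hypothetical helper: rows in different groups are never compared.
    """
    rows_by_group = defaultdict(list)
    for row, group in enumerate(groups):
        rows_by_group[group].append(row)

    links = []
    for rows in rows_by_group.values():
        for i, j in combinations(rows, 2):
            weight = cosine(embeddings[i], embeddings[j])
            if weight >= similarity_min:
                links.append((i, j, weight))
    return links


# Rows 0 and 2 hold identical vectors but sit in different groups,
# so no link is created between them.
embeddings = [[1.0, 0.0], [0.9, 0.1], [1.0, 0.0], [0.0, 1.0]]
groups = ["en", "en", "es", "es"]
links = link_within_groups(embeddings, groups, similarity_min=0.7)
```

Here only rows 0 and 1 end up linked: they share a group and their similarity exceeds the threshold, while rows 2 and 3 share a group but are orthogonal.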

Example

To require a minimum similarity between embeddings before a link is created:

link_grouped_embeddings(ds.embedding, ds.group, {
  "similarity_min": 0.7
}) -> (links)

Usage

The following are the step's expected inputs and outputs and their specific types.

link_grouped_embeddings(
    embedding: list[number],
    grouping: category, 
    {
        "param": value
    }
) -> (links: dataset)

where the object {"param": value} is optional in most cases and if present may contain any of the parameters described in the corresponding section below.

Inputs


embedding: column:list[number]

A column containing embeddings (numerical vectors/lists). Usually the result of previously executing an embed_[entity] step.


grouping: column:category

A categorical column identifying the groups whose embeddings are compatible.

Outputs


links: dataset

A new dataset containing links (source, target and weight columns) connecting rows with similar embeddings.

Parameters


n_nearest: integer = 15

Number of nearest embeddings to take into account.

Range: 1 ≤ n_nearest < inf


similarity_min: number = 0

Minimum similarity for connecting two nodes (similarity ∈ [0, 1]).

Range: 0 ≤ similarity_min ≤ 1


similarity_min_q: number = 0

Minimum similarity for connecting two nodes, expressed as a quantile of the similarity distribution (similarity ∈ [0, 1]).

Range: 0 ≤ similarity_min_q ≤ 1
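As a rough illustration of a quantile-based cutoff, the sketch below picks the similarity value at quantile q and keeps only links at or above it. The helper `quantile_threshold` is hypothetical, and linear interpolation between sorted values is an assumption; the step may compute the quantile differently.

```python
def quantile_threshold(similarities, q):
    """Return the similarity at quantile q (0 <= q <= 1).

    Assumption: linear interpolation between neighbouring sorted values.
    """
    ordered = sorted(similarities)
    if not ordered:
        raise ValueError("need at least one similarity value")
    pos = q * (len(ordered) - 1)          # fractional index into the sorted list
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    frac = pos - lo
    return ordered[lo] + frac * (ordered[hi] - ordered[lo])


weights = [0.2, 0.4, 0.6, 0.8, 1.0]
cutoff = quantile_threshold(weights, 0.5)   # median of the candidate link weights
kept = [w for w in weights if w >= cutoff]  # links that would survive the cutoff
```

With q = 0 every candidate link is kept; with q = 0.5 roughly the weaker half is dropped, regardless of the absolute similarity values.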


n_trees: integer = 30

Number of trees. Affects the build time and the index size. A larger value gives more accurate results, but the index takes longer to build and is larger.


search_k_mult: integer = 2

Accuracy multiplier. A larger value gives more accurate results, but queries take longer to return.


metric: string = "angular"

Metric to use; only angular is supported for now. Annoy's angular metric is equivalent to sqrt(2(1-cos(u,v))), whose maximum is sqrt(2·2) = 2. That is, the distance between (1,0) and (-1,0), at maximum angular separation, should be exactly 2. Note that for the weights of the resulting network links, Annoy's distances are converted to similarities in the interval [0,1].

Must be one of: "angular", "euclidean", "manhattan", "hamming", "dot"
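The angular distance formula sqrt(2(1-cos(u,v))) can be checked with a short sketch. `angular_distance` follows that formula directly. `distance_to_similarity` is a hypothetical helper that assumes a linear rescaling of [0, 2] onto [1, 0]; the exact conversion the step applies is not stated here.

```python
import math


def angular_distance(u, v):
    """Angular distance sqrt(2 * (1 - cos(u, v))) for non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    cos = dot / norm
    # Clamp at zero to guard against tiny negative values from rounding.
    return math.sqrt(max(0.0, 2.0 * (1.0 - cos)))


def distance_to_similarity(distance):
    """ASSUMPTION: map distance in [0, 2] linearly to similarity in [0, 1]."""
    return 1.0 - distance / 2.0


opposite = angular_distance([1.0, 0.0], [-1.0, 0.0])  # maximum angular separation
same = angular_distance([1.0, 0.0], [1.0, 0.0])       # identical direction
```

Under this assumed rescaling, opposite vectors get similarity 0 and identical directions get similarity 1, which matches the documented [0,1] range for link weights.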


seed: number

Used to seed the random number generator, creating deterministic results.