Creates network links between rows only if their embeddings belong to the same group. For example, text embeddings calculated for different languages are not necessarily compatible (even if they have the same dimension). Use this step if embeddings in different groups cannot be compared directly.

Usage

The following example shows how the step can be used in a recipe.

Examples

To require a minimum similarity between embeddings before a link is created:
link_grouped_embeddings(ds.embedding, ds.group, {
  "similarity_min": 0.7
}) -> (ds.targets, ds.weights)
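
To make the grouping behaviour concrete, here is a minimal brute-force sketch in Python. It is not the step's implementation (the parameters further down indicate that the step uses an Annoy index); the function name link_within_groups and the conversion from distance to similarity (1 - d/2) are assumptions made for illustration only.

```python
import math


def angular_distance(u, v):
    """Angular distance as described for Annoy: sqrt(2 * (1 - cos(u, v)))."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    cos = dot / (norm_u * norm_v)
    # Guard against tiny negative values caused by floating-point rounding.
    return math.sqrt(max(0.0, 2.0 * (1.0 - cos)))


def link_within_groups(embeddings, groups, similarity_min=0.0, n_nearest=15):
    """Hypothetical helper: link each row to its most similar same-group rows."""
    targets, weights = [], []
    for i, (u, g) in enumerate(zip(embeddings, groups)):
        candidates = []
        for j, (v, h) in enumerate(zip(embeddings, groups)):
            if i == j or g != h:
                continue  # never link a row to itself or across groups
            # ASSUMPTION: distances in [0, 2] are rescaled linearly to [0, 1].
            sim = 1.0 - angular_distance(u, v) / 2.0
            if sim >= similarity_min:
                candidates.append((sim, j))
        # Most similar first; ties broken by row number.
        candidates.sort(key=lambda c: (-c[0], c[1]))
        top = candidates[:n_nearest]
        targets.append([j for _, j in top])
        weights.append([s for s, _ in top])
    return targets, weights


embeddings = [[1.0, 0.0], [0.9, 0.1], [1.0, 0.0], [0.0, 1.0]]
groups = ["en", "en", "de", "de"]
targets, weights = link_within_groups(embeddings, groups, similarity_min=0.7)
print(targets)
```

Rows 0 and 2 hold identical vectors but sit in different groups, so they are never linked. Rows 2 and 3 share a group but fall below the similarity threshold, so they end up without links as well.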

Inputs & Outputs

The following are the inputs expected by the step and the outputs it produces. These are generally columns (ds.first_name), datasets (ds or ds[["first_name", "last_name"]]) or models (referenced by name e.g. "churn-clf").
embedding
column[list[number]]
required
A column containing embeddings (numerical vectors/lists). Usually the result of a previously executed embed_[entity] step.
grouping
column[category]
required
A categorical column identifying the groups whose embeddings are compatible.
targets
column
required
A column containing, for each row, a list of IDs (row numbers) identifying the other rows it will be linked to.
weights
column
required
A column containing, for each row, a list of weights indicating the “importance” of each link to the rows listed in the targets column.

Configuration

The following parameters can be used to configure the behaviour of the step by including them in a json object as the last “input” to the step, i.e. step(..., {"param": "value", ...}) -> (output).

Parameters

n_nearest
integer
default:"15"
Number of nearest embeddings to take into account. Values must be in the following range:
1 ≤ n_nearest < inf
similarity_min
number
default:"0"
Minimum similarity for connecting two nodes (similarity ∈ [0, 1]). Values must be in the following range:
0 ≤ similarity_min ≤ 1
similarity_min_q
number
default:"0"
Minimum similarity for connecting two nodes, expressed as a quantile of the similarity distribution (similarity ∈ [0, 1]). Values must be in the following range:
0 ≤ similarity_min_q ≤ 1
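
As an illustration of a quantile-based threshold, the following Python sketch turns a quantile into an absolute similarity cut-off. How the step itself computes the quantile (over which set of similarities, and with which interpolation) is not documented here, so the function quantile_threshold and its linear interpolation are assumptions.

```python
def quantile_threshold(similarities, q):
    """Hypothetical helper: similarity value at quantile q (linear interpolation)."""
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must lie in [0, 1]")
    ordered = sorted(similarities)
    # Fractional position of the quantile within the sorted values.
    pos = q * (len(ordered) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    frac = pos - lo
    return ordered[lo] + frac * (ordered[hi] - ordered[lo])


sims = [0.2, 0.4, 0.6, 0.8, 1.0]
threshold = quantile_threshold(sims, 0.5)
# Keep only candidate links at or above the quantile-derived threshold.
kept = [s for s in sims if s >= threshold]
print(threshold, kept)
```

Under this reading, a value of 0.5 keeps roughly the more similar half of the candidate links, whatever their absolute similarities are.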
n_trees
integer
default:"30"
Number of trees. Affects the build time and the index size. A larger value gives more accurate results, but takes longer to build and produces a larger index.
search_k_mult
integer
default:"2"
Accuracy multiplier. A larger value gives more accurate results, but queries take longer to return.
metric
string
default:"angular"
Metric to use; only angular is supported for now. Annoy’s angular metric is equivalent to sqrt(2*(1-cos(u,v))), whose maximum is sqrt(2*2) = 2. That is, the distance between (1,0) and (-1,0), at maximum angular separation, is exactly 2. Note that for the weights of the resulting network links, Annoy’s distances are converted to similarities in the interval [0,1]. Values must be one of the following:
  • angular
  • euclidean
  • manhattan
  • hamming
  • dot
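
The angular distance formula above can be checked with a few lines of Python. The distance function follows the formula as stated. The conversion to a similarity in [0,1] is only described, not specified, so distance_to_similarity below is a hypothetical linear rescaling and may differ from what the step actually does.

```python
import math


def angular_distance(u, v):
    """sqrt(2 * (1 - cos(u, v))), the formula given for Annoy's angular metric."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    cos = dot / (norm_u * norm_v)
    # Guard against tiny negative values caused by floating-point rounding.
    return math.sqrt(max(0.0, 2.0 * (1.0 - cos)))


def distance_to_similarity(d):
    """ASSUMPTION: map a distance in [0, 2] linearly onto a similarity in [0, 1]."""
    return 1.0 - d / 2.0


d_opposite = angular_distance([1.0, 0.0], [-1.0, 0.0])  # maximum separation
d_same = angular_distance([1.0, 0.0], [1.0, 0.0])       # identical direction
d_orthogonal = angular_distance([1.0, 0.0], [0.0, 1.0]) # right angle

print(d_opposite, distance_to_similarity(d_opposite))
print(d_same, distance_to_similarity(d_same))
print(d_orthogonal, distance_to_similarity(d_orthogonal))
```

Opposite vectors give the maximum distance of 2 and identical directions give 0, so any monotone rescaling of this range maps them to the two ends of the similarity interval.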
seed
number
Used to seed the random number generator, creating deterministic results.