Link similar rows


Creates network links by calculating the similarity between multidimensional documents of mixed types.

Creates a link between each row and the N rows most similar to it. Broadly, the similarity between two rows is a weighted combination of the similarities of their individual columns. The step accepts all data types, e.g. text, quantitative and categorical columns.
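To make the idea of a weighted, per-column similarity concrete, here is a minimal Python sketch. It is not the step's actual implementation: the per-type similarity functions, the weights and the helper names (numeric_similarity, categorical_similarity, text_similarity, row_similarity) are all assumptions chosen for illustration.

```python
from functools import partial


def numeric_similarity(a, b, value_range):
    """1.0 when equal, falling linearly to 0.0 at a difference of value_range."""
    if value_range == 0:
        return 1.0
    return max(0.0, 1.0 - abs(a - b) / value_range)


def categorical_similarity(a, b):
    """1.0 for identical categories, 0.0 otherwise."""
    return 1.0 if a == b else 0.0


def text_similarity(a, b):
    """Jaccard overlap of the lowercase word sets of two texts."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a and not words_b:
        return 1.0
    return len(words_a & words_b) / len(words_a | words_b)


def row_similarity(row_a, row_b, columns):
    """Weighted average of per-column similarities.

    columns is a list of (column_name, similarity_function, weight) tuples.
    """
    total_weight = sum(weight for _, _, weight in columns)
    score = sum(
        weight * fn(row_a[name], row_b[name]) for name, fn, weight in columns
    )
    return score / total_weight


# Hypothetical column configuration and rows, for illustration only.
columns = [
    ("bio", text_similarity, 2.0),
    ("salary", partial(numeric_similarity, value_range=50_000), 1.0),
    ("department", categorical_similarity, 1.0),
]
row_a = {"bio": "data engineer", "salary": 50_000, "department": "IT"}
row_b = {"bio": "data scientist", "salary": 60_000, "department": "IT"}

similarity = row_similarity(row_a, row_b, columns)
```

In this sketch a row compared with itself always scores 1.0, and raising a column's weight makes disagreement in that column count for more.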


The following are the step's expected inputs and outputs and their specific types.

Step signature
link_similar_rows(ds: dataset, {
    "param": value
}) -> (targets: column, weights: column)

where the object {"param": value} is optional in most cases and if present may contain any of the parameters described in the corresponding section below.


Example call (in recipe editor)
link_similar_rows(ds[["bio", "salary", "age", "department"]]) -> (targets, weights)


ds: dataset

A dataset containing the columns to include in the calculation of pairwise similarities. Note: a subset of columns can always be selected in a recipe using the ds[["column1", "column2", ...]] syntax, or excluded using ds[!["column1", "column2", ...]].


targets: column

A column containing, for each row, a list of row numbers identifying all the other rows it is similar to.

weights: column

A column containing, for each row, a list of weights indicating the "importance" of each link to the rows listed in the targets column, i.e. how similar the two rows are.
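The shape of the two output columns can be illustrated with a small Python sketch. It assumes a precomputed pairwise similarity matrix and 0-based row numbers; both are assumptions made for illustration, and the function name top_n_links is hypothetical.

```python
def top_n_links(similarity, n_similar_docs=10):
    """Return (targets, weights): for each row, the n most similar other rows.

    similarity is a square list of lists where similarity[i][j] is the
    similarity between rows i and j. Ties are broken by lower row number.
    """
    targets, weights = [], []
    for i, row in enumerate(similarity):
        # Exclude the row itself, then sort by descending similarity.
        candidates = [(score, j) for j, score in enumerate(row) if j != i]
        candidates.sort(key=lambda pair: (-pair[0], pair[1]))
        best = candidates[:n_similar_docs]
        targets.append([j for _, j in best])
        weights.append([score for score, _ in best])
    return targets, weights


# Toy similarity matrix for four rows (made-up values).
similarity_matrix = [
    [1.0, 0.9, 0.2, 0.4],
    [0.9, 1.0, 0.1, 0.3],
    [0.2, 0.1, 1.0, 0.8],
    [0.4, 0.3, 0.8, 1.0],
]

targets, weights = top_n_links(similarity_matrix, n_similar_docs=2)
# targets -> [[1, 3], [0, 3], [3, 0], [2, 0]]
# weights -> [[0.9, 0.4], [0.9, 0.3], [0.8, 0.2], [0.8, 0.4]]
```

Each list in targets lines up position by position with the list in weights for the same row, so the k-th weight describes the link to the k-th target.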


n_similar_docs: integer = 10

Number of similar docs (rows) to link each row to.

Range: 1 ≤ n_similar_docs < inf

minhash: boolean = False

Whether to use minhash as a similarity measure.
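For readers unfamiliar with MinHash, the following Python sketch shows the general technique: it estimates the Jaccard similarity of two token sets by comparing the minimum hash values under several hash functions. How the step applies MinHash internally is not documented here, so the number of hash functions, the hashing scheme and the function names are assumptions.

```python
import hashlib


def minhash_signature(tokens, num_hashes=64):
    """Build a MinHash signature for a non-empty collection of string tokens.

    Each of the num_hashes "hash functions" is simulated by prefixing the
    token with a different seed before hashing.
    """
    signature = []
    for seed in range(num_hashes):
        smallest = min(
            int.from_bytes(
                hashlib.blake2b(f"{seed}:{token}".encode(), digest_size=8).digest(),
                "big",
            )
            for token in tokens
        )
        signature.append(smallest)
    return signature


def estimated_jaccard(signature_a, signature_b):
    """Fraction of signature positions that agree (estimates Jaccard similarity)."""
    matches = sum(1 for a, b in zip(signature_a, signature_b) if a == b)
    return matches / len(signature_a)


doc_a = {"data", "engineer", "python", "sql"}
doc_b = {"data", "scientist", "python", "statistics"}

estimate = estimated_jaccard(minhash_signature(doc_a), minhash_signature(doc_b))
```

The true Jaccard similarity of doc_a and doc_b is 2/6; the estimate approaches that value as num_hashes grows, while signatures stay a fixed size regardless of document length.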

n_terms: integer = 15

Number of terms to use.

Range: 1 ≤ n_terms < inf

min_term_freq: integer = 2

Minimum term frequency (for TF-IDF).

Range: 0 ≤ min_term_freq < inf

min_doc_freq: integer = 2

Minimum doc frequency (for TF-IDF).

Range: 0 ≤ min_doc_freq < inf

max_doc_perc: number = 0.9

Maximum doc percentage (for TF-IDF).

Range: 0 ≤ max_doc_perc ≤ 1
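The three TF-IDF thresholds above can be pictured as filters on the vocabulary. The Python sketch below is one plausible reading, not the step's confirmed behaviour: it assumes min_term_freq counts occurrences across the whole corpus, min_doc_freq counts the documents containing the term, and max_doc_perc is the largest allowed fraction of documents containing it. The function name filter_vocabulary is hypothetical.

```python
from collections import Counter


def filter_vocabulary(tokenized_docs, min_term_freq=2, min_doc_freq=2, max_doc_perc=0.9):
    """Return the set of terms that pass all three frequency thresholds."""
    term_freq = Counter()  # total occurrences of each term (assumed meaning)
    doc_freq = Counter()   # number of documents containing each term
    for tokens in tokenized_docs:
        term_freq.update(tokens)
        doc_freq.update(set(tokens))

    n_docs = len(tokenized_docs)
    return {
        term
        for term in term_freq
        if term_freq[term] >= min_term_freq
        and doc_freq[term] >= min_doc_freq
        and doc_freq[term] / n_docs <= max_doc_perc
    }


docs = [
    ["the", "data", "team"],
    ["the", "data", "pipeline"],
    ["the", "sales", "team"],
    ["the", "legal", "office"],
]

vocabulary = filter_vocabulary(docs)
# "the" appears in 100% of docs (> 0.9) and is dropped;
# "pipeline", "sales", "legal" and "office" appear only once and are dropped.
```

Under these assumptions only "data" and "team" survive: very common terms carry little signal and very rare terms cannot link many rows.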

min_should_match: integer = 2

Minimum number of terms that should match.

Range: 0 ≤ min_should_match < inf
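One way to read this parameter is as a threshold on the number of terms two rows have in common. The short Python sketch below illustrates that reading only; the exact matching rule used by the step is an assumption, and shares_enough_terms is a hypothetical helper.

```python
def shares_enough_terms(terms_a, terms_b, min_should_match=2):
    """True if the two term collections have at least min_should_match terms in common."""
    return len(set(terms_a) & set(terms_b)) >= min_should_match


terms_a = ["data", "python", "sql"]
terms_b = ["data", "python", "statistics"]
terms_c = ["sales", "data", "crm"]

match_ab = shares_enough_terms(terms_a, terms_b)  # two shared terms
match_ac = shares_enough_terms(terms_a, terms_c)  # only one shared term
```

With the default of 2, rows a and b would qualify as a match while a and c would not.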

separator: string = "[\W0-9]{1,100}"

Regex to recognize as string separator.
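To see what the default separator does, this Python sketch splits a string on the same pattern using the standard re module. The step may use a different regex engine, so treat the result as indicative; the tokenize helper is hypothetical.

```python
import re

# Default separator from the parameter above: runs of 1 to 100
# non-word characters or digits.
SEPARATOR = r"[\W0-9]{1,100}"


def tokenize(text, separator=SEPARATOR):
    """Split text on the separator pattern and drop empty pieces."""
    return [piece for piece in re.split(separator, text) if piece]


tokens = tokenize("Senior data-engineer, 10 years @ ACME!")
# tokens -> ["Senior", "data", "engineer", "years", "ACME"]
```

Note that digits are treated as separators, so numbers never become terms, and that the underscore counts as a word character and therefore does not split.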

stopwords_langs: string = "ES,EN"

Languages to use for stopwords. Supports ES, EN, or both separated by a comma ("ES,EN").

Must be one of: "ES", "EN", "ES,EN", "EN,ES"
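As a rough illustration of how a comma-separated language list might be used, the Python sketch below merges per-language stopword sets and removes them from a token list. The tiny stopword lists are placeholders, not the lists the step actually uses, and both helper names are hypothetical.

```python
# Placeholder stopword lists for illustration only (assumed, heavily truncated).
STOPWORDS = {
    "EN": {"the", "and", "of", "a"},
    "ES": {"el", "la", "y", "de"},
}


def load_stopwords(stopwords_langs="ES,EN"):
    """Merge the stopword sets of every language in a comma-separated string."""
    langs = [lang.strip().upper() for lang in stopwords_langs.split(",")]
    unknown = [lang for lang in langs if lang not in STOPWORDS]
    if unknown:
        raise ValueError(f"Unsupported stopword language(s): {unknown}")
    merged = set()
    for lang in langs:
        merged |= STOPWORDS[lang]
    return merged


def remove_stopwords(tokens, stopwords):
    """Drop tokens whose lowercase form is a stopword."""
    return [token for token in tokens if token.lower() not in stopwords]


tokens = ["The", "salary", "de", "la", "empresa"]
both = remove_stopwords(tokens, load_stopwords("ES,EN"))
english_only = remove_stopwords(tokens, load_stopwords("EN"))
```

With both languages enabled only "salary" and "empresa" remain; with "EN" alone the Spanish function words "de" and "la" are kept.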