
Link similar rows

network similarity

Create network links by calculating the similarity between multidimensional, multi-type documents.

Creates a link between each row and the N rows most similar to it. Broadly, the similarity between two rows is a weighted combination of the similarities of their individual columns. The step accepts columns of any data type, e.g. text, quantitative and categorical columns.
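The idea of a weighted, column-wise similarity can be sketched in Python as follows. This is an illustration only, not the step's actual implementation: the per-column measures (Jaccard overlap for text, scaled distance for numbers, exact match for categories), the equal column weights and the helper names column_similarity and link_rows are all assumptions made for the example.

```python
def column_similarity(a, b, kind, value_range=1.0):
    """Similarity in [0, 1] between two values of a single column (assumed measures)."""
    if kind == "categorical":
        return 1.0 if a == b else 0.0
    if kind == "quantitative":
        # Scaled absolute difference, clipped so the result stays in [0, 1].
        return 1.0 - min(abs(a - b) / value_range, 1.0)
    if kind == "text":
        # Jaccard overlap of lower-cased, whitespace-separated tokens.
        ta, tb = set(a.lower().split()), set(b.lower().split())
        union = ta | tb
        return len(ta & tb) / len(union) if union else 0.0
    raise ValueError(f"unknown column kind: {kind}")


def link_rows(rows, kinds, col_weights, ranges, n_similar=10):
    """Return (targets, weights): for each row, its n_similar most similar rows."""
    total = sum(col_weights.values())
    targets, weights = [], []
    for i, row in enumerate(rows):
        scored = []
        for j, other in enumerate(rows):
            if i == j:
                continue  # a row is never linked to itself
            sim = sum(
                col_weights[c]
                * column_similarity(row[c], other[c], kinds[c], ranges.get(c, 1.0))
                for c in kinds
            ) / total
            scored.append((sim, j))
        # Highest similarity first; ties broken by row number.
        scored.sort(key=lambda t: (-t[0], t[1]))
        top = scored[:n_similar]
        targets.append([j for _, j in top])
        weights.append([s for s, _ in top])
    return targets, weights


rows = [
    {"bio": "data engineer python", "age": 30, "department": "tech"},
    {"bio": "python data scientist", "age": 32, "department": "tech"},
    {"bio": "sales manager", "age": 50, "department": "sales"},
]
kinds = {"bio": "text", "age": "quantitative", "department": "categorical"}
col_weights = {"bio": 1.0, "age": 1.0, "department": 1.0}
targets, weights = link_rows(rows, kinds, col_weights, {"age": 20}, n_similar=1)
```

Here targets holds, per row, the row numbers of its nearest neighbours and weights the corresponding similarity scores, mirroring the two output columns described below.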

Usage


The following are the step's expected inputs and outputs and their specific types.

Step signature
link_similar_rows(ds: dataset, {
    "param": value
}) -> (targets: column, weights: column)

where the object {"param": value} is optional in most cases and, if present, may contain any of the parameters described in the Parameters section below.

Example

Example call (in recipe editor)
link_similar_rows(ds[["bio", "salary", "age", "department"]]) -> (targets, weights)

Inputs


ds: dataset

A dataset containing the columns to be included in the calculation of pair-wise similarities. Note: a subset of columns can always be selected in a recipe using the ds[["column1", "column2", ...]] syntax, or excluded using ds[!["column1", "column2", ...]].

Outputs


targets: column

A column containing, for each row, a list of row numbers identifying all other rows it is similar to.


weights: column

A column containing, for each row, a list of weights indicating the "importance" of each link to the rows listed in the targets column (i.e. how similar the rows are).

Parameters


n_similar_docs: integer = 10

Number of similar rows (documents) to link each row to.

Range: 1 ≤ n_similar_docs < inf


minhash: boolean = False

Whether to use minhash as a similarity measure.
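For readers unfamiliar with the technique, MinHash approximates the Jaccard similarity between two token sets by comparing short signatures instead of the full sets. The sketch below illustrates the general method only; how the step applies it internally is not documented here, and the helper names and the choice of 64 hash functions are assumptions.

```python
import hashlib


def _hash(token, seed):
    """Deterministic 64-bit hash of a token, varied by seed."""
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(tokens, num_hashes=64):
    """One minimum hash value per seeded hash function."""
    unique = set(tokens)
    if not unique:
        raise ValueError("cannot build a signature for an empty token set")
    return [min(_hash(t, seed) for t in unique) for seed in range(num_hashes)]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of signature positions that agree; estimates Jaccard similarity."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


sig_a = minhash_signature(["data", "engineer", "python"])
sig_b = minhash_signature(["python", "data", "engineer"])
sig_c = minhash_signature(["sales", "manager"])
same = estimated_jaccard(sig_a, sig_b)      # identical sets
different = estimated_jaccard(sig_a, sig_c)  # disjoint sets
```

The estimate becomes more accurate as the number of hash functions grows, at the cost of longer signatures.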


n_terms: integer = 15

Number of terms to use.

Range: 1 ≤ n_terms < inf


min_term_freq: integer = 2

Minimum term frequency. (For TFIDF).

Range: 0 ≤ min_term_freq < inf


min_doc_freq: integer = 2

Minimum doc frequency. (For TFIDF).

Range: 0 ≤ min_doc_freq < inf


max_doc_perc: number = 0.9

Maximum doc percentage. (For TFIDF).

Range: 0 ≤ max_doc_perc ≤ 1
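As a rough illustration of how document-frequency thresholds of this kind usually work, the Python sketch below keeps only terms that occur in at least min_doc_freq documents and in no more than a max_doc_perc fraction of all documents. It assumes the parameters behave like the analogous thresholds in common TF-IDF tooling, which this page does not confirm; min_term_freq is left out because its exact scope is not specified here, and the function name eligible_terms is hypothetical.

```python
from collections import Counter


def eligible_terms(docs, min_doc_freq=2, max_doc_perc=0.9):
    """Terms kept after filtering by document frequency (assumed semantics).

    docs is a list of token lists, one per document.
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each term once per document
    return {
        term
        for term, df in doc_freq.items()
        if df >= min_doc_freq and df / n_docs <= max_doc_perc
    }


docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "cat", "ran"],
    ["the", "bird", "flew"],
]
kept = eligible_terms(docs)
```

In this toy corpus "the" is dropped for appearing in every document, and terms seen in a single document are dropped for being too rare.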


min_should_match: integer = 2

Minimum number of terms that should match.

Range: 0 ≤ min_should_match < inf


separator: string = "[\W0-9]{1,100}"

Regular expression matching the separators on which strings are split.
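The default pattern treats any run of up to 100 non-word characters or digits as a single separator. Its effect can be previewed with Python's re module, as in the sketch below; the regex engine actually used by the step may differ from Python's in details, and the tokenize helper is hypothetical.

```python
import re

DEFAULT_SEPARATOR = r"[\W0-9]{1,100}"


def tokenize(text, separator=DEFAULT_SEPARATOR):
    """Split text on the separator pattern and drop empty fragments."""
    return [tok for tok in re.split(separator, text) if tok]


tokens = tokenize("Senior data-engineer, 10 years @ ACME2go")
```

Note that digits act as separators under the default pattern, so "ACME2go" is split into "ACME" and "go", while underscores count as word characters and do not split a token.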


stopwords_langs: string = "ES,EN"

Languages to use for stopwords. Supports ES, EN, or both separated by a comma ("ES,EN").

Must be one of: "ES", "EN", "ES,EN", "EN,ES"