Link similar rows

network · similarity

Create network links by calculating the similarity between multidimensional, multitype documents.

Creates network links by calculating the similarity between pairs of rows.

Broadly, the similarity between two rows is a weighted combination of the similarities of their individual columns. All column types are accepted, e.g. text, quantitative and categorical columns.

The step creates a link between each row and the N rows most similar to it.
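
To make the idea concrete, here is a minimal Python sketch of a weighted per-column similarity followed by top-N linking. It is not the step's actual implementation: the per-column similarity functions, the weights, the `scale` constant and the helper names (`numeric_sim`, `categorical_sim`, `row_similarity`, `link_top_n`) are all assumptions made for illustration. Only the output shape (source, target and weight) follows the description of the step's output.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]
SimFn = Callable[[object, object], float]


def numeric_sim(a: float, b: float, scale: float = 100.0) -> float:
    """Similarity in [0, 1] that decays linearly with the absolute difference.

    The scale of 100 is an arbitrary choice for this sketch.
    """
    return max(0.0, 1.0 - abs(a - b) / scale)


def categorical_sim(a: object, b: object) -> float:
    """1.0 when the two categories are equal, 0.0 otherwise."""
    return 1.0 if a == b else 0.0


def row_similarity(r1: Row, r2: Row, sims: Dict[str, SimFn],
                   weights: Dict[str, float]) -> float:
    """Weighted average of the per-column similarities."""
    total = sum(weights[c] for c in sims)
    return sum(weights[c] * sims[c](r1[c], r2[c]) for c in sims) / total


def link_top_n(rows: List[Row], sims: Dict[str, SimFn],
               weights: Dict[str, float], n: int) -> List[dict]:
    """Link every row to the n rows most similar to it (brute force, O(rows^2))."""
    links = []
    for i, row in enumerate(rows):
        scored = [
            (row_similarity(row, other, sims, weights), j)
            for j, other in enumerate(rows)
            if j != i
        ]
        # Highest similarity first; ties broken by row index for determinism.
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        for weight, j in scored[:n]:
            links.append({"source": i, "target": j, "weight": weight})
    return links


rows = [
    {"age": 30, "department": "sales"},
    {"age": 32, "department": "sales"},
    {"age": 60, "department": "legal"},
]
sims = {"age": numeric_sim, "department": categorical_sim}
weights = {"age": 1.0, "department": 1.0}

for link in link_top_n(rows, sims, weights, n=1):
    print(link)
```

With n=1 each row gets exactly one outgoing link, so the two sales employees link to each other and the legal employee links to whichever of them is closer in age.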

Example

link_similar_rows(ds[["bio", "salary", "age", "department"]]) -> (links)

Usage

The following are the step's expected inputs and outputs and their specific types.

link_similar_rows(data: dataset, {"param": value}) -> (links: dataset)

where the object {"param": value} is optional in most cases and, if present, may contain any of the parameters described in the Parameters section below.

Inputs


data: dataset

A Dataset containing the columns to be included in the calculation of pairwise similarities. Note: a subset of columns can always be selected in a recipe using the ds[["column1", "column2", ...]] syntax. To exclude columns instead, use ds[!["column1", "column2", ...]].

Outputs


links: dataset

A Dataset containing links (source, target and weight columns) between similar rows/nodes.

Parameters


n_similar_docs: integer = 10

Number of similar docs (rows) to link each row to.

Range: 1 ≤ n_similar_docs < inf


minhash: boolean = False

Whether to use minhash as a similarity measure.
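
For readers unfamiliar with the technique, the following is a minimal sketch of how MinHash approximates the Jaccard similarity between two sets of terms. How the step applies MinHash internally is not documented here, so the hashing scheme (seeded MD5), the number of hash functions and the function names are assumptions for illustration only.

```python
import hashlib
from typing import List, Set


def _hash(token: str, seed: int) -> int:
    """Deterministic 64-bit hash of a token; a different seed acts as a different hash function."""
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.md5(data).digest()[:8], "big")


def minhash_signature(tokens: Set[str], num_hashes: int = 256) -> List[int]:
    """For each hash function keep the minimum hash over the set (set must be non-empty)."""
    return [min(_hash(t, seed) for t in tokens) for seed in range(num_hashes)]


def estimate_jaccard(sig_a: List[int], sig_b: List[int]) -> float:
    """The fraction of positions where the signatures agree estimates the Jaccard similarity."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


a = {"data", "engineer", "python", "sql"}
b = {"data", "engineer", "python", "spark"}

# The exact Jaccard similarity is 3 shared terms / 5 distinct terms = 0.6.
exact = len(a & b) / len(a | b)
approx = estimate_jaccard(minhash_signature(a), minhash_signature(b))
print(exact, approx)
```

The appeal of the approach is that signatures have a fixed length, so comparing two rows costs the same regardless of how many terms each contains. The estimate becomes more accurate as the number of hash functions grows.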


n_terms: integer = 15

Number of terms to use.

Range: 1 ≤ n_terms < inf


min_term_freq: integer = 2

Minimum term frequency. (For TFIDF).

Range: 0 ≤ min_term_freq < inf


min_doc_freq: integer = 2

Minimum doc frequency. (For TFIDF).

Range: 0 ≤ min_doc_freq < inf


max_doc_perc: number = 0.9

Maximum doc percentage. (For TFIDF).

Range: 0 ≤ max_doc_perc ≤ 1
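
As a rough illustration of how document-frequency thresholds like these usually behave, the sketch below filters a vocabulary before any TF-IDF weighting. It assumes that min_doc_freq is the minimum number of documents a term must appear in and that max_doc_perc is the maximum fraction of documents it may appear in. That reading, and the helper name `filter_vocabulary`, are assumptions rather than documented behaviour of the step.

```python
from collections import Counter
from typing import List, Set


def filter_vocabulary(docs: List[List[str]], min_doc_freq: int = 2,
                      max_doc_perc: float = 0.9) -> Set[str]:
    """Keep terms that are neither too rare nor too common across documents."""
    n_docs = len(docs)
    doc_freq: Counter = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each term at most once per document
    return {
        term
        for term, df in doc_freq.items()
        if df >= min_doc_freq and df / n_docs <= max_doc_perc
    }


docs = [
    ["the", "data", "engineer"],
    ["the", "data", "scientist"],
    ["the", "sales", "manager"],
    ["the", "data"],
]

# "the" appears in 100% of documents (above 0.9) and is dropped;
# "engineer", "scientist", "sales" and "manager" appear only once and are dropped.
print(filter_vocabulary(docs))
```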


min_should_match: integer = 2

Minimum number of terms that should match.

Range: 0 ≤ min_should_match < inf
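
One plausible reading of this parameter, sketched below, is that two rows are only considered similar when they share at least min_should_match terms. This interpretation is an assumption, as is the helper name `shares_enough_terms`; the step may count matches differently.

```python
from typing import Set


def shares_enough_terms(terms_a: Set[str], terms_b: Set[str],
                        min_should_match: int = 2) -> bool:
    """True when the two term sets have at least min_should_match terms in common."""
    return len(terms_a & terms_b) >= min_should_match


a = {"data", "engineer", "python"}
b = {"data", "scientist", "python"}
c = {"sales", "manager", "data"}

print(shares_enough_terms(a, b))  # a and b share "data" and "python"
print(shares_enough_terms(a, c))  # a and c share only "data"
```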


separator: string = "[\W0-9]{1,100}"

Regex to recognize as string separator.
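
To see what the default pattern does, it can be tried out with Python's `re` module. This is only an illustration of the regular expression itself: the step may use a different regex engine, and the `tokenize` helper is hypothetical.

```python
import re
from typing import List

# The step's default separator: a run of 1 to 100 non-word characters or digits.
SEPARATOR = r"[\W0-9]{1,100}"


def tokenize(text: str, separator: str = SEPARATOR) -> List[str]:
    """Split text on the separator pattern and drop empty pieces."""
    return [piece for piece in re.split(separator, text) if piece]


print(tokenize("Senior data-engineer, 10 years at ACME!"))
print(tokenize("snake_case stays whole"))
```

With this pattern, punctuation, whitespace and digits all act as separators, so numbers never become terms. Underscores count as word characters in `\W`, which is why "snake_case" is kept as a single term.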


stopwords_langs: string = "ES,EN"

Languages to use for stopwords. Supports ES, EN, or both separated by a comma, e.g. "ES,EN".

Must be one of: "ES", "EN", "ES,EN", "EN,ES"
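
The sketch below shows how a comma-separated value such as "ES,EN" could be parsed and applied. The stopword sets are tiny made-up samples, not the lists the step actually uses, and the helper names (`parse_langs`, `remove_stopwords`) and the lowercase comparison are hypothetical.

```python
from typing import List

ALLOWED_LANGS = {"ES", "EN"}

# Tiny illustrative samples only; NOT the stopword lists used by the step.
SAMPLE_STOPWORDS = {
    "EN": {"the", "and", "of"},
    "ES": {"el", "la", "de", "y"},
}


def parse_langs(value: str) -> List[str]:
    """Split a value like "ES,EN" into language codes, rejecting unknown ones."""
    langs = value.split(",")
    unknown = [lang for lang in langs if lang not in ALLOWED_LANGS]
    if unknown:
        raise ValueError(f"Unsupported stopword language(s): {unknown}")
    return langs


def remove_stopwords(tokens: List[str], stopwords_langs: str = "ES,EN") -> List[str]:
    """Drop tokens found in the combined stopword sets of the selected languages."""
    stop = set().union(*(SAMPLE_STOPWORDS[lang] for lang in parse_langs(stopwords_langs)))
    return [t for t in tokens if t.lower() not in stop]


tokens = ["la", "casa", "of", "cards"]
print(remove_stopwords(tokens))        # both languages
print(remove_stopwords(tokens, "EN"))  # English only, so "la" is kept
```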