Creates a link between each row and the N rows most similar to it. Broadly, the similarity between two rows is a weighted combination of the similarities of their individual columns. The step accepts columns of any data type, e.g. text, quantitative or categorical.
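To make the idea of a weighted, per-column similarity concrete, here is a minimal Python sketch. It is not the step's actual implementation: the per-column similarity functions, the column weights, and the `value_range` used for the numeric column are all assumptions chosen for illustration.

```python
def numeric_similarity(a, b, value_range):
    """1.0 for identical values, falling linearly to 0.0 across the column's range."""
    if value_range == 0:
        return 1.0
    return 1.0 - abs(a - b) / value_range


def categorical_similarity(a, b):
    """1.0 if the categories match, 0.0 otherwise."""
    return 1.0 if a == b else 0.0


def text_similarity(a, b):
    """Jaccard overlap of the lowercase word sets of two texts."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a and not words_b:
        return 1.0
    return len(words_a & words_b) / len(words_a | words_b)


def row_similarity(row_a, row_b, columns):
    """Weighted average of per-column similarities.

    `columns` maps a column name to a (similarity_function, weight) pair.
    """
    total_weight = sum(weight for _, weight in columns.values())
    weighted_sum = sum(
        weight * similarity(row_a[name], row_b[name])
        for name, (similarity, weight) in columns.items()
    )
    return weighted_sum / total_weight


# Hypothetical weights and value range, chosen only for this example.
columns = {
    "bio": (text_similarity, 2.0),
    "age": (lambda a, b: numeric_similarity(a, b, value_range=40), 1.0),
    "department": (categorical_similarity, 1.0),
}
alice = {"bio": "data engineer in madrid", "age": 30, "department": "data"}
bob = {"bio": "data scientist in madrid", "age": 40, "department": "data"}
score = row_similarity(alice, bob, columns)
print(round(score, 4))
```

Here the text column counts twice as much as the other two, so rows with similar bios end up closer than rows that merely share a department.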

Usage

The following example shows how the step can be used in a recipe.

Examples

link_similar_rows(ds[["bio", "salary", "age", "department"]]) -> (links)

Inputs & Outputs

The following are the inputs expected by the step and the outputs it produces. These are generally columns (ds.first_name), datasets (ds or ds[["first_name", "last_name"]]) or models (referenced by name e.g. "churn-clf").
ds
dataset
required
A dataset containing the columns to be included in the calculation of pair-wise similarities. Note: a subset of columns can always be selected in a recipe using the ds[["column1", "column2", …]] syntax, or excluded using ds[!["column1", "column2", …]].
targets
column
required
A column containing, for each row, a list of row numbers identifying all other rows it is similar to.
weights
column
required
A column containing, for each row, a list of weights giving the “importance” of each link to the rows listed in the targets column (i.e. how similar the rows are).
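The shape of the targets and weights columns can be illustrated with a small Python sketch. It starts from a precomputed similarity matrix with made-up values. It also assumes 0-based row numbers and that a row is never linked to itself; the source does not confirm either detail.

```python
def link_top_n(similarity, n):
    """For each row, return the indices and scores of its n most similar other rows."""
    targets, weights = [], []
    for i, row in enumerate(similarity):
        # Rank all other rows by descending similarity; never link a row to itself.
        ranked = sorted(
            (j for j in range(len(row)) if j != i),
            key=lambda j: row[j],
            reverse=True,
        )[:n]
        targets.append(ranked)
        weights.append([row[j] for j in ranked])
    return targets, weights


# Illustrative similarity matrix for four rows (values are invented).
similarity = [
    [1.0, 0.8, 0.1, 0.4],
    [0.8, 1.0, 0.3, 0.2],
    [0.1, 0.3, 1.0, 0.9],
    [0.4, 0.2, 0.9, 1.0],
]
targets, weights = link_top_n(similarity, n=2)
for row_number, (t, w) in enumerate(zip(targets, weights)):
    print(row_number, t, w)
```

Each entry in `targets` lines up position by position with the entry in `weights`, so the first target of a row is always its strongest link.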

Configuration

The following parameters can be used to configure the behaviour of the step by including them in a json object as the last “input” to the step, i.e. step(..., {"param": "value", ...}) -> (output).
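As a sketch of what such a call looks like with a configuration object, the following Python helper renders a recipe line from a list of columns and a dictionary of parameters. The helper `render_step` is hypothetical and not part of any library; the parameter names are taken from the list below, and the output name `links` follows the signature shown earlier.

```python
import json


def render_step(columns, params, output="links"):
    """Render a recipe line for link_similar_rows with a trailing JSON config."""
    selection = "ds[[" + ", ".join(json.dumps(c) for c in columns) + "]]"
    # json.dumps turns Python values into JSON literals (True becomes true).
    config = json.dumps(params)
    return f"link_similar_rows({selection}, {config}) -> ({output})"


line = render_step(["bio", "salary"], {"n_similar_docs": 5, "minhash": True})
print(line)
```

Serialising the parameters with `json.dumps` keeps booleans and numbers in valid JSON form, which matters because the configuration is described as a json object.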

Parameters

n_similar_docs
integer
default:"10"
Number of similar docs. Values must be in the following range:
1 ≤ n_similar_docs < inf
minhash
boolean
default:"false"
Whether to use minhash as a similarity measure.
n_terms
integer
default:"15"
Number of terms to use. Values must be in the following range:
1 ≤ n_terms < inf
min_term_freq
integer
default:"2"
Minimum term frequency (for TFIDF). Values must be in the following range:
0 ≤ min_term_freq < inf
min_doc_freq
integer
default:"2"
Minimum doc frequency (for TFIDF). Values must be in the following range:
0 ≤ min_doc_freq < inf
max_doc_perc
number
default:"0.9"
Maximum doc percentage (for TFIDF). Values must be in the following range:
0 ≤ max_doc_perc ≤ 1
min_should_match
integer
default:"2"
Minimum number of terms that should match. Values must be in the following range:
0 ≤ min_should_match < inf
separator
string
default:"[\\W0-9]{1,100}"
Regex to recognize as string separator.
stopwords_langs
string
default:"ES,EN"
Languages to use for stopwords. Supports ES, EN, or both separated by a comma (“ES,EN”). Values must be one of the following:
  • ES
  • EN
  • ES,EN
  • EN,ES
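The text-related parameters above can be pictured with a short Python sketch: split each text on the default `separator` regex, drop stopwords, then keep only terms whose document frequency lies between `min_doc_freq` and `max_doc_perc`. This is an illustration under assumptions, not the step's implementation. The stopword set is a tiny invented sample, and treating both bounds as inclusive is a guess.

```python
import re
from collections import Counter

SEPARATOR = r"[\W0-9]{1,100}"  # the documented default for `separator`
# Tiny illustrative sample of EN and ES stopwords, not the step's actual list.
STOPWORDS = {"the", "a", "in", "and", "el", "la", "en", "y"}


def tokenize(text):
    """Split on the separator regex, lowercase, and drop empty strings and stopwords."""
    tokens = (t.lower() for t in re.split(SEPARATOR, text))
    return [t for t in tokens if t and t not in STOPWORDS]


def vocabulary(docs, min_doc_freq=2, max_doc_perc=0.9):
    """Keep terms found in at least min_doc_freq docs and at most max_doc_perc of all docs."""
    doc_freq = Counter(term for doc in docs for term in set(tokenize(doc)))
    limit = max_doc_perc * len(docs)
    return {t for t, df in doc_freq.items() if min_doc_freq <= df <= limit}


docs = [
    "Senior data engineer, 10 years in Madrid",
    "Data scientist in Madrid",
    "Data analyst and part-time teacher",
    "Sales data manager in Madrid",
]
tokens = tokenize(docs[0])
vocab = vocabulary(docs)
print(tokens)
print(vocab)
```

In this toy corpus, "data" appears in every document and exceeds the `max_doc_perc` limit, while most other terms appear only once and fall below `min_doc_freq`, so only a term of intermediate frequency survives.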