Creates a link between each row and the N rows most similar to it. Broadly, the similarity between two rows is a weighted combination of the similarities of their individual columns. The step accepts columns of any data type, e.g. text, quantitative or categorical columns.
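
As a rough illustration of what a weighted similarity over individual columns could look like, the Python sketch below combines per-column similarities into a single score. The per-column similarity functions, the weights, the column names and all helper names are assumptions made for this example; the measures and weighting the step actually uses are not specified here.

```python
def numeric_similarity(a, b, value_range):
    """Similarity in [0, 1] for two numbers, given the column's value range."""
    if value_range == 0:
        return 1.0
    return 1.0 - min(abs(a - b) / value_range, 1.0)


def categorical_similarity(a, b):
    """1.0 if the two categories are equal, else 0.0."""
    return 1.0 if a == b else 0.0


def jaccard_similarity(a, b):
    """Jaccard similarity between the sets of lowercase words in two texts."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


def row_similarity(row_a, row_b, column_sims, weights):
    """Weighted average of per-column similarities (weights are hypothetical)."""
    total = sum(weights.values())
    score = sum(
        weights[col] * sim(row_a[col], row_b[col])
        for col, sim in column_sims.items()
    )
    return score / total


# Hypothetical columns: one quantitative, one categorical, one text
column_sims = {
    "price": lambda a, b: numeric_similarity(a, b, value_range=100.0),
    "color": categorical_similarity,
    "title": jaccard_similarity,
}
weights = {"price": 1.0, "color": 1.0, "title": 2.0}

row_a = {"price": 20.0, "color": "red", "title": "red cotton shirt"}
row_b = {"price": 30.0, "color": "red", "title": "blue cotton shirt"}

score = row_similarity(row_a, row_b, column_sims, weights)
```

Linking each row to its N most similar rows would then amount to ranking the other rows by this score and keeping the top N.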

n_similar_docs
integer
default: "10"

Number of similar rows (docs) to link to each row.

Values must be in the following range:

1 ≤ n_similar_docs < inf

minhash
boolean

Whether to use MinHash as the similarity measure.
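
MinHash is a general technique for estimating the Jaccard similarity between two sets from short signatures. The sketch below shows the idea using seeded hashes from Python's standard library. The number of hash functions, the token sets and the function names are assumptions for illustration, and this is not necessarily how the step implements it.

```python
import hashlib


def minhash_signature(tokens, num_hashes=64):
    """One minimum hash value per seeded hash function (tokens must be non-empty)."""
    signature = []
    for seed in range(num_hashes):
        min_value = min(
            int.from_bytes(
                hashlib.blake2b(
                    f"{seed}:{token}".encode("utf-8"), digest_size=8
                ).digest(),
                "big",
            )
            for token in tokens
        )
        signature.append(min_value)
    return signature


def estimated_jaccard(sig_a, sig_b):
    """Fraction of signature positions where both sets share the same minimum."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


sig_a = minhash_signature({"red", "cotton", "shirt"})
sig_b = minhash_signature({"blue", "cotton", "shirt"})
sig_c = minhash_signature({"green", "wool", "socks"})

# The estimate approaches the true Jaccard similarity as num_hashes grows
similar = estimated_jaccard(sig_a, sig_b)
dissimilar = estimated_jaccard(sig_a, sig_c)
```

Because signatures have a fixed length, comparing them is cheap even when the underlying token sets are large.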

n_terms
integer
default: "15"

Number of terms to use.

Values must be in the following range:

1 ≤ n_terms < inf

min_term_freq
integer
default: "2"

Minimum term frequency (for TF-IDF).

Values must be in the following range:

0 ≤ min_term_freq < inf

min_doc_freq
integer
default: "2"

Minimum document frequency (for TF-IDF).

Values must be in the following range:

0 ≤ min_doc_freq < inf

max_doc_perc
number
default: "0.9"

Maximum document percentage (for TF-IDF), given as a fraction between 0 and 1.

Values must be in the following range:

0 ≤ max_doc_perc ≤ 1
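
To make the TF-IDF thresholds more concrete, the sketch below shows one plausible way n_terms, min_term_freq, min_doc_freq and max_doc_perc could interact when picking the terms that represent a document. The interpretation of each threshold, the IDF formula and the function name select_terms are assumptions, not a description of the step's internals.

```python
import math
from collections import Counter


def select_terms(doc_tokens, corpus, n_terms=15, min_term_freq=2,
                 min_doc_freq=2, max_doc_perc=0.9):
    """Pick up to n_terms terms of one document, ranked by TF-IDF.

    Assumed interpretation: a term is kept only if it occurs at least
    min_term_freq times in the document, appears in at least min_doc_freq
    documents, and appears in at most max_doc_perc of all documents.
    """
    n_docs = len(corpus)
    doc_freq = Counter(term for tokens in corpus for term in set(tokens))
    term_freq = Counter(doc_tokens)

    scored = []
    for term, tf in term_freq.items():
        df = doc_freq[term]
        # Terms never seen in the corpus (df == 0) are always skipped
        if tf < min_term_freq or df < max(min_doc_freq, 1):
            continue
        if df / n_docs > max_doc_perc:
            continue
        idf = math.log(n_docs / df) + 1.0  # one common IDF variant (assumption)
        scored.append((tf * idf, term))

    # Highest score first; ties broken alphabetically for determinism
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [term for _, term in scored[:n_terms]]


corpus = [
    ["the", "cat", "cat", "sat", "the"],
    ["the", "dog", "dog", "sat", "the"],
    ["the", "cat", "cat", "ran", "the"],
    ["the", "bird", "bird", "flew", "the"],
]
selected = select_terms(corpus[0], corpus)
```

In this toy corpus, "the" is dropped because it appears in every document (above max_doc_perc), "sat" is dropped because it occurs only once in the document (below min_term_freq), and only "cat" survives.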

min_should_match
integer
default: "2"

Minimum number of terms that should match.

Values must be in the following range:

0 ≤ min_should_match < inf
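
One possible reading of min_should_match is that two rows are only considered similar if they share at least that many terms. The short sketch below illustrates that reading. Both the interpretation and the helper name shares_enough_terms are assumptions.

```python
def shares_enough_terms(terms_a, terms_b, min_should_match=2):
    """True if the two term lists have at least min_should_match terms in common."""
    return len(set(terms_a) & set(terms_b)) >= min_should_match


# Two shared terms ("cotton", "shirt") meet the default threshold of 2
match = shares_enough_terms(["red", "cotton", "shirt"], ["blue", "cotton", "shirt"])

# A single shared term ("shirt") does not
no_match = shares_enough_terms(["red", "cotton", "shirt"], ["wool", "shirt"])
```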

separator
string
default: "[\\W0-9]{1,100}"

Regular expression used to recognize string separators.
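
The effect of the default separator can be checked with Python's re module, as in the sketch below. It assumes that the doubled backslash in the documented default is string escaping, so that the actual pattern is [\W0-9]{1,100}, and that the step's regex dialect treats \W and character classes as Python does. The function name split_terms is hypothetical.

```python
import re

# Default separator from this page, written as a Python raw string
SEPARATOR = r"[\W0-9]{1,100}"


def split_terms(text, separator=SEPARATOR):
    """Split text on the separator regex and drop empty pieces."""
    return [piece for piece in re.split(separator, text) if piece]


terms = split_terms("Order #42: 3 red-shirts, size_M!")
# terms == ["Order", "red", "shirts", "size_M"]
```

With this pattern, runs of non-word characters and digits act as separators, while underscores stay inside terms because \W does not match them.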

stopwords_langs
string
default: "ES,EN"

Languages to use for stopwords. Supports ES, EN, or both separated by a comma (“ES,EN”).

Values must be one of the following:

  • ES
  • EN
  • ES,EN
  • EN,ES
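
The comma-separated stopwords_langs value can be thought of as selecting a union of per-language stopword lists. The sketch below illustrates that idea with tiny, made-up stopword sets. The lists, the lowercase matching and the function names are assumptions, and the step's real stopword lists are not shown here.

```python
# Tiny illustrative stopword lists; real lists are much longer (assumption)
STOPWORDS = {
    "EN": {"the", "and", "of", "a", "to"},
    "ES": {"el", "la", "y", "de", "los"},
}


def load_stopwords(stopwords_langs="ES,EN"):
    """Union of the stopword sets for each comma-separated language code."""
    stopwords = set()
    for lang in stopwords_langs.split(","):
        lang = lang.strip().upper()
        if lang not in STOPWORDS:
            raise ValueError(f"Unsupported stopword language: {lang!r}")
        stopwords |= STOPWORDS[lang]
    return stopwords


def remove_stopwords(tokens, stopwords_langs="ES,EN"):
    """Drop tokens that are stopwords in any of the selected languages."""
    stopwords = load_stopwords(stopwords_langs)
    return [t for t in tokens if t.lower() not in stopwords]


both = remove_stopwords(["the", "house", "de", "Madrid"])
english_only = remove_stopwords(["the", "house", "de", "Madrid"], "EN")
```

Under this reading, "ES,EN" and "EN,ES" are equivalent, since the order of the languages does not change the resulting union.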