link_similar_rows

Creates a link between each row and the N rows most similar to it. Broadly, the similarity between two rows is a weighted combination of the similarities of their individual columns. The step accepts columns of any data type, e.g. text, quantitative or categorical.
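As a rough illustration of the idea, the sketch below computes a weighted per-column similarity and keeps the N best matches for each row. It is not the step's actual implementation: the per-column measures (range-scaled distance for numbers, exact match for categories, word-set overlap for text), the helper names and the sample data are all assumptions made for the example.

```python
from typing import Dict, List

Row = Dict[str, object]


def column_similarity(kind: str, a, b, value_range: float = 0.0) -> float:
    """Similarity in [0, 1] for one column (assumed measures, for illustration)."""
    if kind == "numeric":
        # Identical values score 1.0, values a full range apart score 0.0.
        return 1.0 if value_range == 0 else 1.0 - abs(a - b) / value_range
    if kind == "categorical":
        return 1.0 if a == b else 0.0
    # Text: Jaccard overlap of lowercase word sets.
    words_a, words_b = set(str(a).lower().split()), set(str(b).lower().split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 1.0


def row_similarity(r1: Row, r2: Row, kinds: Dict[str, str],
                   weights: Dict[str, float], ranges: Dict[str, float]) -> float:
    """Weighted average of the column similarities (weights assumed positive)."""
    total_weight = sum(weights[c] for c in kinds)
    score = sum(
        weights[c] * column_similarity(kinds[c], r1[c], r2[c], ranges.get(c, 0.0))
        for c in kinds
    )
    return score / total_weight


def top_n_links(rows: List[Row], kinds: Dict[str, str],
                weights: Dict[str, float], n: int) -> Dict[int, List[int]]:
    """Map each row index to the indices of its n most similar rows."""
    ranges = {
        c: max(r[c] for r in rows) - min(r[c] for r in rows)
        for c, k in kinds.items() if k == "numeric"
    }
    links = {}
    for i, row in enumerate(rows):
        scored = [
            (row_similarity(row, other, kinds, weights, ranges), j)
            for j, other in enumerate(rows) if j != i
        ]
        # Highest similarity first; ties broken by row index.
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        links[i] = [j for _, j in scored[:n]]
    return links


rows = [
    {"age": 30, "city": "Madrid", "bio": "data science and graphs"},
    {"age": 32, "city": "Madrid", "bio": "graphs and data analysis"},
    {"age": 60, "city": "London", "bio": "gardening and cooking"},
]
kinds = {"age": "numeric", "city": "categorical", "bio": "text"}
weights = {"age": 1.0, "city": 1.0, "bio": 1.0}

links = top_n_links(rows, kinds, weights, n=1)
print(links)  # {0: [1], 1: [0], 2: [1]}
```

Comparing every pair of rows like this is quadratic in the number of rows, which is one reason an approximate measure may be preferable on large datasets.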

Usage

The following example shows how the step can be used in a recipe.

Examples

Inputs & Outputs

The following are the inputs expected by the step and the outputs it produces. These are generally columns (ds.first_name), datasets (ds or ds[["first_name", "last_name"]]) or models (referenced by name e.g. "churn-clf").

Inputs

Outputs

Configuration

The following parameters can be used to configure the behaviour of the step by including them in a JSON object as the last “input” to the step, i.e. step(..., {"param": "value", ...}) -> (output).
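One way to prepare such a configuration object is to build it as a dictionary, check it against the ranges listed under Parameters, and serialise it to JSON. The sketch below does this in Python; `validate_config` is a hypothetical helper written for this example and is not part of the step or its tooling.

```python
import json

# Lower bounds copied from the Parameters list of this step.
INT_MINIMUMS = {
    "n_similar_docs": 1,
    "n_terms": 1,
    "min_term_freq": 0,
    "min_doc_freq": 0,
    "min_should_match": 0,
}
STOPWORD_CHOICES = {"ES", "EN", "ES,EN", "EN,ES"}


def validate_config(config: dict) -> list:
    """Return a list of human-readable problems (empty if none were found)."""
    errors = []
    for name, minimum in INT_MINIMUMS.items():
        if name in config:
            value = config[name]
            # bool is a subclass of int in Python, so reject it explicitly.
            if isinstance(value, bool) or not isinstance(value, int) or value < minimum:
                errors.append(f"{name} must be an integer >= {minimum}")
    if "max_doc_perc" in config:
        value = config["max_doc_perc"]
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        if not is_number or not 0 <= value <= 1:
            errors.append("max_doc_perc must be a number between 0 and 1")
    if "minhash" in config and not isinstance(config["minhash"], bool):
        errors.append("minhash must be a boolean")
    if "stopwords_langs" in config and config["stopwords_langs"] not in STOPWORD_CHOICES:
        errors.append("stopwords_langs must be one of ES, EN, ES,EN, EN,ES")
    return errors


config = {"n_similar_docs": 5, "minhash": False, "stopwords_langs": "EN"}
assert validate_config(config) == []

# The JSON text that would be passed as the last "input" to the step.
config_json = json.dumps(config)
print(config_json)  # {"n_similar_docs": 5, "minhash": false, "stopwords_langs": "EN"}
```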

Parameters

n_similar_docs

integer

default:"10"

Number of similar docs (rows) to link to each row. Values must be in the following range:

1 ≤ n_similar_docs < inf

minhash

boolean

default:"false"

Whether to use minhash as a similarity measure.
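For readers unfamiliar with the technique, MinHash approximates the Jaccard similarity of two token sets by comparing compact signatures instead of the sets themselves. The sketch below shows the general idea only; how the step applies MinHash internally is not documented here, and the hash function, signature length and function names are assumptions chosen for the example.

```python
import hashlib
from typing import Iterable, List


def _hash(token: str, seed: int) -> int:
    """Deterministic 64-bit hash of a token; each seed acts as a separate hash function."""
    digest = hashlib.blake2b(
        token.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(tokens: Iterable[str], num_hashes: int = 128) -> List[int]:
    """For each hash function, keep the smallest hash value over the token set."""
    token_set = set(tokens)
    if not token_set:
        raise ValueError("cannot build a signature for an empty token set")
    return [min(_hash(t, seed) for t in token_set) for seed in range(num_hashes)]


def estimated_jaccard(sig_a: List[int], sig_b: List[int]) -> float:
    """Fraction of signature positions that agree; this estimates the Jaccard similarity."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


a = {"link", "similar", "rows", "graph"}
b = {"link", "similar", "rows", "table"}  # true Jaccard similarity: 3 / 5 = 0.6

estimate = estimated_jaccard(minhash_signature(a, 256), minhash_signature(b, 256))
print(round(estimate, 2))  # close to 0.6, not exact
```

Longer signatures give tighter estimates at the cost of more hashing per row.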

n_terms

integer

default:"15"

Number of terms to use. Values must be in the following range:

1 ≤ n_terms < inf

min_term_freq

integer

default:"2"

Minimum term frequency (for TF-IDF). Values must be in the following range:

0 ≤ min_term_freq < inf

min_doc_freq

integer

default:"2"

Minimum doc frequency (for TF-IDF). Values must be in the following range:

0 ≤ min_doc_freq < inf

max_doc_perc

number

default:"0.9"

Maximum doc percentage (for TF-IDF). Values must be in the following range:

0 ≤ max_doc_perc ≤ 1
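The document-frequency parameters can be pictured as a vocabulary filter applied before TF-IDF weighting. The sketch below assumes one plausible reading: a term is kept only if it appears in at least `min_doc_freq` documents and in no more than a `max_doc_perc` fraction of all documents. The exact semantics used by the step are not stated in this page, and `filter_vocabulary` is a hypothetical name.

```python
from collections import Counter
from typing import List, Set


def filter_vocabulary(docs: List[List[str]],
                      min_doc_freq: int = 2,
                      max_doc_perc: float = 0.9) -> Set[str]:
    """Keep terms that are neither too rare nor too common across documents."""
    doc_freq = Counter()
    for tokens in docs:
        # Count each term at most once per document.
        doc_freq.update(set(tokens))
    max_docs = max_doc_perc * len(docs)
    return {term for term, df in doc_freq.items() if min_doc_freq <= df <= max_docs}


docs = [
    ["the", "graph", "data"],
    ["the", "graph", "links"],
    ["the", "rows", "data"],
    ["the", "rows", "similar"],
]

vocabulary = filter_vocabulary(docs)
print(sorted(vocabulary))  # ['data', 'graph', 'rows']
```

Here "the" is dropped because it occurs in every document (above 90%), while "links" and "similar" are dropped because each occurs in only one.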

min_should_match

integer

default:"2"

Minimum number of terms that should match. Values must be in the following range:

0 ≤ min_should_match < inf

separator

string

default:"[\\W0-9]{1,100}"

Regex to recognize as string separator.
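To see what the default separator does, the sketch below applies it with Python's `re` module. The default shown above is JSON-escaped, so the pattern itself is `[\W0-9]{1,100}`: any run of 1 to 100 non-word characters or digits. Whether the step's regex engine behaves exactly like Python's is an assumption, and the sample text is invented for the example.

```python
import re

# Default separator from the parameter list, written as a Python raw string.
SEPARATOR = r"[\W0-9]{1,100}"


def split_terms(text: str, separator: str = SEPARATOR) -> list:
    """Split text on the separator pattern and drop empty fragments."""
    return [part for part in re.split(separator, text) if part]


terms = split_terms("Order #42: 3 red-apples, 10pears!")
print(terms)  # ['Order', 'red', 'apples', 'pears']
```

Note that digits count as separators under this default, so numeric tokens such as "42" never become terms.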

stopwords_langs

string

default:"ES,EN"

Languages to use for stopwords. Supports ES, EN, or both separated by a comma (“ES,EN”). Values must be one of the following:

ES
EN
ES,EN
EN,ES
