link_similar_rows
Creates network links by calculating the similarity between multidimensional, multitype documents.
Creates a link between each row and the N rows most similar to it. Broadly, the similarity between two rows is a weighted combination of the similarities of their individual columns. The step accepts columns of all data types, e.g. text, quantitative and categorical.
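To make the idea of a weighted, per-column similarity concrete, here is a minimal Python sketch. It is not the step's actual implementation: the per-column rules (word-set overlap for text, range-scaled difference for numbers), the column weights and all function names are assumptions chosen for illustration only.

```python
from typing import Any, Dict, List, Tuple

Row = Dict[str, Any]


def column_similarity(a: Any, b: Any, value_range: float) -> float:
    """Similarity in [0, 1] for one pair of values (illustrative rules only)."""
    if isinstance(a, str) and isinstance(b, str):
        # Text or categorical: Jaccard overlap of the lowercase word sets.
        words_a, words_b = set(a.lower().split()), set(b.lower().split())
        union = words_a | words_b
        return len(words_a & words_b) / len(union) if union else 1.0
    # Quantitative: 1 minus the absolute difference scaled by the column range.
    if value_range == 0:
        return 1.0
    return 1.0 - min(abs(a - b) / value_range, 1.0)


def row_similarity(a: Row, b: Row, column_weights: Dict[str, float],
                   ranges: Dict[str, float]) -> float:
    """Weighted average of the per-column similarities."""
    total = sum(column_weights.values())
    score = sum(w * column_similarity(a[c], b[c], ranges.get(c, 0.0))
                for c, w in column_weights.items())
    return score / total


def top_n_links(rows: List[Row], column_weights: Dict[str, float],
                n: int) -> Tuple[List[List[int]], List[List[float]]]:
    """For each row, return the indices and similarities of its n most similar rows."""
    ranges = {}
    for col in column_weights:
        values = [r[col] for r in rows]
        if not isinstance(values[0], str):
            ranges[col] = max(values) - min(values)
    targets, link_weights = [], []
    for i, row in enumerate(rows):
        scored = [(row_similarity(row, other, column_weights, ranges), j)
                  for j, other in enumerate(rows) if j != i]
        scored.sort(key=lambda s: (-s[0], s[1]))  # most similar first
        best = scored[:n]
        targets.append([j for _, j in best])
        link_weights.append([s for s, _ in best])
    return targets, link_weights


rows = [
    {"title": "red apple pie", "price": 10.0},
    {"title": "green apple pie", "price": 12.0},
    {"title": "blue steel hammer", "price": 30.0},
]
targets, link_weights = top_n_links(rows, {"title": 2.0, "price": 1.0}, n=1)
print(targets)
```

In this toy dataset the two pie rows link to each other, and the hammer row links to whichever row is closest in price, since it shares no words with either title.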
Usage
The following example shows how the step can be used in a recipe.
Examples
General syntax for using the step in a recipe. It shows the inputs the step expects and the outputs it produces. For further details see the sections below.
Inputs & Outputs
The following are the inputs expected by the step and the outputs it produces. These are generally columns (ds.first_name), datasets (ds or ds[["first_name", "last_name"]]) or models (referenced by name, e.g. "churn-clf").
Inputs
A dataset containing the columns to be included in the calculation of pair-wise similarities. Note: a subset of columns can always be selected in a recipe using the ds[["column1", "column2", …]] syntax, or excluded using ds[!["column1", "column2", …]].
Outputs
A column containing, for each row, a list of row numbers identifying all other rows it is similar to.
A column containing, for each row, a list of weights identifying the “importance” of each link to the other rows listed in the targets column (i.e. how similar the rows are).
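The two output columns are parallel lists: the k-th weight in a row belongs to the k-th target in the same row. As a minimal sketch of how such columns could be unpacked into an edge list, the Python below uses made-up values and assumes zero-based row numbers; the function name to_edge_list is hypothetical and not part of the step.

```python
from typing import List, Tuple


def to_edge_list(targets: List[List[int]],
                 weights: List[List[float]]) -> List[Tuple[int, int, float]]:
    """Flatten per-row target and weight lists into (source, target, weight) edges."""
    edges = []
    for source, (row_targets, row_weights) in enumerate(zip(targets, weights)):
        # Pair each target with the weight at the same position in the row.
        for target, weight in zip(row_targets, row_weights):
            edges.append((source, target, weight))
    return edges


# Made-up example: row 0 links to rows 1 and 2, rows 1 and 2 link back to row 0.
targets = [[1, 2], [0], [0]]
weights = [[0.9, 0.4], [0.9], [0.4]]
edges = to_edge_list(targets, weights)
print(edges)
```

An edge list of this shape is a common input format for graph libraries and network visualisation tools.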
Configuration
The following parameters can be used to configure the behaviour of the step by including them in a json object as the last “input” to the step, i.e. step(..., {"param": "value", ...}) -> (output).
Parameters
Number of similar docs.
Values must be in the following range:
Whether to use minhash as a similarity measure.
Number of terms to use.
Values must be in the following range:
Minimum term frequency (for TF-IDF).
Values must be in the following range:
Minimum doc frequency (for TF-IDF).
Values must be in the following range:
Maximum doc percentage (for TF-IDF).
Values must be in the following range:
Minimum number of terms that should match.
Values must be in the following range:
Regex to recognize as string separator.
Languages to use for stopwords. Supports ES, EN, or both as a comma-separated value such as "ES,EN".
Values must be one of the following:
ES
EN
ES,EN
EN,ES
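Several of the parameters above (number of terms, minimum term and document frequency, maximum document percentage, separator regex, stopword languages) are typical controls for building a text vocabulary. The Python sketch below shows one way such filters could interact. It is an assumption-laden illustration, not the step's implementation: the argument names are hypothetical (the actual parameter names are not shown here), the stopword lists are tiny placeholders, and the real step may apply the filters differently.

```python
import re
from collections import Counter
from typing import List

# Tiny placeholder stopword lists, for illustration only.
STOPWORDS = {
    "EN": {"the", "a", "of", "and"},
    "ES": {"el", "la", "de", "y"},
}


def select_vocabulary(docs: List[str], max_terms: int = 10,
                      min_term_freq: int = 1, min_doc_freq: int = 1,
                      max_doc_pct: float = 1.0, separator: str = r"\W+",
                      languages: str = "EN") -> List[str]:
    """Pick the terms that survive the frequency filters, most frequent first."""
    stop = set()
    for lang in languages.split(","):
        stop |= STOPWORDS[lang.strip()]

    term_freq, doc_freq = Counter(), Counter()
    for doc in docs:
        # Split on the separator regex, drop empty strings and stopwords.
        tokens = [t for t in re.split(separator, doc.lower())
                  if t and t not in stop]
        term_freq.update(tokens)       # total occurrences of each term
        doc_freq.update(set(tokens))   # number of documents containing it

    n_docs = len(docs)
    kept = [t for t in term_freq
            if term_freq[t] >= min_term_freq
            and doc_freq[t] >= min_doc_freq
            and doc_freq[t] / n_docs <= max_doc_pct]
    kept.sort(key=lambda t: (-term_freq[t], t))
    return kept[:max_terms]


docs = ["the red house", "the red car", "la casa roja", "a red door and a house"]
vocab = select_vocabulary(docs, min_doc_freq=2, max_doc_pct=0.5,
                          languages="ES,EN")
print(vocab)
```

Here "red" is dropped because it appears in more than half of the documents, the rare terms are dropped by the minimum document frequency, and only "house" remains.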