link_similar_rows
Creates network links by calculating the similarity between multidimensional, multi-type documents.
Creates a link between each row and the N rows most similar to it. Broadly, the similarity between two rows is a weighted combination of the similarities of their individual columns. The step accepts columns of any data type, e.g. text, quantitative or categorical.
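The idea of a weighted, per-column similarity followed by a top-N selection can be sketched in Python as below. This is only an illustration under stated assumptions, not the step's actual implementation: the per-column measures (range-normalised difference, exact match, token Jaccard), the function names and the equal weights are all hypothetical choices made for the example.

```python
import numpy as np

def numeric_similarity(values):
    """Pairwise similarity in [0, 1] from range-normalised absolute differences."""
    v = np.asarray(values, dtype=float)
    span = v.max() - v.min()
    if span == 0:
        return np.ones((len(v), len(v)))
    return 1.0 - np.abs(v[:, None] - v[None, :]) / span

def categorical_similarity(values):
    """1.0 where two rows share a category, 0.0 otherwise."""
    return np.array([[1.0 if a == b else 0.0 for b in values] for a in values])

def text_similarity(values):
    """Jaccard overlap of lowercase, whitespace-separated tokens."""
    token_sets = [set(str(t).lower().split()) for t in values]
    n = len(token_sets)
    sim = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            union = token_sets[i] | token_sets[j]
            sim[i, j] = len(token_sets[i] & token_sets[j]) / len(union) if union else 1.0
    return sim

def link_top_n(column_sims, weights, n_links):
    """Combine per-column similarities and return each row's n_links most similar rows."""
    w = np.asarray(weights, dtype=float)
    combined = sum(wi * s for wi, s in zip(w, column_sims)) / w.sum()
    np.fill_diagonal(combined, -np.inf)  # a row never links to itself
    targets = np.argsort(-combined, axis=1, kind="stable")[:, :n_links]
    return targets, np.take_along_axis(combined, targets, axis=1)

# Three rows with one quantitative, one categorical and one text column (made-up data).
age = [30, 32, 60]
segment = ["a", "a", "b"]
note = ["red car", "red bike", "blue boat"]

sims = [numeric_similarity(age), categorical_similarity(segment), text_similarity(note)]
targets, link_weights = link_top_n(sims, weights=[1, 1, 1], n_links=1)
print(targets[:, 0].tolist())  # index of the most similar row, per row
```

Each row ends up with exactly `n_links` outgoing links, so the resulting network is directed: row 2 may link to row 1 even though row 1 links elsewhere.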
Usage
The following example shows how the step can be used in a recipe.
Inputs & Outputs
The following are the inputs expected by the step and the outputs it produces. These are generally columns (ds.first_name), datasets (ds or ds[["first_name", "last_name"]]) or models (referenced by name, e.g. "churn-clf").
Configuration
The following parameters can be used to configure the behaviour of the step by including them in a JSON object as the last “input” to the step, i.e. step(..., {"param": "value", ...}) -> (output).
Number of similar docs.
Values must be in the following range:
Whether to use minhash as a similarity measure.
Number of terms to use.
Values must be in the following range:
Minimum term frequency. (For TFIDF).
Values must be in the following range:
Minimum doc frequency. (For TFIDF).
Values must be in the following range:
Maximum doc percentage. (For TFIDF).
Values must be in the following range:
Minimum of terms that should match.
Values must be in the following range:
Regex to recognize as string separator.
Languages to use for stopwords. Supports ES, EN, or both separated by a comma (e.g. “ES,EN”).
Values must be one of the following:
ES
EN
ES,EN
EN,ES
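To make the text-related parameters more concrete, the following Python sketch shows how a separator regex, stopword languages, a minimum document frequency, a maximum document percentage and a term limit could together prune a vocabulary before similarities are computed. It is a rough illustration only: the function, its argument names, the tiny stopword lists and the ordering by document frequency are assumptions made for this example and are not taken from the step itself.

```python
import re
from collections import Counter

# Tiny illustrative stopword lists; a real list would be much longer.
STOPWORDS = {
    "EN": {"the", "a", "of", "and"},
    "ES": {"el", "la", "de", "y"},
}

def build_vocabulary(docs, separator=r"\s+", languages="EN",
                     min_doc_freq=1, max_doc_pct=1.0, max_terms=None):
    """Return the terms that survive stopword and document-frequency filtering."""
    stop = set()
    for lang in languages.split(","):
        stop |= STOPWORDS[lang.strip()]

    # Count in how many documents each non-stopword term appears.
    doc_freq = Counter()
    for doc in docs:
        tokens = {t for t in re.split(separator, doc.lower()) if t and t not in stop}
        doc_freq.update(tokens)

    n_docs = len(docs)
    kept = [
        (term, df) for term, df in doc_freq.items()
        if df >= min_doc_freq and df / n_docs <= max_doc_pct
    ]
    # Most frequent terms first, ties broken alphabetically; keep at most max_terms.
    kept.sort(key=lambda item: (-item[1], item[0]))
    return [term for term, _ in kept[:max_terms]]

docs = ["the red car", "the red bike", "a blue boat", "red and blue"]
print(build_vocabulary(docs, min_doc_freq=2))                   # ['red', 'blue']
print(build_vocabulary(docs, min_doc_freq=2, max_doc_pct=0.6))  # ['blue']
```

In the second call, "red" appears in 3 of 4 documents (75%), which exceeds the 60% cap, so only "blue" remains.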