Usage
The following example shows how the step can be used in a recipe.
Examples
Inputs & Outputs
The following are the inputs expected by the step and the outputs it produces. These are generally columns (ds.first_name), datasets (ds or ds[["first_name", "last_name"]]) or models (referenced by name, e.g. "churn-clf").
Inputs
A dataset containing the columns to be included in the calculation of pair-wise similarities.
Note: a subset of columns can always be selected in a recipe using the ds[["column1", "column2", …]] syntax, or excluded using ds[!["column1", "column2", …]].
Outputs
A column containing, for each row, a list of row numbers identifying all other rows it is similar to.
A column containing, for each row, a list of weights identifying the “importance” of each link to the other rows listed in the targets column (that is, how similar the rows are).
Configuration
The following parameters can be used to configure the behaviour of the step by including them in a JSON object as the last “input” to the step, i.e. step(..., {"param": "value", ...}) -> (output).
Parameters
Number of similar docs. Values must be in the following range:
Whether to use minhash as a similarity measure.
Number of terms to use. Values must be in the following range:
Minimum term frequency (for TF-IDF). Values must be in the following range:
Minimum doc frequency (for TF-IDF). Values must be in the following range:
Maximum doc percentage (for TF-IDF). Values must be in the following range:
Minimum number of terms that should match. Values must be in the following range:
Regex to recognize as string separator.
Languages to use for stopwords. Supports ES, EN, or both separated by a comma (“ES,EN”). Values must be one of the following:
ES
EN
ES,EN
EN,ES
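To make the outputs and parameters described above more concrete, the following is a minimal Python sketch of how per-row lists of similar rows and link weights could be derived from a text column. It is not the step's implementation: it uses plain Jaccard overlap of token sets as a stand-in for the TF-IDF or minhash measures, and every name in it (link_similar_rows, n_similar, min_match, separator, EN_STOPWORDS) is hypothetical, chosen only to mirror the parameter descriptions.

```python
import re

# Hypothetical stopword list; the step's actual EN/ES lists are not shown here.
EN_STOPWORDS = {"the", "a", "an", "of", "and", "is", "on"}


def tokenize(text, separator=r"\W+", stopwords=EN_STOPWORDS):
    """Split text on a separator regex, dropping stopwords and empty tokens."""
    tokens = re.split(separator, text.lower())
    return {t for t in tokens if t and t not in stopwords}


def link_similar_rows(texts, n_similar=2, min_match=1):
    """Return (targets, weights): for each row, its most similar other rows.

    Similarity is the Jaccard overlap of token sets, an assumption made
    for this sketch only.
    """
    token_sets = [tokenize(t) for t in texts]
    targets, weights = [], []
    for i, a in enumerate(token_sets):
        scored = []
        for j, b in enumerate(token_sets):
            if i == j:
                continue  # a row is never linked to itself
            shared = len(a & b)
            union = len(a | b)
            if union == 0 or shared < min_match:
                continue  # too few terms in common to count as a link
            scored.append((shared / union, j))
        # Highest similarity first; ties broken by lower row number.
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        top = scored[:n_similar]
        targets.append([j for _, j in top])
        weights.append([round(w, 3) for w, _ in top])
    return targets, weights


texts = [
    "the cat sat on the mat",
    "a cat sat on a mat",
    "dogs chase the cat",
    "stock markets fell sharply",
]
targets, weights = link_similar_rows(texts)
for row, (t, w) in enumerate(zip(targets, weights)):
    print(row, t, w)
```

In this sketch the last row shares no terms with any other row, so both of its lists are empty. Raising min_match or lowering n_similar would make the links sparser in the same way the corresponding parameters are described as doing.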