Link similar rows
network • similarity
Create network links by calculating the similarity between multidimensional, multitype documents.
Creates a link between each row and the N rows most similar to it. Broadly, the similarity between two rows is a weighted combination of the similarities of their individual columns. The step accepts columns of any data type, e.g. text, quantitative and categorical columns.
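The idea of a weighted, per-column similarity can be sketched in Python as follows. This is a minimal illustration only, not the step's actual implementation: the per-column similarity functions, the equal weights and the helper names (`numeric_similarity`, `categorical_similarity`, `row_similarity`, `link_top_n`) are all assumptions made for the example.

```python
def numeric_similarity(a, b, value_range):
    """Assumed measure: 1 when equal, 0 when a full value range apart."""
    if value_range == 0:
        return 1.0
    return 1.0 - abs(a - b) / value_range

def categorical_similarity(a, b):
    """Assumed measure: 1 for identical categories, 0 otherwise."""
    return 1.0 if a == b else 0.0

def row_similarity(row_a, row_b, column_sims, weights):
    """Weighted average of the per-column similarities of two rows."""
    total = sum(weights)
    return sum(
        w * sim(x, y)
        for x, y, sim, w in zip(row_a, row_b, column_sims, weights)
    ) / total

def link_top_n(rows, column_sims, weights, n):
    """For each row, return the indices and scores of its n most similar rows."""
    targets, link_weights = [], []
    for i, row in enumerate(rows):
        scored = [
            (row_similarity(row, other, column_sims, weights), j)
            for j, other in enumerate(rows)
            if j != i
        ]
        # Highest similarity first; ties broken by row index.
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        top = scored[:n]
        targets.append([j for _, j in top])
        link_weights.append([score for score, _ in top])
    return targets, link_weights

# Hypothetical rows: (age, department); ages span a range of 25 years.
rows = [(30, "sales"), (32, "sales"), (55, "engineering")]
column_sims = [lambda a, b: numeric_similarity(a, b, 25.0), categorical_similarity]
targets, link_weights = link_top_n(rows, column_sims, weights=[1.0, 1.0], n=1)
print(targets)  # [[1], [0], [1]]
```

The brute-force comparison of all pairs is quadratic in the number of rows and is used here only to keep the sketch short.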
Usage
The following are the step's expected inputs and outputs and their specific types.
link_similar_rows(ds: dataset, {"param": value}) -> (targets: column, weights: column)
where the object {"param": value} is optional in most cases and, if present, may contain any of the parameters described in the corresponding section below.
Example
link_similar_rows(ds[["bio", "salary", "age", "department"]]) -> (targets, weights)
Inputs
ds: dataset
A dataset containing the columns to be included in the calculation of pair-wise similarities. Note: a subset of columns can always be selected in a recipe using the ds[["column1", "column2", ...]] syntax, or excluded using ds[!["column1", "column2", ...]].
Outputs
targets: column
A column containing, for each row, a list of row numbers identifying all other rows it is similar to.
weights: column
A column containing, for each row, a list of weights indicating the "importance" of each link to the rows identified in the targets column (i.e. how similar the rows are).
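As a rough illustration of how the two output columns fit together, the sketch below flattens them into a list of (source, target, weight) edges. It assumes that both columns hold equal-length lists per row and that row numbers are zero-based positions; neither is confirmed by this page, and `to_edge_list` is a hypothetical helper name.

```python
def to_edge_list(targets, weights):
    """Pair each row's targets with the matching weights, one edge per link."""
    edges = []
    for source, (row_targets, row_weights) in enumerate(zip(targets, weights)):
        for target, weight in zip(row_targets, row_weights):
            edges.append((source, target, weight))
    return edges

# Hypothetical output columns for a three-row dataset.
targets = [[1, 2], [0], [0]]
weights = [[0.9, 0.4], [0.9], [0.4]]

edges = to_edge_list(targets, weights)
print(edges)  # [(0, 1, 0.9), (0, 2, 0.4), (1, 0, 0.9), (2, 0, 0.4)]
```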
Parameters
n_similar_docs: integer = 10
Number of most similar rows (docs) to link each row to.
Range: 1 ≤ n_similar_docs < inf
minhash: boolean = False
Whether to use minhash as a similarity measure.
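For readers unfamiliar with the technique, MinHash estimates the Jaccard similarity of two token sets by comparing compact signatures instead of the full sets. The sketch below shows the general idea using only the Python standard library; the number of hash functions, the use of seeded BLAKE2 hashes and the helper names are assumptions for illustration and say nothing about how this step implements it.

```python
import hashlib

def minhash_signature(tokens, num_hashes=64):
    """One minimum hash value per seeded hash function (tokens must be non-empty)."""
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(f"{seed}:{tok}".encode("utf-8"), digest_size=8).digest(),
                "big",
            )
            for tok in tokens
        ))
    return signature

def estimated_jaccard(sig_a, sig_b):
    """Fraction of signature positions on which the two signatures agree."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)

doc_a = {"senior", "data", "analyst", "sales"}
doc_b = {"junior", "data", "analyst", "sales"}

# The true Jaccard similarity is 3/5; the estimate approaches it as num_hashes grows.
print(estimated_jaccard(minhash_signature(doc_a), minhash_signature(doc_b)))
```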
n_terms: integer = 15
Number of terms to use.
Range: 1 ≤ n_terms < inf
min_term_freq: integer = 2
Minimum term frequency (for TF-IDF).
Range: 0 ≤ min_term_freq < inf
min_doc_freq: integer = 2
Minimum doc frequency (for TF-IDF).
Range: 0 ≤ min_doc_freq < inf
max_doc_perc: number = 0.9
Maximum doc percentage (for TF-IDF), expressed as a fraction between 0 and 1.
Range: 0 ≤ max_doc_perc ≤ 1
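One plausible reading of the three frequency thresholds is sketched below: a term of a given document is kept only if it occurs often enough in that document, appears in enough documents overall, and is not so common that it appears in more than the allowed share of documents. This interpretation, and the helper name `select_terms`, are assumptions; the page does not spell out the exact semantics.

```python
from collections import Counter

def select_terms(docs, doc_index, min_term_freq=2, min_doc_freq=2, max_doc_perc=0.9):
    """Return the terms of docs[doc_index] that pass all three thresholds."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))
    # Term frequency within the selected document.
    term_freq = Counter(docs[doc_index])
    return sorted(
        term
        for term, tf in term_freq.items()
        if tf >= min_term_freq
        and doc_freq[term] >= min_doc_freq
        and doc_freq[term] / n_docs <= max_doc_perc
    )

# Hypothetical, already tokenized documents.
docs = [
    ["data", "data", "team", "team", "the", "the"],
    ["data", "the", "the", "sales"],
    ["the", "the", "report"],
]

# "team" appears in only one document and "the" appears in all of them,
# so only "data" survives for the first document.
print(select_terms(docs, 0))  # ['data']
```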
min_should_match: integer = 2
Minimum number of terms that should match.
Range: 0 ≤ min_should_match < inf
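Read literally, this threshold means two rows are only considered similar if they share at least that many terms. A minimal sketch of such a check, assuming terms are compared as exact strings and using the hypothetical helper `shares_enough_terms`:

```python
def shares_enough_terms(terms_a, terms_b, min_should_match=2):
    """True if the two term lists have at least min_should_match terms in common."""
    return len(set(terms_a) & set(terms_b)) >= min_should_match

print(shares_enough_terms(["data", "team", "sales"], ["data", "sales", "report"]))  # True
print(shares_enough_terms(["data", "team"], ["data", "report"]))  # False
```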
separator: string = "[\W0-9]{1,100}"
Regex to recognize as string separator.
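The effect of the default separator can be checked with Python's `re` module, whose syntax accepts this pattern as written. Whether the step itself uses the same regex engine is an assumption, and `tokenize` is a hypothetical helper: the pattern treats any run of non-word characters or digits as a boundary between terms.

```python
import re

SEPARATOR = r"[\W0-9]{1,100}"  # default value of the separator parameter

def tokenize(text, separator=SEPARATOR):
    """Split text on the separator regex and drop empty strings."""
    return [tok for tok in re.split(separator, text) if tok]

tokens = tokenize("Senior data-analyst, 10 years @ ACME")
print(tokens)  # ['Senior', 'data', 'analyst', 'years', 'ACME']
```

Note that digits are discarded along with punctuation, while underscores count as word characters and therefore do not split a term.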
stopwords_langs: string = "ES,EN"
Languages to use for stopwords. Supports ES, EN, or both separated by a comma ("ES,EN").
Must be one of: "ES", "EN", "ES,EN", "EN,ES"