Detect near-duplicate documents using Jaccard similarity over word 3-grams (shingles).
Signature: def jaccard_duplicates(docs: list, threshold: float = 0.8) -> list
For each document:
For each pair (i, j) with i < j, compute Jaccard = |A ∩ B| / |A ∪ B| and return the list of (i, j) tuples whose similarity is >= threshold. If a document has fewer than 3 tokens its shingle set is empty; treat any pair involving an empty set as similarity 0.
Return pairs sorted ascending by (i, j).
Math
Asked at
Test Results