Detect near-duplicate documents using Jaccard similarity over word 3-grams (shingles).
Signature: def jaccard_duplicates(docs: list, threshold: float = 0.8) -> list
For each document:
For each pair (i, j) with i < j, compute Jaccard = |A ∩ B| / |A ∪ B| and return the list of (i, j) tuples whose similarity is >= threshold. If a document has fewer than 3 tokens its shingle set is empty; treat any pair involving an empty set as similarity 0.
Return pairs sorted ascending by (i, j).
Math
Asked at
import numpy as np
def jaccard_duplicates(...):
pass
Premium problem
Free accounts include problems #1–20. Upgrade to unlock the editor, hidden test cases, and reference solutions for every problem.
Already premium?