TorchedUp
ProblemsPremium
TorchedUp
Near-Duplicate Detection (Jaccard)Medium
ProblemsPremium

Near-Duplicate Detection

Detect near-duplicate documents using Jaccard similarity over word 3-grams (shingles).

Signature: def jaccard_duplicates(docs: list, threshold: float = 0.8) -> list

For each document:

  1. Split on whitespace into tokens
  2. Form the set of overlapping 3-token shingles

For each pair (i, j) with i < j, compute Jaccard = |A ∩ B| / |A ∪ B| and return the list of (i, j) tuples whose similarity is >= threshold. If a document has fewer than 3 tokens its shingle set is empty; treat any pair involving an empty set as similarity 0.

Return pairs sorted ascending by (i, j).

Math

Asked at

Python (numpy)0/3 runs today

Test Results

○identical docs
○no duplicates
○lower threshold🔒 Premium
Advertisement