TorchedUp
LearnBetaProblemsSystem DesignSoonPremium
TorchedUp
LearnBetaProblemsSystem DesignSoonPremium
←

125. Near-Duplicate Detection (Jaccard)

Medium

Detect near-duplicate documents using Jaccard similarity over word 3-grams (shingles).

Signature: def jaccard_duplicates(docs: list, threshold: float = 0.8) -> list

For each document:

  1. Split on whitespace into tokens
  2. Form the set of overlapping 3-token shingles

For each pair (i, j) with i < j, compute Jaccard = |A ∩ B| / |A ∪ B| and return the list of (i, j) tuples whose similarity is >= threshold. If a document has fewer than 3 tokens its shingle set is empty; treat any pair involving an empty set as similarity 0.

Return pairs sorted ascending by (i, j).

Math

J(A,B)=∣A∪B∣∣A∩B∣​

Asked at

NumPy

import numpy as np

 

def jaccard_duplicates(...):

    pass

🔒

Premium problem

Free accounts include problems #1–20. Upgrade to unlock the editor, hidden test cases, and reference solutions for every problem.

Upgrade to PremiumBack to problems

Already premium?