
MinHash+LSH: Fast Deduplication That Doesn't Understand Meaning

Feb 16, 2026

6 min read


I spent last week deduplicating 10 million scraped articles (not really). Embeddings were too slow. Exact string matching missed near-duplicates. MinHash+LSH solved the lexical problem in minutes but completely missed semantic duplicates.

Here's what I learned about when this algorithm works and when it doesn't.


The Problem: Finding Duplicates at Scale

You have millions of documents. Many are duplicates or near-duplicates. You need to find them without comparing every pair (that's O(n²) — too slow).

Naive approaches:

  • Exact string matching: Only catches perfect copies. Misses "Hello world" vs "Hello world!"
  • Edit distance (Levenshtein): Accurate but slow. O(m×n) per comparison.
  • Embeddings + cosine similarity: Great for semantic similarity, but computing embeddings for 10M docs and comparing them is expensive.

What you want: Fast approximate matching that catches near-duplicates (copy-paste with minor edits) without expensive computation.

That's what MinHash+LSH provides.


How MinHash Works

MinHash (Minimum Hash) converts a document into a compact signature that preserves similarity.

The core idea:

  1. Convert document to a set of shingles (n-grams)
  2. Hash each shingle many times with different hash functions
  3. For each hash function, keep the minimum hash value
  4. The collection of minimums is your MinHash signature

Example:

Document: "The quick brown fox"

Shingles (word 3-grams):

["The quick brown", "quick brown fox"]

Apply 100 different hash functions to each shingle. For each hash function, keep the minimum:

# Simplified example with 3 hash functions
signature = [
  min(hash1("The quick brown"), hash1("quick brown fox")),  # 42
  min(hash2("The quick brown"), hash2("quick brown fox")),  # 89
  min(hash3("The quick brown"), hash3("quick brown fox")),  # 17
]
Signature: [42, 89, 17]

The math: If two documents have Jaccard similarity J, their MinHash signatures will match in approximately J fraction of positions.

Jaccard similarity:

J(A, B) = |A ∩ B| / |A ∪ B|

For sets A and B, it's the size of their intersection divided by their union.

Two documents with 80% overlapping shingles will have ~80% matching values in their MinHash signatures.
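This property is easy to check empirically. Below is a minimal sketch (stdlib Python, not the datasketch library) that builds signatures from random affine hash functions h(x) = (a·x + b) mod p and compares the signature match fraction to the true Jaccard similarity. Word-level shingles are used here for brevity:

```python
import random
import zlib

def minhash_signature(shingles, num_perm=200, seed=0):
    rng = random.Random(seed)
    prime = (1 << 61) - 1  # large Mersenne prime for the hash family
    # one random affine hash h(x) = (a*x + b) mod prime per "permutation"
    params = [(rng.randrange(1, prime), rng.randrange(prime))
              for _ in range(num_perm)]
    return [min((a * zlib.crc32(s.encode()) + b) % prime for s in shingles)
            for a, b in params]

doc_a = set("the quick brown fox jumps over the lazy dog".split())
doc_b = set("the quick brown fox leaps over a lazy dog".split())

jaccard = len(doc_a & doc_b) / len(doc_a | doc_b)  # 0.7 for these sets
sig_a = minhash_signature(doc_a)
sig_b = minhash_signature(doc_b)
match = sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

print(f"Jaccard: {jaccard:.2f}, signature match fraction: {match:.2f}")
```

With 200 hash functions the match fraction lands within a few percent of the true Jaccard; more hash functions tighten the estimate.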


LSH: Making It Fast

MinHash gives you signatures. But you still need to compare them. That's where LSH (Locality-Sensitive Hashing) comes in.

The trick: Divide each signature into bands. Hash each band. Documents with matching band hashes are candidates.

Example:

Signature length: 100
Bands: 20
Rows per band: 5

Document A signature: [42, 89, 17, 23, 56, ...]
Document B signature: [42, 89, 17, 23, 56, ...]

Band 1 (rows 0-4): [42, 89, 17, 23, 56]
Band 2 (rows 5-9): [...]
...

If ANY band matches completely, A and B are candidate duplicates.

You don't compare all pairs. You only compare documents that hash to the same bucket in at least one band.

Result: O(n) instead of O(n²). For 10M documents, that's the difference between seconds and weeks.
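The banding scheme also has a nice closed form. If two documents have Jaccard similarity s, one band of r rows matches completely with probability s^r, so with b bands the probability of becoming a candidate pair is 1 − (1 − s^r)^b. A quick sketch with the parameters above (20 bands × 5 rows):

```python
def candidate_probability(s: float, b: int = 20, r: int = 5) -> float:
    # probability that at least one of b bands of r rows matches fully
    return 1 - (1 - s ** r) ** b

for s in (0.3, 0.5, 0.8, 0.9):
    print(f"similarity {s}: candidate probability {candidate_probability(s):.3f}")
```

This is the classic LSH S-curve: pairs well above the threshold are almost always flagged, pairs well below it almost never are, and tuning b and r moves the knee of the curve.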


What It Solves: Lexical Similarity

MinHash+LSH excels at finding documents that share the same words, even with minor edits.

Catches:

Copy-paste with typos:

Original: "The quick brown fox jumps over the lazy dog"
Duplicate: "The qiuck brown fox jumps over the lazy dog"
Jaccard: ~95%

Reordered paragraphs:

Doc A: [Para 1, Para 2, Para 3]
Doc B: [Para 2, Para 1, Para 3]
Jaccard: Still high (same shingles, different order)

Copy-paste with insertions:

Original: "Machine learning is transforming industries."
Duplicate: "Machine learning is rapidly transforming industries."
Jaccard: ~85%

Scraped content with different HTML:

Site A: <p>Article text here</p>
Site B: <div>Article text here</div>
After stripping HTML, shingles match.

Real example from my deduplication:

Found 2.3M near-duplicates in a corpus of 10M scraped news articles. Most were:

  • Syndicated content (same AP article on 50 different sites)
  • Copy-paste plagiarism with minor rewording
  • Different encodings of the same text (UTF-8 vs Latin-1 artifacts)

Processing time: ~4 minutes on a laptop.


What It Doesn't Solve: Semantic Similarity

MinHash+LSH has no concept of meaning. It only sees character sequences.

Misses completely:

Paraphrases:

Doc A: "The company's revenue increased by 20% this quarter"
Doc B: "Quarterly earnings rose one-fifth compared to last period"
Jaccard: ~0% (almost no shared shingles)

Translations:

Doc A: "Hello, how are you?"
Doc B: "Bonjour, comment allez-vous?"
Jaccard: 0%

Synonyms:

Doc A: "The car is fast"
Doc B: "The automobile is quick"
Jaccard: ~20% (only "the" and "is" match)

Summarization:

Original: [500-word article about climate change]
Summary: [50-word summary with same meaning]
Jaccard: Very low (summary uses different words)

Real failure from my pipeline:

Two articles about the same event:

Article A (Reuters): "Apple announced a new iPhone with improved camera
capabilities and longer battery life at their annual event in Cupertino"

Article B (TechCrunch): "The latest device from Cupertino reveals enhanced
photography features and extended power duration"

Same story. Zero overlap in my deduplication results. MinHash didn't flag them because they share almost no exact words.


The Fundamental Difference

Lexical similarity: Do documents share the same words?

Semantic similarity: Do documents share the same meaning?

MinHash+LSH solves the first. Embeddings solve the second.

Analogy:

MinHash is like checking if two essays copied from the same Wikipedia article. It finds plagiarism.

Embeddings are like checking if two essays answer the same question in their own words. They find conceptual overlap.


When to Use What

| Use Case | Best Approach | Why |
| --- | --- | --- |
| Find exact duplicates at scale | MinHash+LSH | Fast, catches typos and minor edits |
| Detect copy-paste plagiarism | MinHash+LSH | Identifies reused text sequences |
| Deduplicate scraped web content | MinHash+LSH | Most duplicates are syndicated content (same text) |
| Find semantically similar docs | Embeddings + ANN | Understands meaning, not just words |
| Cluster articles by topic | Embeddings | Groups by concept, not text overlap |
| Find paraphrases or rewrites | Embeddings | MinHash will miss them entirely |

Cost comparison (10M documents):

  • MinHash+LSH: Minutes on a laptop, negligible API cost
  • Embeddings: Hours of compute or $500+ in API fees (OpenAI embeddings)

For my use case (deduplicating scraped news), MinHash caught 90% of duplicates. The remaining 10% required embeddings — but running embeddings on 10M docs wasn't worth it. I filtered to high-value content (100K docs) and ran embeddings only on those.


Implementation: The Basics

Using Python's datasketch library:

from datasketch import MinHash, MinHashLSH

# Create MinHash signatures
def create_minhash(text, num_perm=128):
    m = MinHash(num_perm=num_perm)
    # Use character 3-grams
    shingles = [text[i:i+3] for i in range(len(text) - 2)]
    for shingle in shingles:
        m.update(shingle.encode('utf-8'))
    return m

# Build LSH index
lsh = MinHashLSH(threshold=0.8, num_perm=128)

docs = [
    "The quick brown fox jumps over the lazy dog",
    "The qiuck brown fox jumps over the lazy dog",  # typo
    "A completely different sentence about cats"
]

for i, doc in enumerate(docs):
    m = create_minhash(doc)
    lsh.insert(f"doc_{i}", m)

# Query for duplicates
query = create_minhash(docs[0])
result = lsh.query(query)
print(result)  # ['doc_0', 'doc_1']  finds the typo duplicate

Parameters that matter:

  • num_perm: More hash functions = more accurate but slower (128-256 is typical)
  • threshold: Jaccard similarity threshold (0.8 = 80% overlap required)
  • Shingle size: Larger = fewer false positives, but misses small edits (3-5 is common)
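The shingle-size trade-off is easy to see with a toy sketch (hypothetical `shingles`/`jaccard` helpers of my own, not part of datasketch):

```python
def shingles(text: str, k: int) -> set:
    # set of character k-grams
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b)

original = "the quick brown fox"
edited = "the quick brown cat"  # one 3-character edit at the end
for k in (3, 5, 8):
    sim = jaccard(shingles(original, k), shingles(edited, k))
    print(f"k={k}: Jaccard = {sim:.2f}")
```

The same small edit destroys more of the shingle set as k grows, so similarity drops faster with larger shingles.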

Limitations I Hit

1. Sensitive to preprocessing

Adding/removing whitespace changes shingles dramatically:

"hello world"  → ["hel", "ell", "llo", "lo ", "o w", " wo", ...]
"hello  world" (two spaces) → different shingles

Solution: Normalize whitespace, lowercase, strip punctuation before shingling.
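A minimal normalizer sketch (one reasonable set of choices, not the only one):

```python
import re
import string

def normalize(text: str) -> str:
    # lowercase, strip punctuation, collapse whitespace runs to one space
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()

print(normalize("Hello,   World!"))  # "hello world"
```

Run the same normalizer before both indexing and querying, or the shingles won't line up.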

2. Short documents are noisy

Documents under ~100 characters don't have enough shingles for reliable comparison. We set a minimum length threshold.

3. False positives with common boilerplate

Many scraped articles share headers/footers:

"Published on [date] | Share this article | Subscribe to newsletter"

If boilerplate is 30% of the document, you get false matches. Solution: Strip common templates before hashing.

4. No understanding of structure

Code duplicates are tricky:

# Version A
def add(a, b):
    return a + b

# Version B
def add(x, y):
    return x + y

Semantically identical. Lexically different (variable names). MinHash won't catch this without normalization.
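One way to handle this for Python source is to normalize identifiers before shingling. The sketch below (my own illustration, using the stdlib tokenize module) replaces every non-keyword name with a placeholder so renamed variables produce identical token streams:

```python
import io
import keyword
import tokenize

def normalize_identifiers(code: str) -> str:
    # replace every non-keyword identifier with "ID" so that
    # variable renames no longer change the token stream
    out = []
    for tok in tokenize.generate_tokens(io.StringIO(code).readline):
        if tok.type == tokenize.NAME and not keyword.iskeyword(tok.string):
            out.append("ID")
        elif tok.type not in (tokenize.NEWLINE, tokenize.NL,
                              tokenize.INDENT, tokenize.DEDENT,
                              tokenize.ENDMARKER):
            out.append(tok.string)
    return " ".join(out)

version_a = "def add(a, b):\n    return a + b\n"
version_b = "def add(x, y):\n    return x + y\n"
print(normalize_identifiers(version_a) == normalize_identifiers(version_b))  # True
```

After normalization, both versions shingle identically and MinHash flags them as duplicates. This is a rough sketch; real code-clone detection also normalizes literals and formatting.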


The Hybrid Approach

What I ended up with:

Stage 1: MinHash+LSH (fast filter)

  • Deduplicate 10M docs → 7.7M unique (removed 2.3M lexical near-duplicates)
  • Time: 4 minutes
  • Cost: $0

Stage 2: Embeddings (semantic filter on high-value subset)

  • Filter to important docs (100K)
  • Compute embeddings, cluster by semantic similarity
  • Merge clusters with >0.9 cosine similarity
  • Time: 20 minutes
  • Cost: ~$50 (OpenAI API)

Total: 95% deduplication accuracy in under 30 minutes.

Using embeddings alone on 10M docs would have taken hours and cost $500+. Using MinHash alone would have missed 5% of semantic duplicates.

The right answer was both.


Jaccard Verification: The Safety Check

Even after LSH flags candidates, you should verify.

LSH gives you candidate pairs. Compute actual Jaccard similarity to confirm:

def create_shingles(text, k=3):
    # character k-grams — must match the shingling used to build the index
    return [text[i:i+k] for i in range(len(text) - k + 1)]

def jaccard_similarity(set_a, set_b):
    intersection = len(set_a & set_b)
    union = len(set_a | set_b)
    return intersection / union if union > 0 else 0

# Verify candidate pairs flagged by LSH
for doc_a, doc_b in candidate_pairs:
    shingles_a = set(create_shingles(doc_a))
    shingles_b = set(create_shingles(doc_b))
    similarity = jaccard_similarity(shingles_a, shingles_b)

    if similarity > 0.8:
        mark_as_duplicate(doc_a, doc_b)

LSH is probabilistic. It can have false positives (especially with low band counts). Jaccard verification catches them.


The Takeaway

MinHash+LSH is excellent for what it does: finding documents that share the same words.

It's fast. It scales. It catches copy-paste, typos, and minor edits.

But it doesn't understand meaning. "The car is fast" and "The automobile is quick" are different to MinHash. If you need semantic deduplication, you need embeddings.

The trick is knowing which problem you're solving:

  • Lexical duplicates: Use MinHash+LSH
  • Semantic duplicates: Use embeddings
  • Both: Use MinHash first (fast filter), embeddings second (high-precision check on remaining candidates)

For my 10M document corpus, MinHash caught 90% of duplicates in 4 minutes. That's good enough for most pipelines.

The other 10%? That's where embeddings come in. But you don't need them for everything.


Gopal Khadka