Deduplicating Documents at Scale
Deduplication sounds simple until you try it on real data. With a few thousand short texts, almost any approach works. Once you move to millions of long documents in multiple languages, with noisy input and a requirement to find near-duplicates (documents that are very similar but not identical), everything changes: algorithms, data structures, memory layout,…
