This paper present a method to detect near-duplicate page suitable for online query. It use simhash and hamming distance for similarity detection.
This article was published at the WWW 2007 conference. Paper’s authors are Gurmeet Manku (Google Inc.), Arvind Jain (Google Inc.), and Anish Das Sarma (Stanford University).
Near duplicate document detection is useful for many reasons including, improving search, clustering document and detecting spam. The technique presented focus on near detection of page web but I think that with little modification it can also been use in document clustering.
The paper is well written even if the section 3 requires a good knowledge of the area to be understandable. The key point of the paper is the combination of the simhash to create document fingerprint with the hamming distance to detect near fingerprint
simhash is a hash function because it reduce a high-dimensional vector to a fixed size fingerprint. At the opposite of cryptographic hash function where changing one vector dimension will complexity change the fingerprint, simhash is pretty conservative. The property that two near documents have close fingerprint make it suitable for near detection.
Hamming distance is computed on the set of fingerprints generated by simhash. In the paper they show that a distance of 3 is sufficient for near document detection. Hamming distance in this context means that they search fingerprints that differs from the submitted one for at most n bytes.
The technique used to tackle the hamming distance problem of finding hash that have a hamming distance of n bytes is clever and the experiment section is quite complete. The technique involve the computation of lookup table.
Finally the section 5 which is a survey of duplicate detection is not very useful because the content of the article is mainly directed to people that are already aware of the domain.
If you are looking for a way to detect near document (and duplicate of course), you should defiantly look at the technique described in this paper. If you seek a more simplistic one take a look at the Liechtenstein distance or the Jaro-Winkler distance.


Latest Comments