Archive for May, 2007

May 31

Detecting near-duplicates for web crawling [Article Review]

This paper presents a method for detecting near-duplicate pages that is suitable for online queries. It uses simhash and Hamming distance for similarity detection.

This article was published at the WWW 2007 conference. The paper’s authors are Gurmeet Manku (Google Inc.), Arvind Jain (Google Inc.), and Anish Das Sarma (Stanford University).

Near-duplicate document detection is useful for many reasons, including improving search, clustering documents, and detecting spam. The technique presented focuses on near-duplicate detection of web pages, but I think that with a little modification it could also be used for document clustering.

The paper is well written, even if section 3 requires a good knowledge of the area to be understandable. The key point of the paper is the combination of simhash, to create document fingerprints, with the Hamming distance, to detect nearby fingerprints.

simhash is a hash function because it reduces a high-dimensional vector to a fixed-size fingerprint. Unlike a cryptographic hash function, where changing one dimension of the vector completely changes the fingerprint, simhash is pretty conservative: a small change in the input changes only a few bits of the fingerprint. The property that two near-duplicate documents have close fingerprints makes it suitable for near-duplicate detection.
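To make the idea concrete, here is a minimal sketch of a simhash-style fingerprint in Python. It is an illustration, not the paper's implementation: the function name `simhash`, the use of truncated MD5 as the per-token hash, and giving every token occurrence a weight of 1 are all assumptions made for the example.

```python
import hashlib


def simhash(tokens, bits=64):
    """Toy simhash: every token votes +1 or -1 on each bit of the fingerprint."""
    counts = [0] * bits
    mask = (1 << bits) - 1
    for token in tokens:
        # Assumption: MD5 truncated to `bits` bits serves as the per-token hash
        # (so `bits` must not exceed 128).
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest, "big") & mask
        for i in range(bits):
            # Assumption: each token occurrence carries a weight of 1.
            counts[i] += 1 if (h >> i) & 1 else -1
    # Bit i of the fingerprint is set when the +1 votes outnumber the -1 votes.
    fingerprint = 0
    for i in range(bits):
        if counts[i] > 0:
            fingerprint |= 1 << i
    return fingerprint
```

Because each token only nudges the per-bit counters, replacing one token in a long document tends to flip few bits, whereas a cryptographic hash of the whole document would change unpredictably.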

The Hamming distance is computed between the fingerprints generated by simhash. In the paper they show that, with 64-bit fingerprints, a distance of 3 is sufficient for near-duplicate detection. Hamming distance in this context means that they search for fingerprints that differ from the submitted one in at most n bits.
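For fingerprints stored as integers, the Hamming distance is simply the number of bit positions in which the two values differ. A minimal sketch in Python follows; the helper names `hamming_distance` and `is_near_duplicate` are hypothetical, and the default threshold of 3 is the value quoted above.

```python
def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which fingerprints a and b differ."""
    # XOR leaves a 1 exactly where the two fingerprints disagree.
    return bin(a ^ b).count("1")


def is_near_duplicate(a: int, b: int, k: int = 3) -> bool:
    """True when the fingerprints differ in at most k bits."""
    return hamming_distance(a, b) <= k
```

For example, `hamming_distance(0b1011, 0b1001)` is 1, since only the second-lowest bit differs.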

The technique used to tackle the problem of finding fingerprints within a Hamming distance of n bits is clever, and the experiment section is quite complete. The technique involves precomputing lookup tables.
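The intuition behind table-based lookup can be sketched as follows. This is a simplified illustration and not the paper's exact table design: it assumes the fingerprint is split into k + 1 blocks, so that two fingerprints differing in at most k bits must agree exactly on at least one block. Each block gets its own table for exact-match lookup, and candidates are then verified with the full Hamming distance. The class name `BlockIndex` and its methods are hypothetical.

```python
from collections import defaultdict


class BlockIndex:
    """Toy index: exact match on one block, then verify the full distance."""

    def __init__(self, bits=64, k=3):
        self.k = k
        self.blocks = k + 1               # pigeonhole: one block must match
        self.width = bits // self.blocks  # e.g. 64 bits, k = 3 -> 16-bit blocks
        self.tables = [defaultdict(set) for _ in range(self.blocks)]

    def _keys(self, fp):
        # One key per block: the value of that block's bits.
        mask = (1 << self.width) - 1
        return [(fp >> (i * self.width)) & mask for i in range(self.blocks)]

    def add(self, fp):
        for table, key in zip(self.tables, self._keys(fp)):
            table[key].add(fp)

    def query(self, fp):
        # Collect every stored fingerprint sharing at least one block with fp.
        candidates = set()
        for table, key in zip(self.tables, self._keys(fp)):
            candidates |= table.get(key, set())
        # Keep only candidates within k bits of fp.
        return {c for c in candidates if bin(c ^ fp).count("1") <= self.k}
```

The trade-off in this sketch is memory for speed: every fingerprint is stored once per table, but a query touches only the few buckets that share a block with it instead of scanning the whole collection.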

Finally, section 5, which is a survey of duplicate detection, is not very useful, because the article is mainly aimed at people who already know the domain.

If you are looking for a way to detect near-duplicate documents (and exact duplicates, of course), you should definitely look at the technique described in this paper. If you are after something simpler, take a look at the Levenshtein distance or the Jaro-Winkler distance.
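For comparison, the simpler edit-distance approach (the Levenshtein distance) fits in a few lines. The sketch below is the textbook dynamic-programming version, written in Python for illustration; the function name `levenshtein` is just a label for the example. Unlike fingerprinting, it compares two strings directly, so its cost grows with the product of their lengths.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits turning a into b."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # turning a[:i] into "" takes i deletions
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, free if equal
            ))
        previous = current
    return previous[-1]
```

For instance, `levenshtein("kitten", "sitting")` is 3: two substitutions and one insertion.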

The article (pdf) / local copy

May 24

Silence on the Wire: A Field Guide to Passive Reconnaissance and Indirect Attacks [Book review]

Silence on the Wire is M. Zalewski’s first book. However, he has been involved in computer security for a long time. I’ve been reading his Phrack articles for years; they are always bright and very sharp. He is famous for the vulnerabilities he discovers, his activity on security mailing lists such as Bugtraq, and his study of TCP sequence prediction summarized in “strange attractor” (and maybe one day he might be famous as an aquarist :)). If you take the time to visit his homepage, you will see that he has also written a bunch of creative software. His most famous tool is the passive OS fingerprinter p0f, currently used in OpenBSD and in various antispam and honeypot projects.

I read his book when it was released but never had the chance to review it, so here we go:

I found the guiding thread of the book original and well done: you start with keyboard security and end up at Internet security, following the journey of the bits. This makes it easy to follow. The writing style is similar to that of his Phrack articles and therefore pleasant. The material is good; however, some chapters give a sense of “déjà vu” because the content is much the same as what can be found in Phrack.

It is clearly a book by a security addict for security addicts. If you are new to the field, it is not the best way to start because it assumes a lot of background. I particularly enjoyed the mix of theory and practice. This mix is quite unique in the security literature I have come across, and that is why I particularly recommend this book: like the rest of the author’s work, it has that little something that makes it special. My two favorite chapters are CHAPTER 1: I CAN HEAR YOU TYPING and CHAPTER 9: FOREIGN ACCENT.

If you are interested in security, this is a book that you can’t afford to miss. It will give you, as the author says, “Food for Thought”.