Archive for December, 2007

Dec 23

Go for understandability not readability

Readability analysis is used to tell how readable a text is. Although the idea is very old, it has only just started to be used as a criterion for page analysis. In this post, I will try to explain why the goal of readability is misunderstood. In particular, for web content analysis, what people really want to analyze is understandability, not readability.

Readability analysis

First, let’s briefly look at what a readability analysis is. It is a set of algorithms designed to compute an index that tells how hard a text is to read.

It was primarily invented to ensure that school books can be read by kids.

The idea behind any readability test is very simple: long words and long sentences are harder to read than shorter ones. There is no magic in these criteria; they are purely syntactic.

To make it clearer, let’s illustrate how one of the most popular readability indexes works: the Gunning-fog index, which was invented in 1952. Note that it was designed to work on English text.

Gunning-fog principle

To evaluate how difficult a text is to read, the Gunning-fog index works like this:

  1. Find the average sentence length (divide the number of words by the number of sentences).
  2. Count words with three or more syllables (complex words), not including proper nouns (for example, Intel or iPhone), compound words, familiar jargon, or words that only reach three syllables through a common suffix such as -es, -ed, or -ing.
  3. Add the average sentence length and the percentage of complex words (e.g., add 13.37 for 13.37%, not 0.1337)
  4. Multiply the result by 0.4

The Gunning-fog formula is (taken from Wikipedia):

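$$\text{fog index} = 0.4 \left( \frac{\text{words}}{\text{sentences}} + 100 \cdot \frac{\text{complex words}}{\text{words}} \right)$$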

This formula is pretty straightforward except for the 0.4 factor. It can be viewed as a “weight” that maps the index value to school grades. This is of course totally arbitrary.
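The four steps above can be sketched in a few lines of Python. This is only a rough sketch under stated assumptions: the syllable counter is a crude vowel-group heuristic of my own, and the exclusions of step 2 (proper nouns, compound words, suffixes, jargon) are not implemented, so the values will not match those of textalyser.net or any other tool. The names `count_syllables` and `gunning_fog` are hypothetical.

```python
import re

def count_syllables(word):
    """Crude heuristic (an assumption): one syllable per group of vowels."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def gunning_fog(text):
    """Approximate Gunning-fog index of an English text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return 0.0
    # Step 1: average sentence length.
    average_sentence_length = len(words) / len(sentences)
    # Step 2: complex words (three or more syllables), without the exclusions.
    complex_words = [w for w in words if count_syllables(w) >= 3]
    # Step 3: the percentage is added as e.g. 13.37, not 0.1337.
    percent_complex = 100 * len(complex_words) / len(words)
    # Step 4: apply the 0.4 weight.
    return 0.4 * (average_sentence_length + percent_complex)
```

Note that nothing in this code looks at what the words mean, which is exactly the limitation discussed in this post.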

Because they are very simple to implement, readability indexes have become quite popular nowadays. Many blog posts detail them (here and here for example), and “smart” Web analysis tools such as websitegrader use them.

Here is a screenshot of the analysis of this blog done by websitegrader (I really love this service):

rank

As you can see, this blog has a readability index of 4th grade. Indeed, a 4th grade student can read it, but I doubt that a 4th grade student would be able to understand it. To emphasize even more that short words and sentences, while easier to read, are not necessarily the most understandable, let’s work through a little example.

Gunning-fog example

Consider the reading indexes of the two following texts. I used textalyser.net to compute the Gunning-fog index (so you can do it at home if you don’t believe me :) ). The first is taken from a cookie recipe (I love cookies :) ) by Peggy Trowbridge Filippone (Thanks for your columns :) ):

“If you like pecan sandies, you’ll love this chocolate variation of a favorite cookie. These cookies are not overly sweet, yet have a rich chocolate flavor. Easy to make and great for shipping as gifts from the kitchen, since they hold up well.”

Text analyzer reports the following results for this text:

cookie

The second is taken from a graphics card review done by the very popular site Tom’s Hardware (Love your reviews too :) ):

“The GeForce 8800 GT 256 MB is a 256 MB version of the GeForce 8800 GT. The decreased memory capacity is also slower. With the official specifications at 700 MHz, the bandwidth is 22% lower. It’s a difference that’s bigger than that of the first generation of GeForce 8800 GTS (640 and 32 MB), but smaller than between the GeForce 7800 GTX and its 512 MB version”

cg

As you can see, these two texts are nearly equivalent from a syntactic point of view (number of words, average sentence length, etc.). However, if you look at the Gunning-fog index, you see that the cookie recipe is harder to read than the graphics card review (8.8 versus 8.9). It might be harder to read, but it is obvious that the technical article is harder to understand. Which is exactly my point.

You might also have noticed that the readability index at the bottom gives the opposite ranking. This is because it uses a different formula (I don’t know which one). But with an appropriate technical article and recipe, it can still rank the recipe as harder than the technical one.

At this point I should have convinced you that a readability index is not meant to tell you how hard your text is to understand. Yet if you are involved in writing (in the broad sense: blogging, marketing, advertising …), how well your text is understood is exactly the metric you are looking for. You want something that tells you whether your text can be understood by your target audience. So what you really want is not a readability analysis but an understandability analysis. Although no such metric currently exists, it is a hot research topic. As a matter of fact, a research article was released on the subject a few weeks ago.

Understandability analysis

So how can we measure the understandability of a text? Understanding a text means understanding the concepts behind the text. For example, if I read a text about blog optimization for search engines, I need to understand the concepts of blog and search engine to be able to understand the text. It is not syntactic, it is semantic. Hence an understandability criterion needs to be semantic. That is why it is so difficult to build. Counting words or measuring word length is easy; determining the meaning of a word or a sentence is not.

To conclude this post, let’s take a look at some of the most promising approaches to achieving understandability analysis.

Lexical density

First of all, trying to determine the understandability of a text by looking at each word is not relevant because many words such as “love, neat, great” can be viewed as “blank” for understandability. For example, consider the two sentences:

“a great toy”

which is easy to understand and

“EXPTIME-complete is a great complexity”

which is not understandable unless you know about algorithmic complexity. In this context it is useful to take into account the lexical density of the text to isolate the words that are recurrent. A rule of thumb (I can’t remember who said it first) is that “the more a word appears in a text, the more relevant it is to the text’s topic”. For example, look at the lexical density of this Steve Jobs speech (taken from Russell’s blog). The bigger the word, the more often it was said:

jobscloud

Even without knowing which speech it is, you can easily deduce that it was about the iPhone :) From the understandability point of view, Jobs’s speech requires knowing at least what a phone and an iPod are. If you don’t understand what a processor is, you will probably be lost for a few slides, but you will still get the key message.
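The rule of thumb above (the more a word appears, the more it matters for the topic) is easy to try out. Here is a minimal Python sketch; the function name `top_words` and the tiny `STOP_WORDS` list are my own assumptions for illustration, and a real tool would need a much longer list of “blank” words.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny list of "blank" words to ignore.
STOP_WORDS = {"a", "an", "the", "is", "are", "and", "of", "to", "in", "it", "this", "that"}

def top_words(text, n=5):
    """Return the n most frequent non-blank words with their counts."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS)
    return counts.most_common(n)
```

A word cloud like the one above is essentially this ranking drawn with a font size proportional to each count.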

Machine learning and word mapping

Now that we know which words we want to analyze, one question remains: how do we estimate the level of understanding required for a given word? One way to do it is to use a supervised machine learning algorithm (one trained on labeled examples). This is not as hard as it sounds. You have already heard of machine learning algorithms: they are used for spam filtering. They tell you whether a text is ham (legitimate text) or spam. The most widely used algorithm for this is the (naive) Bayesian filter.

It works by assigning to each word (token) a probability of being spam and then computing the probability for a given message by combining the probabilities of its tokens. This is not completely accurate, but it is the general principle. We say that it is a text classifier: it classifies a text as spam or ham.

This classification can be extended to more than the two categories spam and ham. We can imagine a Bayesian filter that classifies text into categories such as: easy to understand, average, hard and specialized.
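To make the idea concrete, here is a minimal multi-category naive Bayes sketch in Python. It is an illustration under assumptions, not a production filter: the class name `NaiveBayes`, the tokenizer and the category labels are hypothetical, token probabilities are combined by adding their logarithms (equivalent to multiplying them), and add-one smoothing is used so that an unseen word does not zero out a category.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

class NaiveBayes:
    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> token counts
        self.doc_counts = Counter()              # label -> number of training texts
        self.vocabulary = set()

    def train(self, text, label):
        tokens = tokenize(text)
        self.word_counts[label].update(tokens)
        self.doc_counts[label] += 1
        self.vocabulary.update(tokens)

    def classify(self, text):
        """Return the most probable label, or None if nothing was trained."""
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label in self.doc_counts:
            # Log of the prior: how common this category is in the corpus.
            score = math.log(self.doc_counts[label] / total_docs)
            total_tokens = sum(self.word_counts[label].values())
            for token in tokenize(text):
                # Add-one smoothing handles words never seen for this label.
                count = self.word_counts[label][token] + 1
                score += math.log(count / (total_tokens + len(self.vocabulary)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

Training it means calling `train` with texts already labeled “easy”, “average”, “hard” or “specialized”, which is where the corpus problem discussed next comes from.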

Before being able to classify a text, the Bayesian classifier needs to be trained. It needs a corpus of text for every level of understandability wanted. This leads to three problems. First, the classifier will be unable to classify a word if it hasn’t seen it before, for example a newly coined term or a very technical term. Secondly, it is language specific. For example, the term “deja vu” is very common in French (trust me on this :) ) whereas “deja vu” is not that common in English (at least for a 6th grade kid). Finally, creating a text corpus is a huge burden.

Conclusion

Understandability analysis is something that can be very handy for many people, from writers to bloggers to advertisers, but unlike readability criteria, there is no simple means to compute an index. Although algorithms and principles are available as a foundation, it will be some time before such an index is available.

Search engines can also use understandability analysis: wouldn’t it be nice to be able to restrict a search to specialized articles or kids’ articles?

Dec 07

Blog trackback Spam analysis

This Friday, I present my analysis of a botnet that spams blogs through the trackback/pingback mechanism. Its operators try to abuse the blog trackback mechanism to improve their web ranking on search engines. I have been able to collect data about this botnet for around a year because it is targeting my personal website.

It is useful to analyze the collected data because it shows how spam evolves over time and how the spammers operate. This is quite a hot topic, and other blogs (see here and here) have entries on the subject. However, to the best of my knowledge, this post is the first to provide a complete technical analysis based on a vast amount of data.

I tried to analyze every aspect of this spam, from the daily activity I monitor, to the type of machines involved, to the type of sites that they are promoting. Without spoiling all the fun, I even had to do a binary analysis of the file they try to install on your PC.

Before getting into details, let’s take a look at the context of this spam. It is a very specific type of spam because it targets blogs and not mail. Therefore the input and output of this spam are quite different from the ones you observe in email.

One key specificity of blogs is that they aim at creating interaction between users but also between blogs. That is the “blogosphere”. A common mechanism to allow interaction between blogs is the trackback/pingback mechanism.

halo wordpress

The trackback Flood

The trackback mechanism is used to notify an author that you have made a link to one of their documents. It enables authors to keep track of who is linking to, or referring to, their articles. It is also used to allow visitors to easily navigate between posts that are related to the same subject.

For example, if you have a blog and write about this article, then you can send a trackback to let the author know about it. Your trackback will be added to the list of sites that refer to it.

The trackback specification is due to Six Apart, which implemented it in Movable Type in 2002. Since 2006 it has been the subject of an IETF working group and should one day become a standard.

Finally, trackbacks help generate traffic and improve site rankings. It will also make the author of the post happy to see that people find his work useful :)

Spammers found this latter functionality very “useful”. They use it to optimize their search engine ranking and drive traffic to their sites. It is appealing to them because the trackback mechanism allows them to inject a link that points to their site into other blogs in an automatic fashion.

While anti-spam techniques are interesting, I will not detail them in this post because they are a little bit out of scope. If I get some requests on the subject, I will write a detailed post about it.

Version 1 of the dotclear blog system implements the trackback mechanism in the file tb.php. A trackback is done by calling this file with the post id.

For example, the url www.mysite.com/tb.php?id=18 is used to add a trackback to post 18.

More technically, a trackback is an HTTP POST request (not an HTTP GET, as is commonly written) that submits four variables: the title, the url, the excerpt and the blog_name.

Here is a sample of a spammed trackback gathered by my trap page:

[title] => Cis transgender wiki
[url] => http://cis-transgender-wiki.spacenow ronomy.yyyy/index.html
[excerpt] => wiki Cis transgender wiki …
[blog_name] => Cis transgender wiki
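For readers who want to see what a legitimate trackback ping looks like on the wire, here is a small Python sketch that builds (but does not send) such a POST request with the standard library. The endpoint reuses the example url given above; the function name `build_trackback_request` and all the field values are hypothetical.

```python
from urllib.parse import urlencode
from urllib.request import Request

def build_trackback_request(endpoint, title, url, excerpt, blog_name):
    """Build the HTTP POST request carrying the four trackback variables."""
    body = urlencode({
        "title": title,
        "url": url,
        "excerpt": excerpt,
        "blog_name": blog_name,
    }).encode("utf-8")
    return Request(
        endpoint,
        data=body,
        method="POST",
        headers={"Content-Type": "application/x-www-form-urlencoded; charset=utf-8"},
    )

# Hypothetical values, for illustration only.
req = build_trackback_request(
    "http://www.mysite.com/tb.php?id=18",
    title="My post",
    url="http://blog.example.org/my-post",
    excerpt="A short quote from my post",
    blog_name="My blog",
)
```

Passing `req` to `urllib.request.urlopen` would actually send the ping. The spam software does the same thing, only with the variables filled in as in the sample above.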

My personal site bursztein.eu has been the target of this type of spam for more than one year. I have kept track of every request since the beginning in a combined Apache log. The file now exceeds 700MB of data. This gives you an idea of how intense the spam flood can be.

More recently, I have also set up a trap page in order to log spam requests. The trap page is used to log as much information as possible in an SQL backend. One of my friends calls it a honeyblog :) And that is pretty much what it is: a honeypot for blog trackbacks. It gives spammers the impression that their trackbacks are really inserted into my web page, but instead I record and analyze them. I have even created an incomplete live report of the spam activity.

More specifically, I created this honeyblog for two main purposes:

  1. To be able to analyze the content of trackbacks, since Apache logs do not record request content, and maybe write a publication on it.
  2. To be able to generate a spammer IP list that can be used by anyone.
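The logging side of such a trap page can be sketched in a few lines. The snippet below is not the code of my honeyblog; it is a minimal Python illustration that assumes an SQLite database, and the table layout and function names (`open_store`, `log_trackback`, `spammer_ips`) are hypothetical.

```python
import sqlite3
from datetime import datetime, timezone

def open_store(path=":memory:"):
    """Open (or create) the database that receives the trapped trackbacks."""
    db = sqlite3.connect(path)
    db.execute(
        """CREATE TABLE IF NOT EXISTS trackbacks (
               received_at TEXT, ip TEXT, user_agent TEXT,
               title TEXT, url TEXT, excerpt TEXT, blog_name TEXT)"""
    )
    return db

def log_trackback(db, ip, user_agent, form):
    """Record one trackback attempt; `form` maps POST variable names to values."""
    db.execute(
        "INSERT INTO trackbacks VALUES (?, ?, ?, ?, ?, ?, ?)",
        (
            datetime.now(timezone.utc).isoformat(),
            ip,
            user_agent,
            form.get("title", ""),
            form.get("url", ""),
            form.get("excerpt", ""),
            form.get("blog_name", ""),
        ),
    )
    db.commit()

def spammer_ips(db):
    """Purpose 2: the list of distinct IPs seen by the trap."""
    rows = db.execute("SELECT DISTINCT ip FROM trackbacks ORDER BY ip")
    return [row[0] for row in rows]
```

The parameterized query matters here: the logged values come straight from spammers, so they must never be pasted into the SQL text.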

So how did my personal site end up being targeted by this kind of spam? Well, it is subject to this attack because at one point I used dotclear for a couple of months before I decided to switch to Wordpress. During this period, I was probably added to the spammers’ target blog list, and since then my site has experienced spam attempts on a regular basis.

Monthly activity

At this point, you probably wonder how bad the situation is. If it is only a couple of requests every day, it should not be a big deal. Well, it is a big deal! Look at the chart below, which reports the spam activity against my site for the last 12 months:

spamattemps

Note that the number of spam attempts for December is a forecast. It seems quite accurate if you take a look at the current live report.

Before trying to analyze the spammers’ objectives, let’s take a closer look at an ordinary day of spamming to understand how the spamming platform runs. I chose the 4th of December 2007, which was the first day of my honeyblog.

Daily activity

On the 4th of December, 45798 different IPs tried to add 60148 trackback spams to my site. This means that my server had to generate 60148 pages for nothing. Load analysis shows that before the installation of the honeyblog, spam at its peak consumed up to 30% of my server’s CPU power.
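Figures like these two can be extracted from a combined Apache log with a short script. Here is a hedged Python sketch: the regular expression only reads the start of each line, the function name `trackback_stats` is hypothetical, and it assumes that spam attempts are exactly the POST requests to /tb.php.

```python
import re

# Start of a combined-format line: client IP, identity, user, [timestamp],
# then the quoted request, e.g. "POST /tb.php?id=18 HTTP/1.0".
REQUEST = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "(\S+) (\S+)')

def trackback_stats(lines, script="/tb.php"):
    """Return (number of distinct IPs, number of trackback POST attempts)."""
    ips = set()
    attempts = 0
    for line in lines:
        match = REQUEST.match(line)
        if not match:
            continue  # skip malformed lines
        ip, method, path = match.groups()
        if method == "POST" and path.startswith(script):
            ips.add(ip)
            attempts += 1
    return len(ips), attempts
```

Because it consumes the lines one by one, it can be fed an open file object directly, which matters with a 700MB log.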

Here are some of the most interesting statistics computed during the analysis.

user agent

First of all, the user agent distribution. The user agent is a variable sent by the client to indicate which software it uses. Back in 2006, spammer user agents were very standard: Internet Explorer, Opera … But today they only use a blog-specific user agent: Wordpress. This makes sense because they try to impersonate real blog trackbacks. Here is a little chart of the user agent distribution for the 4th of December:

useragent

As you can see, they are not using the current version of Wordpress (2.3.x). This makes me believe that they are not continuously updating their software. It seems that a company specializes in writing this type of spam software. I won’t link to them, of course, for obvious reasons, but if you wish, check out the geek and fly blog: Adam has a nice post about this software.
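A user agent breakdown like the one charted above boils down to a counting job. In the combined log format the user agent is the last double-quoted field of each line, which the Python sketch below relies on; the function name `user_agent_shares` is hypothetical.

```python
import re
from collections import Counter

# In the combined log format the user agent is the last double-quoted field.
LAST_QUOTED = re.compile(r'"([^"]*)"\s*$')

def user_agent_shares(lines):
    """Map each user agent to its share of the requests, in percent."""
    counts = Counter()
    for line in lines:
        match = LAST_QUOTED.search(line)
        if match:
            counts[match.group(1)] += 1
    total = sum(counts.values())
    # most_common() orders the result from the most to the least frequent.
    return {agent: 100 * n / total for agent, n in counts.most_common()}
```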

trackback submitter comment spam blog mass link software

Spam activity

Next I was interested in the flood behavior. At the beginning I thought that if I graphed spam activity by hour I would find a “loop”. This would be consistent with the idea that they flood one site, then another, and when the list is exhausted they start again at the beginning of the list. Since my blog is only one entry in their database, I should have observed activity peaks. However, if you look at the chart below, you see that I was wrong.

spamdayactivity

Indeed, there seems to be a 7-hour period of activity, but there are never fewer than 1700 trackback posts per hour. The only explanation for this distribution is that the entire set of flooding computers is coordinated to distribute their activity. This allows us to infer that the entire spam flood (or at least most of it) is the result of the activity of a single individual or company.
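The hourly view is again a simple bucketing of the log lines. A minimal Python sketch, assuming the usual combined-log timestamp such as [04/Dec/2007:13:05:01 +0100] (the function name `posts_per_hour` is hypothetical):

```python
import re

# The timestamp is the bracketed field, e.g. [04/Dec/2007:13:05:01 +0100].
TIMESTAMP = re.compile(r"\[([^\]]+)\]")

def posts_per_hour(lines):
    """Return a list of 24 counters, one per hour of the day."""
    per_hour = [0] * 24
    for line in lines:
        match = TIMESTAMP.search(line)
        if match:
            # The hour is the second colon-separated part of the timestamp.
            hour = int(match.group(1).split(":")[1])
            per_hour[hour] += 1
    return per_hour
```

With such a list, `min(per_hour)` gives the quietest hour of the day, which is the number that never drops below 1700 in my data.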

To observe the spam cycle more precisely, it might be possible to try to isolate one cycle of activity. That could then allow us to infer the size of the spammers’ target database. But I guess this is more for a research paper than for a blog post.

Spam platform

I wanted to confirm the hypothesis that the spam was coordinated by a single entity. If this hypothesis was true, then the platform (the set of computers) should be quite homogeneous. To check that, I ran a couple of tests. Three of the most interesting results are the geolocation of the computers, the type of OS running on these computers, and their uptime.

Platform geolocation

First I wanted to know in which country the platform was located. To find its origin, I ran the list of spammer IPs against an IP geolocation database. I used the webnet77 free database. As I expected, all 45798 IPs belong to a single country: Russia.
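An IP geolocation database is essentially a list of address ranges, each tagged with a country, so a lookup is a binary search over the range starts. Here is a Python sketch under assumptions: the tuple layout (first address, last address, country code) is hypothetical and is not the actual file format of the webnet77 database, and the class name `GeoTable` is mine.

```python
import bisect
import ipaddress

class GeoTable:
    def __init__(self, ranges):
        """ranges: iterable of (first_ip, last_ip, country_code) tuples."""
        self.rows = sorted(
            (int(ipaddress.ip_address(first)), int(ipaddress.ip_address(last)), country)
            for first, last, country in ranges
        )
        self.starts = [row[0] for row in self.rows]

    def country(self, ip):
        """Return the country code of the range containing ip, or None."""
        value = int(ipaddress.ip_address(ip))
        # Find the last range whose first address is <= the looked-up address.
        index = bisect.bisect_right(self.starts, value) - 1
        if index >= 0 and self.rows[index][0] <= value <= self.rows[index][1]:
            return self.rows[index][2]
        return None
```

Running every spammer IP through `country` and counting the answers gives the kind of chart shown below.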

computerorigin

This is an additional good hint that the flood is performed by a single entity. A good and unanswered question is why a spam platform located in Russia promotes Chinese sites.

Platform OS

This result also raises an additional question: how many computers are really behind this set of IPs? At this point it was quite likely that there were in fact few computers with many IPs, in order to bypass spam filters. To validate this new hypothesis, I probed 100 random IPs with nmap.

osrepatition

As you see, the platform OSes are pretty consistent and unusual. I never expected that it would be FreeBSD. For those who don’t know: FreeBSD is a UNIX, but way less popular than Linux (I am not discussing OS merits here). You have probably heard of it because the OS X core is based on it. At this point I had a strong suspicion that there were only a few computers in the platform.

Computer uptime

To be sure, I ran another test. I measured the uptime of the computer behind each IP. See the distribution below:

computeruptime

As you see, there are only 6 different uptimes for these 100 IPs. Of course this measurement needs to be refined and extended to many more IPs to be sure, but it really tends to confirm that there are few computers, because it is very unlikely that two different computers have the same uptime. This distribution also indicates that the platform is pretty solid, given the long uptime of some computers. It also shows that the spam runs 24 hours a day.

Spam content

Of course, I ran a couple of basic analyses on the spam content. For this purpose I used a corpus of 1139 spam samples. I used standard text analysis techniques to determine the prominent characteristics of the spam. In this already too long post, I only detail the results for the TITLE variable because the analysis of the other variables does not differ very much.

In these 1139 titles there were 3936 words.

Statistical breakdown

Number of different words : 2041
Complexity factor (Lexical Density) : 51.9%
Readability (Gunning-Fog Index) : (6-easy 20-hard) 2.9
Total number of characters : 28942
Number of characters without spaces : 22949
Average Syllables per Word : 1.86
Average title length (words) : 3.68
Max title length (words) : 11 ( cu*m in here mouth she will spit it back in yours)
Min sentence length (words) : 1 ( sexyimages)
Readability : (100-easy 20-hard, optimal 60-70) 45.9

Well, as you see, titles are short and have a pretty large lexical density. I never thought that the sex lexicon was so large :) A very important analysis is the top words. It shows that blacklisting words will not work well because titles do not reuse the same sentence again and again. If you look at the word occurrence frequency ranking below, you will see that the most frequent word accounts for only 2.7% of the words in the titles, and beyond rank 6 this percentage drops below 1%. (I have added extra * to words to avoid being flagged as a p*or*n site)

Word Occurrences Frequency Rank

Word Occurrence % Rank
nud*e 106 2.7% 1
fr*ee 80 2% 2
se*x 76 1.9% 3
por*n 56 1.4% 4
pics 51 1.3% 5
na*ked 50 1.3% 5
video 42 1.1% 6
girls 31 0.8% 7
se*xy 31 0.8% 7
he*ntai 31 0.8% 7
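The figures in the table above are consistent with simple ratios over the 3936 words of the corpus: 2041 different words out of 3936 gives the 51.9% lexical density, and 106 occurrences out of 3936 gives 2.7%. The Python sketch below computes the same two quantities from a list of titles. It is not the tool used for the analysis, the function name `word_shares` is hypothetical, and splitting on whitespace is a simplifying assumption.

```python
from collections import Counter

def word_shares(titles, n=10):
    """Return (lexical density in %, top-n list of (word, count, share in %))."""
    words = [word for title in titles for word in title.lower().split()]
    if not words:
        return 0.0, []
    counts = Counter(words)
    total = len(words)
    # Lexical density here means distinct words divided by total words.
    density = 100 * len(counts) / total
    ranking = [(word, count, round(100 * count / total, 1))
               for word, count in counts.most_common(n)]
    return density, ranking
```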

2 word phrases frequency

Word Occurrences %
free 36 0.9%
video 26 0.6%
po*rn 25 0.6%
nu*de 24 0.6%
pics 23 0.5%
na*ked 19 0.5%
se*x 19 0.5%

The money

Finally, one question remains: how do they make money? These people do not do this for the beauty of the art; they do it for money. At first I thought they sold SEO (Search Engine Optimization) products, but I was wrong. I went to one of the urls submitted in the trackbacks (do not do this at home, it can be dangerous for your computer!). For those who wonder, here is what you see on this site:

Screenshot

It was a pass-through to a video site (por*no) (no link here again) that offers you a plugin to download in order to view the video. Of course this plugin is spyware (told you not to go there). I know this because I ran two tests on it.

First I used virustotal. This is a cool service that allows you to upload a binary and runs many antivirus engines on it. Some of them found that the binary is in fact a spyware downloader:

File setup.exe received on 2007.12.13 15:35:37 (CET)
Antivirus Version Last update Result
AhnLab-V3 2007.12.13.10 2007.12.12 -
AntiVir 7.6.0.40 2007.12.13 DR/Zlob.Gen
Authentium 4.93.8 2007.12.13 -
Avast 4.7.1098.0 2007.12.12 -
AVG 7.5.0.503 2007.12.13 Downloader.Zlob.LI
BitDefender 7.2 2007.12.13 -
CAT-QuickHeal 9.00 2007.12.12 -
ClamAV 0.91.2 2007.12.13 Trojan.Dropper-2529
DrWeb 4.44.0.09170 2007.12.13 Trojan.Popuper.origin
eSafe 7.0.15.0 2007.12.12 -
eTrust-Vet 31.3.5373 2007.12.13 -
Ewido 4.0 2007.12.13 -
FileAdvisor 1 2007.12.13 -
Fortinet 3.14.0.0 2007.12.13 -
F-Prot 4.4.2.54 2007.12.12 -
F-Secure 6.70.13030.0 2007.12.13 -
Ikarus T3.1.1.15 2007.12.13 -
Kaspersky 7.0.0.125 2007.12.13 -
McAfee 5184 2007.12.12 -
Microsoft 1.3007 2007.12.13 TrojanDownloader:Win32/Zlob.gen!dll
NOD32v2 2721 2007.12.13 -
Norman 5.80.02 2007.12.12 -
Panda 9.0.0.4 2007.12.13 -
Prevx1 V2 2007.12.13 -
Rising 20.22.32.00 2007.12.13 -
Sophos 4.24.0 2007.12.13 Troj/Zlobar-Fam
Sunbelt 2.2.907.0 2007.12.13 -
Symantec 10 2007.12.13 -
TheHacker 6.2.9.157 2007.12.12 -
VBA32 3.12.2.5 2007.12.10 -
VirusBuster 4.3.26:9 2007.12.12 -
Webwasher-Gateway 6.6.2 2007.12.13 Trojan.Dropper.Zlob.Gen
 
Additional information
File size: 80367 bytes
MD5: 644801d4594f665cbf2f4b0ebf76b490
SHA1: ea6b20b7aedb2016930cf0f1aaf69a82ff04c2d0
PEiD: -

Then I ran the anubis spyware analyzer. This is a research tool made by the seclab of TU Wien. It is a terrific tool that performs a dynamic analysis of a binary. It lets you know what the spyware will do on your computer without installing it. As you can see in the report, this spyware indeed installs many files on the computer.

Conclusion

First of all, if you made it this far: thank you! I know that this is a very long post, but I didn’t feel that taking shortcuts was an option. What started as a little experiment turned out to be quite an interesting investigation. It might even turn into a research paper at some point. Even if it does not give all the ins and outs of trackback spam, I hope that it has given you an insight into how blog spam is used today to make a profit.

See you next Friday for another post that will possibly be shorter :)