


Dec 07

Blog trackback Spam analysis

This Friday, I present my analysis of a botnet that spams blogs through the trackback/pingback mechanism. The spammers abuse the blog trackback mechanism to improve their ranking on search engines. I have been able to collect data about this botnet for around a year because it targets my personal website.

Analyzing the collected data is useful because it shows how spam evolves over time and how the spammers operate. This is quite a hot topic, and other blogs (see here and here) have entries on the subject. However, to the best of my knowledge, this post is the first to provide a complete technical analysis based on a vast amount of data.

I tried to analyze every aspect of this spam, from the daily activity I monitor, to the type of machines involved, to the type of sites being promoted. Without spoiling all the fun, I even had to do a binary analysis of the file they try to install on your PC.

Before getting into details, let’s take a look at the context of this spam. It is a very specific type of spam because it targets blogs and not mail. Therefore the input and output of this spam are quite different from what you observe in email.

One key characteristic of blogs is that they aim to create interaction between users but also between blogs. That is the “blogosphere”. A common mechanism for interaction between blogs is the trackback/pingback mechanism.


The trackback Flood

The trackback mechanism is used to notify an author that you have linked to one of their documents. It enables authors to keep track of who is linking to, or referring to, their articles. It also lets visitors navigate easily between posts related to the same subject.

For example, if you have a blog and write about this article, you can send a trackback to let the author know about it. Your trackback will be added to the list of sites that refer to it.

The trackback specification is due to Six Apart, which implemented it in Movable Type in 2002. Since 2006 an IETF working group has been working on it, and it should one day become a standard.

Finally, trackbacks generate traffic and improve site rankings. They also make the author of the post happy to see that people find his work useful :)

Spammers found this latter functionality very “useful”. They use it to improve their search engine ranking and drive traffic to their sites. It is appealing to them because the trackback mechanism lets them inject a link pointing to their site into other blogs automatically.

While anti-spam techniques are interesting, I will not detail them in this post because they are a bit out of scope. If I get requests on the subject, I will write a detailed post about it.

Version 1 of the dotclear blog system implements the trackback mechanism in the file tb.php. A trackback is sent by calling this file with the post id.

For example, the URL www.mysite.com/tb.php?id=18 is used to add a trackback to post 18.

More technically, a trackback is an HTTP POST request (not an HTTP GET, as is commonly written) that submits four variables: title, url, excerpt and blog_name.

Here is a sample of a spammed trackback gathered by my trap page:

[title] => Cis transgender wiki
[url] => http://cis-transgender-wiki.spacenow ronomy.yyyy/index.html
[excerpt] => wiki Cis transgender wiki …
[blog_name] => Cis transgender wiki
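To make the request format concrete, here is a minimal Python sketch that builds (but never sends) a trackback ping carrying the four variables listed above. The function name, the example endpoint and the field values are assumptions chosen for illustration; they do not come from the spam software or from dotclear.

```python
from urllib.parse import parse_qs, urlencode
from urllib.request import Request


def build_trackback_request(endpoint, title, url, excerpt, blog_name):
    """Build (but do not send) a trackback ping as an HTTP POST request."""
    fields = {
        "title": title,
        "url": url,
        "excerpt": excerpt,
        "blog_name": blog_name,
    }
    # The four variables travel in the request body, form-encoded.
    body = urlencode(fields).encode("utf-8")
    headers = {"Content-Type": "application/x-www-form-urlencoded; charset=utf-8"}
    return Request(endpoint, data=body, headers=headers, method="POST")


# Hypothetical endpoint and values, for illustration only.
request = build_trackback_request(
    "http://www.mysite.com/tb.php?id=18",
    title="My post about trackback spam",
    url="http://blog.example.org/my-post",
    excerpt="A short summary of the linking post",
    blog_name="Example blog",
)
decoded = parse_qs(request.data.decode("utf-8"))
```

The post id stays in the query string while the four variables are in the body, which is why a plain access log sees the request but not its content.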

My personal site bursztein.eu has been the target of this type of spam for more than a year. I have kept track of every request since the beginning in a combined-format Apache log. The file now exceeds 700MB of data. This gives you an idea of how intense the spam flood can be.

More recently, I also set up a trap page to log spam requests. The trap page logs as much information as possible in an SQL backend. One of my friends calls it a honeyblog :) And that is pretty much what it is: a honeypot for blog trackbacks. It gives spammers the impression that their trackbacks are really inserted into my web page, but instead I record and analyze them. I have even created an incomplete live report of the spam activity.

More specifically, I created this honeyblog for two main purposes:

  1. To be able to analyze the content of trackbacks, since Apache logs do not record request content, and maybe write a publication on it.
  2. To be able to generate a spammer IP list that can be used by anyone.
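The two purposes above can be sketched with a tiny SQL store. This is only an illustration, assuming an SQLite backend and a made-up table layout; the post does not show the honeyblog's real schema or code, and the addresses below are documentation-only examples.

```python
import sqlite3
from datetime import datetime, timezone

# Hypothetical schema: one row per trackback attempt.
SCHEMA = """
CREATE TABLE IF NOT EXISTS trackback_spam (
    received_at TEXT NOT NULL,
    ip          TEXT NOT NULL,
    user_agent  TEXT,
    title       TEXT,
    url         TEXT,
    excerpt     TEXT,
    blog_name   TEXT
)
"""


def open_store(path=":memory:"):
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def log_trackback(conn, ip, user_agent, form):
    """Record one trackback attempt; `form` holds the POSTed variables."""
    conn.execute(
        "INSERT INTO trackback_spam VALUES (?, ?, ?, ?, ?, ?, ?)",
        (
            datetime.now(timezone.utc).isoformat(),
            ip,
            user_agent,
            form.get("title"),
            form.get("url"),
            form.get("excerpt"),
            form.get("blog_name"),
        ),
    )
    conn.commit()


def spammer_ips(conn):
    """Return the sorted list of distinct source IPs seen so far."""
    rows = conn.execute("SELECT DISTINCT ip FROM trackback_spam ORDER BY ip")
    return [row[0] for row in rows]


conn = open_store()
sample = {"title": "t", "url": "http://spam.example/", "excerpt": "e", "blog_name": "b"}
log_trackback(conn, "192.0.2.10", "WordPress/2.1", sample)
log_trackback(conn, "192.0.2.10", "WordPress/2.1", sample)
log_trackback(conn, "198.51.100.7", "WordPress/2.1", sample)
ip_list = spammer_ips(conn)
```

Storing the POSTed variables covers the first purpose (content analysis), and the `SELECT DISTINCT ip` query covers the second (a shareable IP list).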

So how did my personal site end up being targeted by this kind of spam? It is subject to this attack because I used dotclear for a couple of months before I decided to switch to Wordpress. During this period, my site was probably added to the spammers’ list of target blogs, and since then it has experienced spam attempts on a regular basis.

Monthly activity

At this point, you probably wonder how bad the situation is. If it were only a couple of requests every day, it would not be a big deal. Well, it is a big deal! Look at the chart below, which reports the spam activity against my site for the last 12 months:

[Chart: spam attempts per month]

Note that the number of spam attempts for December is a projection. It seems quite accurate if you take a look at the current live report.

Before trying to analyze the spammers’ objectives, let’s take a closer look at an ordinary day of spamming to understand how the spamming platform runs. I chose the 4th of December 2007, which was the first day of my honeyblog.

Daily activity

On the 4th of December, 45798 different IPs tried to add 60148 trackback spams to my site. This means that my server had to generate 60148 pages for nothing. Load analysis shows that before the installation of the honeyblog, spam at its peak consumed up to 30% of my server’s CPU power.

Here are some of the most interesting statistics computed during the analysis.

user agent

First of all, the user agent distribution. The user agent is a variable sent by the client to indicate which software it uses. Back in 2006, the spammers’ user agents were very standard: Internet Explorer, Opera … But today they only use a blog-specific user agent: Wordpress. This makes sense because they try to impersonate real blog trackbacks. Here is a little chart of the user agent distribution for the 4th of December:

[Chart: user agent distribution]

As you can see, they are not using the current version of Wordpress (2.3.x). This makes me believe that they are not continuously updating their software. It seems that a company specializes in writing this type of spam software. I won’t link to them, for obvious reasons, but if you wish, check out the geek and fly blog: Adam has a nice post about this software.

[Screenshot: trackback submitter comment spam blog mass link software]
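For readers who want to reproduce this kind of daily count on their own server, here is a rough sketch of how the number of requests, the number of distinct IPs and the user agent distribution could be extracted from a combined-format Apache log. The regular expression, the function name and the sample lines are assumptions for illustration; this is not the script used for the figures above.

```python
import re
from collections import Counter

# Apache "combined" format:
# host ident user [time] "request" status bytes "referer" "agent"
COMBINED = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" \d{3} \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)


def summarize_day(lines, day, script="/tb.php"):
    """Count trackback POSTs, distinct IPs and user agents for one day.

    `day` uses the log's own date notation, for example "04/Dec/2007".
    """
    ips = set()
    agents = Counter()
    requests = 0
    for line in lines:
        m = COMBINED.match(line)
        if m is None or m["method"] != "POST":
            continue
        if not m["path"].startswith(script) or not m["time"].startswith(day):
            continue
        requests += 1
        ips.add(m["ip"])
        agents[m["agent"]] += 1
    return {"requests": requests, "unique_ips": len(ips), "user_agents": agents}


# Made-up log lines using documentation-only addresses.
sample_log = [
    '192.0.2.10 - - [04/Dec/2007:00:01:12 +0100] "POST /tb.php?id=18 HTTP/1.0" 200 512 "-" "WordPress/2.1"',
    '192.0.2.10 - - [04/Dec/2007:00:05:40 +0100] "POST /tb.php?id=18 HTTP/1.0" 200 512 "-" "WordPress/2.1"',
    '198.51.100.7 - - [04/Dec/2007:13:22:03 +0100] "POST /tb.php?id=4 HTTP/1.0" 200 512 "-" "WordPress/2.0"',
    '203.0.113.5 - - [04/Dec/2007:13:30:00 +0100] "GET /index.php HTTP/1.1" 200 9000 "-" "Mozilla/5.0"',
    '198.51.100.7 - - [05/Dec/2007:01:00:00 +0100] "POST /tb.php?id=4 HTTP/1.0" 200 512 "-" "WordPress/2.0"',
]
summary = summarize_day(sample_log, "04/Dec/2007")
```

On a real 700MB log the same function can be fed the open file object directly, since it only iterates over lines once.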

Spam activity

Next, I was interested in the flood behavior. At first I thought that if I graphed spam activity by hour I would find a “loop”. This would be consistent with the idea that they flood one site, then another, and when the list is exhausted they start again at the beginning. Since my blog is only one entry in their database, I should have observed activity peaks. However, if you look at the chart below, you see that I was wrong.

[Chart: spam activity by hour]

Indeed, there seems to be a period of 7 hours of activity, but there are never fewer than 1700 trackback posts per hour. The only explanation for this distribution is that the entire set of flooding computers is coordinated to spread out its activity. This allows us to infer that the entire spam flood (or at least most of it) is the result of a single individual’s or company’s activity.
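The "loop" hypothesis can be phrased as a simple check: if the flooders walked through a target list one site at a time, some hours should be nearly idle for a given site. The sketch below is only a heuristic with an arbitrary threshold parameter (an assumption of mine), applied to made-up data; the conclusion above was drawn by reading the chart.

```python
from collections import Counter


def hourly_profile(hours):
    """Count requests per hour of day, including hours with no request."""
    counts = Counter(hours)
    return [counts.get(h, 0) for h in range(24)]


def looks_like_loop(profile, idle_threshold):
    """Heuristic: a loop over a target list should leave near-idle hours."""
    return min(profile) <= idle_threshold


# Made-up hours of day, one entry per request, for illustration only.
hours = [h for h in range(24) for _ in range(3)] + [9, 9, 10]
profile = hourly_profile(hours)
flat_flood = looks_like_loop(profile, idle_threshold=0)
```

With a floor of 1700 posts per hour, as observed on the 4th of December, any reasonable idle threshold gives the same answer: no loop is visible at the scale of one hour.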

To observe the spam cycle more precisely, it might be possible to try to isolate the activity of a single IP. That could then allow us to infer the size of the spammers’ target database. But this is more for a research paper than for a blog post, I guess.

Spam platform

I wanted to confirm the hypothesis that the spam was coordinated by a single entity. If this hypothesis were true, then the platform (the set of computers) should be quite homogeneous. To check this, I ran a couple of tests. Three of the most interesting results are the geolocation of the computers, the type of OS run on these computers, and their uptime.

Platform geolocation

First, I wanted to know in which country the platform was located. To find its origin, I ran the list of spammer IPs against an IP geolocation database. I used the free webnet77 database. As I expected, all 45798 IPs belong to a single country: Russia.

[Chart: country of origin of the spamming computers]

This is an additional good hint that the flood is performed by a single entity. A good and unanswered question is: why does a spam platform located in Russia promote Chinese sites?
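A lookup of this kind boils down to finding which address range contains each IP. The sketch below assumes the database has already been loaded as non-overlapping (first address, last address, country code) rows; that layout, the function names and the placeholder country codes are assumptions for illustration, not the actual webnet77 file format.

```python
import bisect
import ipaddress
from collections import Counter


def build_index(ranges):
    """Sort (first_ip, last_ip, country) rows and convert addresses to integers."""
    rows = sorted(
        (int(ipaddress.IPv4Address(first)), int(ipaddress.IPv4Address(last)), country)
        for first, last, country in ranges
    )
    starts = [row[0] for row in rows]
    return starts, rows


def country_of(ip, index):
    """Return the country code of `ip`, or None if no range contains it."""
    starts, rows = index
    value = int(ipaddress.IPv4Address(ip))
    # Last range whose first address is <= value.
    pos = bisect.bisect_right(starts, value) - 1
    if pos >= 0 and rows[pos][0] <= value <= rows[pos][1]:
        return rows[pos][2]
    return None


# Hypothetical ranges built from documentation-only address blocks;
# the country codes are placeholders, not real allocations.
index = build_index([
    ("192.0.2.0", "192.0.2.255", "AA"),
    ("198.51.100.0", "198.51.100.255", "BB"),
])
spammers = ["192.0.2.10", "192.0.2.99", "198.51.100.7", "203.0.113.5"]
by_country = Counter(country_of(ip, index) for ip in spammers)
```

The binary search keeps each lookup cheap, which matters when tens of thousands of IPs are checked against a large range table.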

Platform OS

This result also raises an additional question: how many computers are really behind this set of IPs? At this point it was quite likely that there were in fact few computers with many IPs, in order to bypass spam filters. To validate this new hypothesis, I probed 100 random IPs with nmap.

[Chart: OS distribution]

As you can see, the platform OSes are pretty consistent and unusual. I never expected that it would be FreeBSD. For those who do not know: FreeBSD is a UNIX, but far less popular than Linux (I am not discussing OS merits here). You have probably heard of it because the core of OS X is partly based on it. At this point I had a strong suspicion that there were only a few computers in the platform.

Computer uptime

To be sure, I ran another test. I measured the uptime of the computer behind each IP. See the distribution below:

[Chart: computer uptime distribution]

As you can see, there are only 6 different uptimes for these 100 IPs. Of course this measurement needs to be refined and extended to many more IPs to be sure, but it really tends to confirm that there are few computers, because it is very unlikely that two different computers have the same uptime. This distribution also indicates that the platform is pretty solid, given the long uptime of some computers. It also shows that the spam runs 24 hours a day.
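The reasoning "same uptime, same machine" can be sketched as a grouping step. The code below assumes the uptimes were all measured at roughly the same moment and uses a tolerance in seconds to absorb measurement noise; both the tolerance and the sample values are made up for illustration.

```python
def group_by_uptime(uptimes, tolerance=60):
    """Group IPs by similar uptime.

    IPs are sorted by uptime; an IP joins the current group when its uptime is
    within `tolerance` seconds of the previous IP's uptime. Each group is a
    candidate physical machine.
    """
    groups = []
    previous = None
    for ip, uptime in sorted(uptimes.items(), key=lambda item: item[1]):
        if previous is not None and uptime - previous <= tolerance:
            groups[-1].append(ip)
        else:
            groups.append([ip])
        previous = uptime
    return groups


# Made-up measurements in seconds, using documentation-only addresses.
uptimes = {
    "192.0.2.1": 8_640_000,
    "192.0.2.2": 8_640_020,
    "192.0.2.3": 8_639_990,
    "198.51.100.1": 432_000,
    "198.51.100.2": 432_030,
    "203.0.113.9": 86_400,
}
machines = group_by_uptime(uptimes)
```

Here six IPs collapse into three candidate machines; the same idea applied to the 100 probed IPs would give the 6 groups mentioned above, assuming the measured uptimes are precise enough.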

Spam content

Of course, I ran a couple of basic analyses on the spam content. For this purpose I used a corpus of 1139 spam samples. I used standard text analysis techniques to determine the prominent characteristics of the spam. In this already too long post I only detail the results for the TITLE variable, because the analysis of the other variables does not differ very much.

In these 1139 titles there were 3936 words.

Statistical breakdown

Number of different words : 2041
Complexity factor (Lexical Density) : 51.9%
Readability (Gunning-Fog Index) : (6-easy 20-hard) 2.9
Total number of characters : 28942
Number of characters without spaces : 22949
Average Syllables per Word : 1.86
Average title length (words) : 3.68
Max title length (words) : 11 ( cu*m in here mouth she will spit it back in yours)
Min sentence length (words) : 1 ( sexyimages)
Readability : (100-easy 20-hard, optimal 60-70) 45.9

Well, as you can see, titles are short and have a pretty large lexical density. I never thought that the sex lexicon was so large :) A very important analysis is the top words. It shows that blacklisting words will not work well because titles do not reuse the same sentence again and again. If you look at the word occurrence frequency ranking below, you will see that a word appears in at most 2.7% of the spam, and beyond the top 6 ranks this percentage drops below 1%. (I have added extra * to words to avoid being flagged as a p*or*n site.)

Word Occurrences Frequency Rank

Word Occurrence % Rank
nud*e 106 2.7% 1
fr*ee 80 2% 2
se*x 76 1.9% 3
por*n 56 1.4% 4
pics 51 1.3% 5
na*ked 50 1.3% 5
video 42 1.1% 6
girls 31 0.8% 7
se*xy 31 0.8% 7
he*ntai 31 0.8% 7

2 word phrases frequency

Word Occurrences %
free 36 0.9%
video 26 0.6%
po*rn 25 0.6%
nu*de 24 0.6%
pics 23 0.5%
na*ked 19 0.5%
se*x 19 0.5%
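The lexical density and the word ranking above can be recomputed with a few lines of Python. This sketch assumes that lexical density means distinct words divided by total words, which does reproduce the reported figure (2041 / 3936 gives 51.9%); the tokenizer and the sample titles are my own simplifications, not the tool used for the post.

```python
import re
from collections import Counter


def title_stats(titles, top=3):
    """Compute word counts, lexical density (%) and the most frequent words."""
    words = [w for title in titles for w in re.findall(r"[a-z0-9']+", title.lower())]
    if not words:
        raise ValueError("no words found in titles")
    counts = Counter(words)
    total = len(words)
    return {
        "total_words": total,
        "distinct_words": len(counts),
        "lexical_density": round(100 * len(counts) / total, 1),
        # (word, occurrences, share of all words in %)
        "top": [(w, n, round(100 * n / total, 1)) for w, n in counts.most_common(top)],
    }


# Neutral, made-up titles for illustration only.
stats = title_stats(["free video clips", "free pics", "video pics gallery"])

# Cross-check of the figures reported in the post.
reported_density = round(100 * 2041 / 3936, 1)   # distinct words / total words
reported_top_share = round(100 * 106 / 3936, 1)  # most frequent word
```

Note that the percentages in the ranking are shares of all words (106 out of 3936 for the top word), which is why even the most frequent word stays below 3%.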

The money

Finally, one question remains: how do they make money? These people do not do this for the beauty of the art; they do it for money. At first I thought they sold SEO (Search Engine Optimization) products, but I was wrong. I went to one of the URLs submitted in the trackbacks (do not try this at home, it can be dangerous for your computer!). For those who wonder, here is what you see on this site:

Screenshot

It was a pass-through to a video site (por*no) (no link here again) that offers you a plugin to download in order to view the video. Of course this plugin is spyware (told you not to go there). I know this because I ran two tests on it.

First I used virustotal. This is a cool service that lets you upload a binary and runs many antivirus engines on it. Some antivirus engines found that the binary is in fact a spyware downloader:

File setup.exe received on 2007.12.13 15:35:37 (CET)
Antivirus Version Last update Result
AhnLab-V3 2007.12.13.10 2007.12.12 -
AntiVir 7.6.0.40 2007.12.13 DR/Zlob.Gen
Authentium 4.93.8 2007.12.13 -
Avast 4.7.1098.0 2007.12.12 -
AVG 7.5.0.503 2007.12.13 Downloader.Zlob.LI
BitDefender 7.2 2007.12.13 -
CAT-QuickHeal 9.00 2007.12.12 -
ClamAV 0.91.2 2007.12.13 Trojan.Dropper-2529
DrWeb 4.44.0.09170 2007.12.13 Trojan.Popuper.origin
eSafe 7.0.15.0 2007.12.12 -
eTrust-Vet 31.3.5373 2007.12.13 -
Ewido 4.0 2007.12.13 -
FileAdvisor 1 2007.12.13 -
Fortinet 3.14.0.0 2007.12.13 -
F-Prot 4.4.2.54 2007.12.12 -
F-Secure 6.70.13030.0 2007.12.13 -
Ikarus T3.1.1.15 2007.12.13 -
Kaspersky 7.0.0.125 2007.12.13 -
McAfee 5184 2007.12.12 -
Microsoft 1.3007 2007.12.13 TrojanDownloader:Win32/Zlob.gen!dll
NOD32v2 2721 2007.12.13 -
Norman 5.80.02 2007.12.12 -
Panda 9.0.0.4 2007.12.13 -
Prevx1 V2 2007.12.13 -
Rising 20.22.32.00 2007.12.13 -
Sophos 4.24.0 2007.12.13 Troj/Zlobar-Fam
Sunbelt 2.2.907.0 2007.12.13 -
Symantec 10 2007.12.13 -
TheHacker 6.2.9.157 2007.12.12 -
VBA32 3.12.2.5 2007.12.10 -
VirusBuster 4.3.26:9 2007.12.12 -
Webwasher-Gateway 6.6.2 2007.12.13 Trojan.Dropper.Zlob.Gen
 
Additional information
File size: 80367 bytes
MD5: 644801d4594f665cbf2f4b0ebf76b490
SHA1: ea6b20b7aedb2016930cf0f1aaf69a82ff04c2d0
PEiD: -
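The size, MD5 and SHA1 values in this report are what let someone else confirm they are looking at the same binary. As a minimal sketch, the helper below computes the three values for a local file so they can be compared with the report; the function names are illustrative, and the demo hashes a harmless three-byte temporary file rather than any real sample.

```python
import hashlib
import os
import tempfile

# Values copied from the report above.
REPORTED = {
    "size": 80367,
    "md5": "644801d4594f665cbf2f4b0ebf76b490",
    "sha1": "ea6b20b7aedb2016930cf0f1aaf69a82ff04c2d0",
}


def file_digests(path, chunk_size=65536):
    """Return the size, MD5 and SHA1 of a file, read in chunks."""
    md5 = hashlib.md5()
    sha1 = hashlib.sha1()
    size = 0
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            md5.update(chunk)
            sha1.update(chunk)
            size += len(chunk)
    return {"size": size, "md5": md5.hexdigest(), "sha1": sha1.hexdigest()}


def matches_report(path):
    """True only if size and both digests equal the reported values."""
    return file_digests(path) == REPORTED


# Harmless demo on a temporary file containing the bytes "abc".
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"abc")
demo = file_digests(tmp.name)
demo_matches = matches_report(tmp.name)
os.unlink(tmp.name)
```

Both digests are computed in a single pass, so even a large file is read only once.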

Then I ran the anubis spyware analyzer. This is a research tool made by the seclab of TU Wien. It is a terrific tool that performs a dynamic analysis of a binary. It lets you know what the spyware will do on your computer without installing it. As you can see in the report, this spyware indeed installs many files on the computer.

Conclusion

First of all, if you made it this far: thank you! I know that this is a very long post, but I didn’t feel that taking shortcuts was an option. What started as a little experiment turned out to be quite an interesting investigation; it might even turn into a research paper at some point. Even if it does not give all the ins and outs of trackback spam, I hope that it has given you an insight into how blog spam is used today to make a profit.

See you next Friday for another post that will probably be shorter :)

Nov 14

Distributed Reflection DoS Attack

As often in security, a technique that appears obvious deserves far more attention. Well, “the devil is in the details”. DoS (Denial of Service) and DDoS (Distributed Denial of Service) fall into this category of techniques. When I ask my students what a DDoS is, they always say something like “It is just a flood of TCP SYNs”. Well, it is true, but that is just one form and not the most used one.

A reflection attack is a more complex and efficient form of DDoS because it uses a distributed set of hosts as “bumpers”. This makes such a DDoS very hard to trace or block. The key idea is that the attacker sends SYNs to these hosts with a spoofed source address (the victim’s), so the SYN-ACK or RST packets are sent back to the victim. Combine this with a random pattern and you have a pretty nasty technique. Note that, unlike in a smurf attack, the “bumpers” are not used as amplifiers (well, this is not completely accurate because of the probable TCP retransmissions due to link congestion) but to make the source of the packets unpredictable from the victim’s point of view. The advantage over a simple SYN flood with a spoofed source address is that the traffic is bounced via multiple routes, making the process of tracing back the attack far more complex.

A very good survey of reflection attacks and DDoS in general can be found here: GRC | The Distributed Reflection DoS Attack

Reflection diagrams taken from the Gibson Research Corporation paper