
Mar 04

What personal data does Facebook really send to external applications?

One of Facebook's key features is the ability to add tons of custom applications developed by third-party authors. They range from movie quizzes to photo tagging to vampire fights. No doubt the idea is very cool, but it raises the following privacy concern: what personal data about me can an external application get from Facebook?

Facebook and the success of third-party applications

Third-party applications undeniably contribute to Facebook's success. The numbers speak for themselves.

Thousands of applications available

Currently the Facebook application directory lists 17754 applications that you can add to your profile.

Millions of installations

According to Inside Facebook, more than 65 million applications were added by users in the first month. This success is so huge that even the design of the Facebook interface will evolve to add tabs to handle the increasing amount of information added by these applications. Here is a screenshot of the upcoming interface:

Bottom line: if you use Facebook, you use third-party applications. Hence you should start to wonder what data you give to an application when you add it.

The Enrollment Process

The application enrollment process is straightforward. A single screen asks you whether you authorize the application to get your data and in which part of your profile you want to place it:

As you can see, there is no information about the data you share with the application, nor any control mechanism.

The Facebook Application

In order to know what data is accessible from a third-party application, I took a look at the Facebook API. It is publicly available here.

The most interesting part of the API is Users.getInfo. It allows the third-party application to get data about the user. There is also a part of the API that works with friend links, but that's another story. When the application issues the Users.getInfo query, Facebook returns an XML file that contains the user data.

Information available

So what's in this XML file? Your complete profile. Plain and simple. There is a little subtlety though. If you haven't signed up for the application, the following data are not available:

  • meeting_for: list of desired relationship types corresponding to the “Looking For” profile element. If no relationship types are specified, the meeting_for element is empty. Otherwise represented as a list of seeking child text elements, which may each contain one of the following strings: Friendship, A Relationship, Dating, Random Play, Whatever I can get.
  • meeting_sex: list of desired relationship genders corresponding to the “Interested In” profile element. If no relationship genders are specified, the meeting_sex element is empty. Otherwise represented as a list of sex child text elements, which may each contain one of the following strings: male, female .
  • religion: User-entered “Religious Views” profile field. No guaranteed formatting.
  • significant_other_id: the id of the person the user is in a relationship with. Only shown if both people in the relationship are users of the application making the request.

Everything else (yes, everything) is available whether you have signed up or not. Among these fields, some can be very private:

  • current_location: User-entered “Current Location” profile fields. Contains four children: city, state, country, and zip.
  • education_history: list of school information, as education_info elements, each of which contain name, year, and concentration child elements.
  • relationship_status: User-entered “Relationship Status” profile field. Is either blank or one of the following strings: Single, In a Relationship, In an Open Relationship, Engaged, Married, It’s Complicated
  • work_history: List of work history information, as work_info elements, each of which contain location, company_name, position, description, start_date and end_date child elements. If no work history information is returned, this element is blank.
  • pic*: the list of your profile pictures.

Once again, if you have signed up for the application, all the data is available to the third party, including the four fields listed at the beginning. I wonder what the point is for a movie quiz to know whether I am heterosexual or homosexual.

If you want to see the full list of the data sent, take a look at this page.

What is Facebook doing for security?

It is not surprising that, with all this information available, the Facebook API is used for spam, hoaxes and so on. So what does Facebook do? It tries to restrict developers' power by adding rate limits and making the link to report abuse prominent. Yeah, but that is too late: the data is already in the third party's database…

Edit 03/22: It seems that Facebook now offers a much more advanced way to filter personal information. That is a very good thing (do Facebook developers read my blog? :) ).

Better control of privacy?

So what can we do? A solution exists, at least in research labs: it is called selective access control for XML documents. The underlying idea is to provide various levels of access to the same XML file. In our Facebook case, it could be used to create various levels based on the nature of the application. For instance, a quiz application does not need to know where I work or whether I am married.
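To make the idea concrete, here is a minimal Python sketch of a per-application filter. It is only an illustration: the application classes, the POLICY table and the simplified flat profile in SAMPLE are my assumptions, not what Facebook returns, and the research work on selective access control uses much finer-grained authorizations than a list of allowed tags.

```python
import xml.etree.ElementTree as ET

# Hypothetical policy: which profile fields each class of application may read.
POLICY = {
    "quiz": {"name", "pic"},
    "dating": {"name", "pic", "relationship_status", "meeting_sex"},
}

def filter_profile(xml_text, app_class):
    """Return the profile XML keeping only the fields allowed for app_class."""
    root = ET.fromstring(xml_text)
    allowed = POLICY.get(app_class, set())  # unknown classes get nothing
    for child in list(root):                # copy the list: we remove while iterating
        if child.tag not in allowed:
            root.remove(child)
    return ET.tostring(root, encoding="unicode")

# Simplified, made-up profile (not the real Users.getInfo layout).
SAMPLE = (
    "<user>"
    "<name>Alice</name>"
    "<pic>http://example.com/a.jpg</pic>"
    "<relationship_status>Married</relationship_status>"
    "<religion>None</religion>"
    "</user>"
)

quiz_view = filter_profile(SAMPLE, "quiz")
print(quiz_view)
```

With such a filter in front of the API, the quiz application would still get a name and a picture, but never the relationship status or the religious views.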

If you would like to know more about selective access control, you can read “Securing XML Documents” (pdf) by E. Damiani, S. De Capitani di Vimercati, S. Paraboschi and P. Samarati.

Dec 07

Blog trackback Spam analysis

This Friday, I present my analysis of a botnet that spams blogs through the trackback/pingback mechanism. The spammers try to abuse the blog trackback mechanism to improve their ranking on search engines. I have been able to collect data about this botnet for around a year because it is targeting my personal website.

Analyzing the collected data is useful because it shows how this spam evolves over time and how the spammers operate. This is quite a hot topic, and other blogs (see here and here) have entries on the subject. However, to the best of my knowledge, this post is the first to provide a complete technical analysis based on a vast amount of data.

I tried to analyze every aspect of this spam, from the daily activity I monitor, to the type of machines involved, to the type of sites they are promoting. Without spoiling all the fun, I even had to do a binary analysis of the file they try to install on your PC.

Before getting into details, let's take a look at the context of this spam. It is a very specific type of spam because it targets blogs and not mail. Therefore the input and output of this spam are quite different from what you observe in email.

One key specificity of blogs is that they aim at creating interaction between users but also between blogs. That is the “blogosphere”. A common mechanism to allow interaction between blogs is the trackback/pingback mechanism.


The trackback Flood

The trackback mechanism is used to notify an author that you have made a link to one of their documents. It enables authors to keep track of who is linking to, or referring to, their articles. It is also used to allow visitors to easily navigate between posts that are related to the same subject.

For example, if you have a blog and write about this article, you can send a trackback to let the author know about it. Your trackback will be added to the list of sites that refer to it.

The trackback specification is due to Six Apart, who implemented it in Movable Type in 2002. Since 2006 it has been the subject of an IETF working group, and it is meant to become a standard one day.

Finally, trackbacks help generate traffic and improve a site's ranking. It will also make the author of the post happy to see that people find his work useful :)

Spammers found this latter functionality very “useful”. They use it to improve their search engine ranking and drive traffic to their sites. It is appealing to them because the trackback mechanism allows them to automatically inject a link pointing to their site into other blogs.

While anti-spam techniques are interesting, I will not detail them in this post because they are a bit out of scope. If I get some requests on the subject, I will write a detailed post about it.

Version 1 of the dotclear blog system implements the trackback mechanism in the file tb.php. A trackback is done by calling this file with the post id.

For example, the URL www.mysite.com/tb.php?id=18 is used to add a trackback to post 18.

More technically, a trackback is an HTTP POST request (not an HTTP GET, as is commonly written) that submits four variables: the title, the url, the excerpt and the blog_name.
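As an illustration of the request described above, here is a small Python sketch that builds such a POST request without sending it. The helper name build_trackback, the target URL and all the field values are placeholders of mine. Only the four variable names come from the trackback mechanism itself.

```python
from urllib.parse import urlencode
from urllib.request import Request

def build_trackback(target, title, url, excerpt, blog_name):
    """Build (but do not send) the HTTP POST request for a trackback ping."""
    body = urlencode({
        "title": title,
        "url": url,
        "excerpt": excerpt,
        "blog_name": blog_name,
    }).encode("utf-8")
    headers = {"Content-Type": "application/x-www-form-urlencoded; charset=utf-8"}
    return Request(target, data=body, headers=headers, method="POST")

# Placeholder values, for illustration only.
req = build_trackback(
    "http://www.mysite.com/tb.php?id=18",
    title="My reply",
    url="http://example.org/my-post",
    excerpt="A short quote from my post",
    blog_name="Example blog",
)
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

The post id travels in the query string of the target URL, while the four variables travel in the form-encoded body. That is why a GET on tb.php alone does not add anything.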

Here is a sample of a spammed trackback gathered by my trap page:

[title] => Cis transgender wiki
[url] => http://cis-transgender-wiki.spacenow ronomy.yyyy/index.html
[excerpt] => wiki Cis transgender wiki …
[blog_name] => Cis transgender wiki

My personal site bursztein.eu has been the target of this type of spam for more than one year. I have kept track of every request since the beginning in an Apache combined log. The file now exceeds 700MB of data. This gives you an idea of how intense the spam flood can be.

More recently, I also set up a trap page in order to log spam requests. The trap page logs as much information as possible in an SQL backend. One of my friends calls it a honeyblog :) And that is pretty much what it is: a honeypot for blog trackbacks. It gives spammers the impression that their trackbacks are really inserted into my web page, but instead I record and analyze them. I have even created an incomplete live report of the spam activity.

More specifically, I created this honeyblog for two main purposes:

  1. To be able to analyze the content of the trackbacks, since Apache logs do not record the request content, and maybe write a publication about it.
  2. To be able to generate a spammer IP list that can be used by anyone.
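To give an idea of what such a trap page has to do, here is a minimal Python sketch using SQLite. It is not the code of my honeyblog: the table layout, the function names and the sample values are assumptions made up for the example.

```python
import sqlite3
from datetime import datetime, timezone

def open_log(path=":memory:"):
    """Create the (hypothetical) table used to store captured trackbacks."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS trackback ("
        "received_at TEXT, ip TEXT, user_agent TEXT, "
        "title TEXT, url TEXT, excerpt TEXT, blog_name TEXT)"
    )
    return db

def log_trackback(db, ip, user_agent, form):
    """Store one request; values are bound as parameters, never pasted into the SQL."""
    db.execute(
        "INSERT INTO trackback VALUES (?, ?, ?, ?, ?, ?, ?)",
        (
            datetime.now(timezone.utc).isoformat(),
            ip,
            user_agent,
            form.get("title", ""),
            form.get("url", ""),
            form.get("excerpt", ""),
            form.get("blog_name", ""),
        ),
    )
    db.commit()

def spammer_ips(db):
    """Return the distinct source IPs seen so far, sorted."""
    rows = db.execute("SELECT DISTINCT ip FROM trackback ORDER BY ip")
    return [row[0] for row in rows]

db = open_log()
log_trackback(db, "192.0.2.10", "WordPress/2.1", {"title": "spam", "url": "http://example.invalid/"})
log_trackback(db, "192.0.2.10", "WordPress/2.1", {"title": "more spam"})
print(spammer_ips(db))
```

The first function covers purpose 1 (keeping the request content), and the last one covers purpose 2 (the IP list).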

So how did my personal site end up being targeted by this kind of spam? Well, it is subject to this attack because at one point I used dotclear for a couple of months before I decided to switch to Wordpress. During this period, I was probably added to the spammers' list of target blogs, and since then my site has experienced spam attempts on a regular basis.

Monthly activity

At this point, you probably wonder how bad the situation is. If it were only a couple of requests every day, it would not be a big deal. Well, it is a big deal! Look at the chart below, which reports the spam activity against my site for the last 12 months:

[Chart: spam attempts per month over the last 12 months]

Note that the number of spam attempts for December is a forecast. It seems quite accurate if you take a look at the current live report.

Before trying to analyze the spammers' objectives, let's take a closer look at an ordinary day of spamming to understand how the spamming platform runs. I chose the 4th of December 2007, which was the first day of my honeyblog.

Daily activity

On the 4th of December, 45798 different IPs tried to add 60148 trackback spams to my site. This means that my server had to generate 60148 pages for nothing. Load analysis shows that before the installation of the honeyblog, the spam at its peak consumed up to 30% of my server's CPU power.
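For the curious, figures like these can be computed from an Apache combined log with a few lines of Python. This is only a sketch: the sample lines are invented, and I assume the standard combined log layout and that trackbacks arrive as POST requests on /tb.php.

```python
import re

# Apache combined log: host ident user [time] "request" status size "referer" "agent"
LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) \S+ "(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

def daily_totals(lines, day, path="/tb.php"):
    """Count trackback POSTs and distinct client IPs for one day, e.g. '04/Dec/2007'."""
    requests, ips = 0, set()
    for line in lines:
        m = LINE.match(line)
        if not m or not m.group("time").startswith(day):
            continue
        parts = m.group("request").split()
        if len(parts) >= 2 and parts[0] == "POST" and parts[1].startswith(path):
            requests += 1
            ips.add(m.group("ip"))
    return requests, len(ips)

# Invented sample lines, for illustration only.
SAMPLE = [
    '192.0.2.1 - - [04/Dec/2007:10:00:01 +0100] "POST /tb.php?id=18 HTTP/1.0" 200 512 "-" "WordPress/2.1"',
    '192.0.2.2 - - [04/Dec/2007:10:00:02 +0100] "POST /tb.php?id=18 HTTP/1.0" 200 512 "-" "WordPress/2.1"',
    '192.0.2.1 - - [04/Dec/2007:11:30:00 +0100] "POST /tb.php?id=7 HTTP/1.0" 200 512 "-" "WordPress/2.0"',
    '192.0.2.3 - - [05/Dec/2007:00:00:05 +0100] "GET /index.php HTTP/1.1" 200 1024 "-" "Mozilla/5.0"',
]
totals = daily_totals(SAMPLE, "04/Dec/2007")
print(totals)
```

On a 700MB file you would pass the open file object instead of a list, so that lines are read one at a time.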

Here are some of the most interesting statistics computed during the analysis.

user agent

First of all, the user agent distribution. The user agent is a variable sent by the client to indicate which software it uses. Back in 2006, spammer user agents were very standard: Internet Explorer, Opera… But today they only use a blog-specific user agent: Wordpress. This makes sense because they try to impersonate real blog trackbacks. Here is a little chart of the user agent distribution for the 4th of December:
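A distribution like this one is easy to compute, since the user agent is the last quoted field of a combined log line. Here is a small Python sketch. The function names and the sample lines are made up for the example.

```python
from collections import Counter

def agent_of(line):
    """Return the user agent, i.e. the last quoted field of a combined log line."""
    parts = line.rstrip().rsplit('"', 2)
    return parts[1] if len(parts) == 3 else ""

def agent_shares(lines):
    """Return (agent, count, percentage) tuples, most frequent first."""
    counts = Counter(agent_of(line) for line in lines)
    total = sum(counts.values())
    return [(agent, n, 100.0 * n / total) for agent, n in counts.most_common()]

# Invented sample lines, for illustration only.
SAMPLE = [
    '192.0.2.1 - - [04/Dec/2007:10:00:01 +0100] "POST /tb.php?id=18 HTTP/1.0" 200 512 "-" "WordPress/2.1"',
    '192.0.2.2 - - [04/Dec/2007:10:00:02 +0100] "POST /tb.php?id=18 HTTP/1.0" 200 512 "-" "WordPress/2.1"',
    '192.0.2.3 - - [04/Dec/2007:10:00:03 +0100] "POST /tb.php?id=18 HTTP/1.0" 200 512 "-" "WordPress/2.1"',
    '192.0.2.4 - - [04/Dec/2007:10:00:04 +0100] "POST /tb.php?id=18 HTTP/1.0" 200 512 "-" "WordPress/2.0"',
]
shares = agent_shares(SAMPLE)
for agent, count, percent in shares:
    print(f"{agent:15} {count:5} {percent:5.1f}%")
```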

[Chart: user agent distribution on the 4th of December]

As you can see, they are not using the current version of Wordpress (2.3.x). This makes me believe that they are not continuously updating their software. It seems that a company specializes in writing this type of spam software. I won't link to them, for obvious reasons, but if you wish, check out the geek and fly blog: Adam has a nice post about this software.

Spam activity

Next, I was interested in the flood behavior. At the beginning, I thought that if I graphed the spam activity by hour I would find a “loop”. This would be consistent with the idea that they flood one site, then another, and when the list is exhausted they start again at the beginning of the list. Since my blog is only one entry in their database, I should have observed activity peaks. However, if you look at the chart below, you will see that I was wrong.

[Chart: spam activity per hour on the 4th of December]

Indeed, there seems to be a period of 7 hours in the activity, but there are never fewer than 1700 trackback posts per hour. The only explanation for this distribution is that the entire set of flooding computers is coordinated to spread out its activity. This allows us to infer that the entire spam (or at least most of it) is the result of the activity of a single individual or company.
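The hourly breakdown itself is a simple bucketing of the log timestamps. A possible Python sketch, assuming combined log timestamps of the form [04/Dec/2007:10:00:01 +0100] and using invented sample lines:

```python
def requests_per_hour(lines):
    """Return a 24-element list: number of log lines seen during each hour of the day."""
    counts = [0] * 24
    for line in lines:
        start, end = line.find("["), line.find("]")
        if start == -1 or end < start:
            continue
        # '04/Dec/2007:10:00:01 +0100' -> ['04/Dec/2007', '10', '00', '01 +0100']
        fields = line[start + 1:end].split(":")
        if len(fields) >= 2 and fields[1].isdigit() and int(fields[1]) < 24:
            counts[int(fields[1])] += 1
    return counts

# Invented sample lines, for illustration only.
SAMPLE = [
    '192.0.2.1 - - [04/Dec/2007:10:00:01 +0100] "POST /tb.php?id=18 HTTP/1.0" 200 512 "-" "WordPress/2.1"',
    '192.0.2.2 - - [04/Dec/2007:10:59:59 +0100] "POST /tb.php?id=18 HTTP/1.0" 200 512 "-" "WordPress/2.1"',
    '192.0.2.3 - - [04/Dec/2007:11:30:00 +0100] "POST /tb.php?id=7 HTTP/1.0" 200 512 "-" "WordPress/2.0"',
]
hourly = requests_per_hour(SAMPLE)
print(hourly)
print("quietest hour:", min(hourly), "requests")
```

On the real data, the interesting number is the minimum of this list: a flooder looping over a target list would leave some hours close to zero, which is not what I observed.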

To observe the spam cycle more precisely, it might be possible to isolate one cycle of activity. That could then allow us to infer the size of the spammers' target database. But I guess this is more for a research paper than for a blog post.

Spam platform

I wanted to confirm the hypothesis that the spam was coordinated by a single entity. If this hypothesis were true, then the platform (the set of computers) should be quite homogeneous. To check this, I ran a couple of tests. Three of the most interesting results are the geolocation of the computers, the type of OS running on these computers, and their uptime.

Platform geolocation

First, I wanted to know in which country the platform was located. To find its origin, I ran the list of spammer IPs against an IP geolocation database. I used the free webnet77 database. As I expected, all 45798 IPs belong to a single country: Russia.
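Such a lookup boils down to finding which IP range contains each address. Here is a minimal Python sketch with a binary search. The ranges and country names below are placeholders: I assume the database can be loaded as sorted, non-overlapping (first IP, last IP, country) ranges, and the real file layout may differ.

```python
import bisect
import ipaddress
from collections import Counter

# Hypothetical ranges (documentation addresses, made-up country names).
RANGES = sorted(
    (int(ipaddress.IPv4Address(first)), int(ipaddress.IPv4Address(last)), country)
    for first, last, country in [
        ("192.0.2.0", "192.0.2.255", "Exampleland"),
        ("198.51.100.0", "198.51.100.255", "Testonia"),
    ]
)
STARTS = [r[0] for r in RANGES]

def country_of(ip):
    """Return the country of the range containing ip, or None if no range matches."""
    n = int(ipaddress.IPv4Address(ip))
    i = bisect.bisect_right(STARTS, n) - 1   # last range starting at or before n
    if i >= 0 and RANGES[i][0] <= n <= RANGES[i][1]:
        return RANGES[i][2]
    return None

def countries(ips):
    """Count how many IPs fall in each country."""
    return Counter(country_of(ip) for ip in ips)

print(countries(["192.0.2.77", "192.0.2.78", "198.51.100.1", "203.0.113.5"]))
```

If the resulting counter has a single key, as it did for my list, the platform is located in a single country.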

[Chart: country of origin of the spamming computers]

This is another good hint that the flood is performed by a single entity. A good and unanswered question is: why does a spam platform located in Russia promote Chinese sites?

Platform OS

This result also raises an additional question: how many computers are really behind this set of IPs? At this point, it was quite likely that there were in fact only a few computers with many IPs, in order to bypass spam filters. To validate this new hypothesis, I probed 100 random IPs with nmap.

[Chart: OS distribution of the 100 probed IPs]

As you can see, the platform OSes are pretty consistent and unusual. I never expected it to be FreeBSD. For those who do not know: FreeBSD is a UNIX, but way less popular than Linux (I am not discussing OS merits here). You have probably heard of it because the core of OS X is partly based on it. At this point, I had a strong suspicion that there were only a few computers in the platform.

Computer uptime

To be sure, I ran another test: I measured the uptime of the computer behind each IP. See the distribution below:

[Chart: uptime distribution of the 100 probed IPs]

As you can see, there are only 6 different uptimes for these 100 IPs. Of course, this measurement needs to be refined and extended to many more IPs to be sure, but it really tends to confirm that there are only a few computers, because it is very unlikely that two different computers have the same uptime. This distribution also indicates that the platform is pretty solid, given the long uptime of some computers. It also suggests that the spam runs 24 hours a day.
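The reasoning is simply a grouping of the IPs by measured uptime: each group is a candidate for one physical machine. A tiny Python sketch of that grouping, where the measurements are invented and expressed in days:

```python
from collections import defaultdict

def group_by_uptime(measurements):
    """Map each observed uptime (in days) to the list of IPs that reported it."""
    groups = defaultdict(list)
    for ip, days in measurements:
        groups[days].append(ip)
    return dict(groups)

# Invented measurements, for illustration only.
MEASURED = [
    ("192.0.2.1", 212),
    ("192.0.2.2", 212),
    ("192.0.2.3", 35),
    ("192.0.2.4", 212),
    ("192.0.2.5", 35),
]
groups = group_by_uptime(MEASURED)
print(len(groups), "distinct uptimes for", len(MEASURED), "IPs")
```

A real measurement would need some tolerance, since remotely estimated uptimes are not exact. Grouping on the exact value is the simplest possible version.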

Spam content

Of course, I ran a couple of basic analyses on the spam content. For this purpose, I used a corpus of 1139 spam samples and standard text analysis techniques to determine the prominent characteristics of the spam. In this already too long post, I only detail the results for the TITLE variable, because the analysis of the other variables does not differ very much.

In these 1139 titles there were 3936 words.

Statistical breakdown

Number of different words : 2041
Complexity factor (Lexical Density) : 51.9%
Readability (Gunning-Fog Index) : (6-easy 20-hard) 2.9
Total number of characters : 28942
Number of characters without spaces : 22949
Average Syllables per Word : 1.86
Average title length (words) : 3.68
Max title length (words) : 11 ( cu*m in here mouth she will spit it back in yours)
Min title length (words) : 1 ( sexyimages)
Readability : (100-easy 20-hard, optimal 60-70) 45.9

Well, as you can see, titles are short and have a pretty large lexical density. I never thought that the sex lexicon was so large :) A very important analysis is the top words. It shows that blacklisting words will not work well, because titles do not reuse the same sentence again and again. If you look at the word occurrence frequency ranking below, you will see that the most frequent word accounts for only 2.7% of all the words, and beyond the first 6 ranks this percentage drops below 1%. (I have added extra * to the words to avoid being flagged as a p*or*n site.)

Word Occurrences Frequency Rank

Word      Occurrences   %      Rank
nud*e     106           2.7%   1
fr*ee     80            2%     2
se*x      76            1.9%   3
por*n     56            1.4%   4
pics      51            1.3%   5
na*ked    50            1.3%   5
video     42            1.1%   6
girls     31            0.8%   7
se*xy     31            0.8%   7
he*ntai   31            0.8%   7

2 word phrases frequency

Word      Occurrences   %
free      36            0.9%
video     26            0.6%
po*rn     25            0.6%
nu*de     24            0.6%
pics      23            0.5%
na*ked    19            0.5%
se*x      19            0.5%
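For those who want to reproduce this kind of statistics on their own corpus, here is a minimal Python sketch. It is not the tool I used: the tokenizer and the function names are my own simplifications and the three titles are invented. The lexical density is simply the number of different words divided by the total number of words, which for my corpus gives 2041 / 3936 ≈ 51.9%.

```python
import re
from collections import Counter

def words_of(title):
    """Lowercase a title and split it into words (letters, digits, * and ')."""
    return re.findall(r"[a-z0-9*']+", title.lower())

def title_stats(titles, top=3):
    """Compute word counts, lexical density and the most frequent words and word pairs."""
    words = [w for t in titles for w in words_of(t)]
    counts = Counter(words)
    pairs = Counter(
        " ".join(pair)
        for t in titles
        for pair in zip(words_of(t), words_of(t)[1:])  # consecutive words inside one title
    )
    return {
        "total_words": len(words),
        "distinct_words": len(counts),
        "lexical_density": 100.0 * len(counts) / len(words) if words else 0.0,
        "top_words": [(w, n, 100.0 * n / len(words)) for w, n in counts.most_common(top)],
        "top_pairs": pairs.most_common(top),
    }

# Invented titles, for illustration only.
TITLES = ["free video", "free pics", "video pics free"]
stats = title_stats(TITLES)
print(stats["total_words"], stats["distinct_words"], round(stats["lexical_density"], 1))
print(stats["top_words"])
```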

The money

Finally, one question remains: how do they make money? These people do not do this for the beauty of the art; they do it for money. At first I thought they sold SEO (Search Engine Optimization) products, but I was wrong. I went to one of the URLs submitted in the trackbacks (do not try this at home, it can be dangerous for your computer!). For those who wonder, here is what you see on this site:

[Screenshot of the promoted site]

It was a pass-through to a video site (por*no) (no link here, again) that offers you to download a plugin to view the video. Of course, this plugin is spyware (I told you not to go there). I know this because I ran two tests on it.

First, I used VirusTotal. This is a cool service that lets you upload a binary and runs many antivirus engines on it. Some of them found that the binary is in fact a spyware downloader:

File setup.exe received on 2007.12.13 15:35:37 (CET)
Antivirus Version Last update Result
AhnLab-V3 2007.12.13.10 2007.12.12 -
AntiVir 7.6.0.40 2007.12.13 DR/Zlob.Gen
Authentium 4.93.8 2007.12.13 -
Avast 4.7.1098.0 2007.12.12 -
AVG 7.5.0.503 2007.12.13 Downloader.Zlob.LI
BitDefender 7.2 2007.12.13 -
CAT-QuickHeal 9.00 2007.12.12 -
ClamAV 0.91.2 2007.12.13 Trojan.Dropper-2529
DrWeb 4.44.0.09170 2007.12.13 Trojan.Popuper.origin
eSafe 7.0.15.0 2007.12.12 -
eTrust-Vet 31.3.5373 2007.12.13 -
Ewido 4.0 2007.12.13 -
FileAdvisor 1 2007.12.13 -
Fortinet 3.14.0.0 2007.12.13 -
F-Prot 4.4.2.54 2007.12.12 -
F-Secure 6.70.13030.0 2007.12.13 -
Ikarus T3.1.1.15 2007.12.13 -
Kaspersky 7.0.0.125 2007.12.13 -
McAfee 5184 2007.12.12 -
Microsoft 1.3007 2007.12.13 TrojanDownloader:Win32/Zlob.gen!dll
NOD32v2 2721 2007.12.13 -
Norman 5.80.02 2007.12.12 -
Panda 9.0.0.4 2007.12.13 -
Prevx1 V2 2007.12.13 -
Rising 20.22.32.00 2007.12.13 -
Sophos 4.24.0 2007.12.13 Troj/Zlobar-Fam
Sunbelt 2.2.907.0 2007.12.13 -
Symantec 10 2007.12.13 -
TheHacker 6.2.9.157 2007.12.12 -
VBA32 3.12.2.5 2007.12.10 -
VirusBuster 4.3.26:9 2007.12.12 -
Webwasher-Gateway 6.6.2 2007.12.13 Trojan.Dropper.Zlob.Gen

Additional information
File size: 80367 bytes
MD5: 644801d4594f665cbf2f4b0ebf76b490
SHA1: ea6b20b7aedb2016930cf0f1aaf69a82ff04c2d0
PEiD: -
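The size, MD5 and SHA1 values in this report are what you need if you want to check whether a sample is the same file. Here is a small Python sketch that computes them for any file. The function name is mine, and the demo uses a temporary file containing "hello", not the malware sample.

```python
import hashlib
import os
import tempfile

def fingerprints(path, chunk_size=65536):
    """Return the size, MD5 and SHA1 of a file, reading it chunk by chunk."""
    md5, sha1 = hashlib.md5(), hashlib.sha1()
    size = 0
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            size += len(chunk)
            md5.update(chunk)
            sha1.update(chunk)
    return {"size": size, "md5": md5.hexdigest(), "sha1": sha1.hexdigest()}

# Demo on a harmless temporary file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello")
result = fingerprints(tmp.name)
os.unlink(tmp.name)
print(result)
```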

Then I ran the Anubis spyware analyzer. This is a research tool made by the seclab of TU Wien. It is a terrific tool that performs a dynamic analysis of a binary. It lets you know what the spyware will do on your computer without installing it. As you can see in the report, this spyware indeed installs many files on the computer.

Conclusion

First of all, if you made it this far: thank you! I know this is a very long post, but I didn't feel that taking shortcuts was an option. What started as a little experiment turned out to be quite an interesting investigation; it might even turn into a research paper at some point. Even if it does not give all the ins and outs of trackback spam, I hope that it has given you an insight into how blog spam is used today to make a profit.

See you next Friday for another post that will possibly be shorter :)