Friday, December 2, 2005

Internet Slum: Referrer Pollution Attacks

Thanks to a wonderfully insightful feedback message from Christopher Masto, who read the original report of what appeared at the time to be an odd distributed denial of service attack, I now believe I understand what is going on.

The symptom remains the same: large numbers of identical Web server hits from a variety of sites (disproportionately in locations known to be the home of many spammers), each requesting the same Webalizer server status page and specifying as the referrer (the URL of the page supposedly containing the link to the document being fetched) what appear to be pages constructed to attract search engines to lists of commercial sites which, in fact, contain no link to the status page being requested. As I'm writing this, for example, two sites, one in Russia and the other in Ukraine, are pumping in such HTTP requests at the rate of several packets per second, even though a version of Gardol configured to recognise the attack is dispatching them directly to the bit bucket with iptables. (I've edited the following iptables status report to make it fit on the page, and hidden the second byte of the offending IP addresses to avoid giving them free publicity.)

   Chain INPUT (policy ACCEPT 2335M packets, 200G bytes)
    pkts bytes target  prot  in   out     source
   53014 2629K DROP    tcp   *    *       81.x.8.26
   35965 1774K DROP    tcp   *    *       195.x.176.138

What these sites appear to be doing is crawling sites with a high page rank in Google and other search engines (Fourmilab has a very high page rank for some common and hence presumably valuable queries such as "earth", "sky", and "diet") looking for Webalizer-generated statistics pages. If they find them, they start blasting in zillions of hits on those pages, sometimes (but not always) not even bothering to read all the result data from the TCP connection, which causes the out-of-state TCP messages from the firewall which alerted me to this in the first place. The hits on the statistics pages, in turn, all specify as a referrer "search engine poison" pages such as:
which are filled with obviously mechanically generated keywords and links to other pages which go to advertisers. (Like everything on the Internet, many such pages promote pornography or drug pedlars; I have listed only G-rated exemplars here, although you never know where you may end up if you follow the links in them. To avoid giving these scum precisely what they're looking for, a link from my site, I have not provided a link or complete URL, and I have added the name of a fruit after each top level domain name. If you wish to see the content of one of these sites, remove the fruit and add the HTTP protocol specifier at the start.)

But why all the hits on the status pages, you ask? Well, most sites that run Webalizer use the default configuration, which includes a list of the top 30 referrers by URL. If, by flooding your site with requests, they can work their way into that list, then when your site status page is crawled by Googlebot, the referrer link will be seen as a link from a highly ranked site to their trash, which is believed to drastically boost their own page ranking. The more highly ranked sites they pollute this way, the higher their own rank rises. You can see the sites they've hit by doing a Google search for these URLs and noting that they're almost all in referrer statistics.
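To make the mechanism concrete, here is a minimal Python sketch of the kind of tally a statistics package keeps: it counts referrers for hits on the statistics pages and returns the top 30. It is not Webalizer's code. The function name `top_referrers`, the assumption that the server writes Combined Log Format, and the `/serverstats/` prefix used as the default are all illustrative choices.

```python
import re
from collections import Counter

# Assumed input: Apache-style Combined Log Format, e.g.
# host - - [date] "GET /path HTTP/1.1" 200 1234 "referrer" "user-agent"
LOG_LINE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[[^\]]*\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" '
    r'\d{3} \S+ "(?P<referrer>[^"]*)"'
)


def top_referrers(lines, path_prefix="/serverstats/", limit=30):
    """Return the most common referrers for requests under path_prefix."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None:
            continue  # skip lines that are not in the assumed format
        if not match.group("path").startswith(path_prefix):
            continue  # only hits on the statistics pages are of interest
        referrer = match.group("referrer")
        if referrer in ("", "-"):
            continue  # no referrer supplied
        counts[referrer] += 1
    return counts.most_common(limit)
```

A referrer that appears thousands of times against the statistics pages, from a page that contains no link to them, is a strong hint that the list is being polluted; feeding such a function the server's access log is one way to spot it.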

Now, like the more conventional E-mail spammers, these guys are vandals who don't care in the least how much network bandwidth they squander and how much congestion they create on the outbound Internet connections of the victim sites they pound. They don't have to hit the statistics pages themselves, which tend to be large for active sites (mine are about 100K); they could request something tiny and have the same effect on the referrer statistics. But even more moronic is that before piling on a site they don't seem to check /robots.txt for a Web crawler restriction on the statistics pages such as mine:

    User-agent: *
    Disallow: /serverstats/

or for a document-specific exclusion in the statistics page itself, such as the one I include:

    <meta name="robots" content="noindex,nofollow">

either of which will cause search engine Web crawlers to ignore the page. So all of the pounding on my site is of absolutely no benefit whatsoever to the arthropod apes who are doing it—even if they did manage to get their cretinous URLs into the list of top referrers (which they did before I configured Gardol to drop their packets), it won't help the page rank of their sites because Google, Yahoo, MSN and the rest don't index the statistics pages due to the robots exclusion.
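The check the attackers skip takes only a few lines. As a minimal sketch, Python's standard `urllib.robotparser` can answer whether a well-behaved crawler would fetch a given path under the robots.txt rules quoted above. The helper name `crawler_may_fetch` and the example paths are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser

# The exclusion quoted in the text above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /serverstats/
"""


def crawler_may_fetch(robots_txt, url, user_agent="Googlebot"):
    """Report whether robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Hypothetical paths, used only to exercise the rules.
    for path in ("/serverstats/index.html", "/index.html"):
        print(path, crawler_may_fetch(ROBOTS_TXT, path))
```

If the statistics path comes back as disallowed, no compliant search engine will index the referrer list, and flooding it buys the attacker nothing.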

Webalizer does not by default, however, include the "noindex" declaration, so to deter these bozos and reduce the probability of an attack, it's wise to include statistics pages in /robots.txt and/or add the requisite <meta> tag to each of them with a declaration like:

    HTMLHead <meta name="robots" content="noindex,nofollow">

in the webalizer.conf file for the site.
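After changing the configuration, it is worth confirming that the regenerated pages really carry the tag. The following is a minimal sketch using Python's standard `html.parser`; the names `RobotsMetaFinder` and `robots_directives` are made up for this example, and the page text is assumed to have been read from wherever your statistics pages are written.

```python
from html.parser import HTMLParser


class RobotsMetaFinder(HTMLParser):
    """Collect the directives of any <meta name="robots"> tags in a page."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute values may be None when written without "=value".
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() != "robots":
            return
        for directive in attr.get("content", "").split(","):
            directive = directive.strip().lower()
            if directive:
                self.directives.add(directive)


def robots_directives(html_text):
    """Return the set of robots directives declared in html_text."""
    finder = RobotsMetaFinder()
    finder.feed(html_text)
    finder.close()
    return finder.directives
```

A page is protected in the sense described here when the returned set contains both `noindex` and `nofollow`.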

All of this goes to show that when adding automatically generated content to a Web site, you should be as paranoid as Perl in exposing potentially "tainted" data from the outside. Even something as obscure as a list of top referrers in a server statistics page may be used as a billboard to promote somebody else's site, and your site subjected to callous abuse in order to so pollute it.
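One way to act on that paranoia, if you generate such a report yourself, is to treat every referrer string as untrusted input: escape it and emit it as plain text rather than as a live link. The sketch below illustrates that idea using Python's standard `html.escape`. It is an assumption about how one might write a report generator, not a description of what Webalizer does, and `render_referrer_rows` is a hypothetical name.

```python
import html


def render_referrer_rows(referrers):
    """Render (url, hits) pairs as table rows with the URL as inert text.

    The URL is escaped and deliberately not wrapped in an <a> element,
    so the page carries no link that a search engine could follow.
    """
    rows = []
    for url, hits in referrers:
        rows.append(
            "<tr><td>{}</td><td>{}</td></tr>".format(
                int(hits), html.escape(url, quote=True)
            )
        )
    return "\n".join(rows)
```

Escaping also means a referrer crafted to contain markup cannot inject its own anchor tag into the page.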

Posted at December 2, 2005 19:46