Saturday, July 21, 2018

Twitterbot is a Bad, Bad Boy

After I migrated the WordPress/BuddyPress site I administer, ratburger.org, to the Amazon Web Services (AWS) Linux 2 operating system platform on 2018-07-08, I observed intermittent errors in the system log reporting “php-fpm[21865]: [WARNING] [pool www] seems busy (you may need to increase pm.start_servers, or pm.min/max_spare_servers), spawning 8 children, there are 3 idle, and 27 total children” or some such. After correlating these with the HTTPD access_log, I found that they were due to the PHP-fpm mechanism (which is new in Linux 2) running out of worker processes or, even worse, launching so many of them that it exhausted system memory and caused worker processes to crash. (And don't tell me to configure a swap file; that will only turn process crashes into system-wide thrashing oblivion.)

And why were all of these PHP processes running around? After all, this is a discussion site with fewer than 120 members and modest traffic. Looking at the log pointed the finger at Twitterbot, a Web crawler operated by the Californian socialist network called Twitter, which claims it's accessing sites to see if they provide “Twitter cards” for URLs posted on its system. Well, it's awfully frenetic in doing so. In the first incident I investigated, it hit my site from four different IP addresses (199.16.157.180-183) a total of 16 times within one second, all requesting the same page. You may call this a Web crawler. To me it looks like a denial of service attack. These requests will all spawn PHP-fpm worker processes and may blow away system memory, and for no reason. We do not support Twitter cards, and there is no conceivable reason for Twitter to make more than one request to determine we don't.

Enough is enough. I decided to tell Twitter to buzz (or flippy-flap) off. I added:

    User-agent: Twitterbot
    Disallow: /
to robots.txt and sat back to see what would happen. Result? Essentially nothing: it continued to hit the site as before. All right, time to up the ante. I decided to consign Twitterbot to Stalag 403 with the following in .htaccess:
    # Block rogue user agents
    BrowserMatchNoCase 'Twitterbot' evilbots
    Order Allow,Deny
    Allow from ALL
    Deny from env=evilbots
so that any access from Twitterbot will receive a 403 and be informed that its access is forbidden and should not be retried. That ought to fix it, right?

Wrong.

In the last 24 hours there have been three request storms, all for /index.php, with 16 requests in the first and 18 each in the second and third. Each storm arrived within a single second, from four different IP addresses: 199.16.157.180-183. The second and third storms were 19 seconds apart, for a total of 36 hits within a period of less than 20 seconds.
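For anyone who wants to look for similar storms in their own logs, here is a minimal Python sketch that counts hits per second from a given user agent. It assumes the access_log uses Apache's "combined" format (which may not match your configuration), and the function name `find_storms` and the threshold of 8 hits per second are my own illustrative choices, not anything taken from the site's actual tooling.

```python
import re
from collections import Counter
from datetime import datetime

# Assumed layout: Apache "combined" log format,
#   host ident user [time] "request" status bytes "referer" "agent"
LOG_RE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)


def find_storms(lines, agent_substring="Twitterbot", threshold=8):
    """Return {timestamp: hits} for each second with at least `threshold`
    requests whose User-Agent contains `agent_substring` (case-insensitive).

    `threshold` is a hypothetical cut-off; tune it to your own traffic.
    """
    per_second = Counter()
    for line in lines:
        match = LOG_RE.match(line)
        if match is None:
            continue  # skip lines that are not in the assumed format
        if agent_substring.lower() not in match["agent"].lower():
            continue
        # Apache timestamps have one-second resolution, e.g.
        # 21/Jul/2018:23:27:00 +0000
        stamp = datetime.strptime(match["time"], "%d/%b/%Y:%H:%M:%S %z")
        per_second[stamp] += 1
    return {t: n for t, n in per_second.items() if n >= threshold}
```

Calling `find_storms(open("access_log"))` (the path is whatever your server uses) would list each second in which the bot exceeded the threshold, which is enough to line the storms up against the php-fpm warnings in the system log.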

First, for any site running PHP-fpm, this amounts to a denial of service attack: it will blow up the number of worker processes and possibly exhaust memory or start page thrashing and, in any case, delay legitimate user requests. Second, it isn't like the bot is crawling the site: it's making repeated requests for the same page over and over again, from four different IP addresses. Finally, it's violating the HTTP protocol. A 403 status means the client has been forbidden access by the server, and the HTTP standard reads, “Authorization will not help and the request SHOULD NOT be repeated.” (capitals in the original). And yet in the third storm a single IP address hammered in 8 requests for the same page after having received a 403 on the first one. This is either exceptionally stupid or malicious, and I'm beginning to suspect the latter. I'm getting closer and closer to firewalling this IP range. This may break our announcement of posts on Twitter, but at this point I'm not so sure that would be such a bad thing. The IP range is 199.16.157.180/30. Twitter's published outbound IP ranges are much larger: 199.16.156.0/22 and 199.59.148.0/22, but so far I've only seen Twitterbot coming from the four addresses in the first block.
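The CIDR arithmetic here is easy to double-check with Python's standard `ipaddress` module. This is only a sanity check of the ranges quoted above (the variable names are mine): it expands the /30 into its four addresses and confirms which of the two published /22 blocks contains it.

```python
import ipaddress

# The narrow block the storms were observed coming from.
observed = ipaddress.ip_network("199.16.157.180/30")

# The two larger published ranges quoted in the text.
published = [
    ipaddress.ip_network("199.16.156.0/22"),
    ipaddress.ip_network("199.59.148.0/22"),
]

# Iterating a network yields every address in it, including the first
# and last, so a /30 expands to exactly four addresses.
addresses = [str(addr) for addr in observed]

# Which published range(s) is the /30 a subnet of? (subnet_of needs Python 3.7+)
containing = [str(net) for net in published if observed.subnet_of(net)]

print(addresses)
print(containing)
```

The /30 expands to 199.16.157.180 through 199.16.157.183 and falls inside 199.16.156.0/22, so a firewall rule on the /30 alone is the narrowest one that covers everything seen so far.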

I guess we shouldn't expect too much from a “social network” headquartered in a city now known for human feces and used addict needles on its sidewalks. (Hayek noted that any word in the English language is reduced in value by preceding it with “social”.) But once is happenstance, twice is coincidence, and three times is enemy action (Ian Fleming). Thirty-six times in twenty seconds? Welcome to my firewall.

(And note that these requests came from IPv4 address ranges which Twitter acknowledges are their own and were confirmed by WHOIS. So it's not somebody impersonating Twitterbot.)

By the way, if you're interested in intelligent, civil, and wide-ranging conversation, check out Ratburger.org. It's free; there are no advertisements, and no intrusive tracking. All members can post, comment, create and participate in interest groups, and join our weekly audio meet-up.

Posted at July 21, 2018 23:27