We just spent our entire Sunday digging through server logs and conducting a post-mortem on why our website (www.trialx.com) kept going down. According to pingdom.com, our site went down 61 times in the past 2 days! Our traffic numbers are modest, a few hundred thousand visits per month, so we weren't sure what was causing the frequent meltdowns.
Here is some background on our technology stack:
- Hardware: Amazon EC2 High CPU medium (c1.medium) instance with 5 EC2 compute units and 1.7GB RAM
- OS: Ubuntu (kernel 2.6.21.7-2.fc8xen, i686 GNU/Linux)
- App stack: Apache, mod_python, Django (Python), PHP, MySQL
It all started on Friday evening, when we got a server down alert.
Our standard method is to go to the Amazon EC2 console and reboot the server, after which everything goes back to normal. And it did work. However, the server kept going down every few hours. By Saturday morning, we started digging into the server logs and htop-ing to see what was up. We noticed that MySQL and Apache were fighting for CPU, at times pushing the load above 1. We decided it was time to move the database to a separate server, and Amazon Relational Database Service (RDS) seemed like a great alternative. The move involved firing up a new RDS instance (db.m1.small), mysqldump-ing the data from the current DB and piping it into MySQL on RDS, then changing the connection parameters. The entire move took less than 2 hours of downtime. We were back up with RDS firepower! By then it was 3:00 AM Sunday morning.
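For anyone curious, the move boiled down to something like this (the database name, endpoint, and usernames below are placeholders, not our real ones):

```bash
# Dump the current DB and pipe it straight into MySQL on the new RDS instance.
# --single-transaction gives a consistent dump without locking InnoDB tables.
# Both tools prompt for a password; a ~/.my.cnf avoids the double prompt.
mysqldump -u root -p --single-transaction trialx_db \
  | mysql -h mydb.xxxxxxxx.us-east-1.rds.amazonaws.com -u admin -p trialx_db
```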
Unfortunately, the excitement did not last long. By 4:30am the server went down again. We rebooted the server, tail -f'ed the Apache access log, and started looking at each and every IP hitting our server. The major hitters were the usual suspects: GoogleBot, Bingbot, Yandex, and so on. We isolated one nasty MSNbot (wtf?) which was hitting us almost 2-3 times per second. Not cool. We had to politely ban it in our iptables.
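If you ever need to do the same, this is roughly how we spotted and banned it (the IP below is a placeholder, and your log path may differ):

```bash
# Count requests per IP to find the heavy hitters
awk '{print $1}' /var/log/apache2/access.log | sort | uniq -c | sort -rn | head -20

# Drop all traffic from the offending bot (203.0.113.10 is a placeholder IP)
sudo iptables -A INPUT -s 203.0.113.10 -j DROP
```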
In the afternoon, we headed to Chinatown to get some dim sum at Jing Fong, but there were about half a billion folks waiting (no kidding, the place seats about 300 people and there was still an hour wait!).
After we came back, we started benchmarking with the Apache ab tool to simulate load on different URLs on the website. The server did not budge, even at 10 requests per second. We implemented caching on the PHP/WordPress pages using WP Super Cache. No effect. The server still kept going down.
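A typical run looked something like this (the URL is just an example):

```bash
# 1000 requests total, 10 concurrent, against one of our blog pages
ab -n 1000 -c 10 http://www.trialx.com/blog/
```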
We looked at the access logs at the timestamps just before the server went down, but still nothing stood out. We turned on the MySQL slow query log in RDS. Again, nothing stood out. A few geo-IP based queries rendering the TrialX widget took about 1-2 seconds, but these were very rare.
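On RDS, at least on our instance, the slow query log goes to a table rather than a file, so inspecting it is one query away (the endpoint is a placeholder):

```bash
# RDS writes slow queries to the mysql.slow_log table (log_output=TABLE)
mysql -h mydb.xxxxxxxx.us-east-1.rds.amazonaws.com -u admin -p \
  -e "SELECT start_time, query_time, sql_text
      FROM mysql.slow_log ORDER BY start_time DESC LIMIT 10;"
```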
Around 4:30pm, three of us were staring at the Apache access log, htop, the slow query log, and Apache mod_status.
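For those following along, mod_status (assuming it is enabled, ideally with ExtendedStatus On) gives a live snapshot of what every Apache worker is doing:

```bash
# Machine-readable snapshot of worker activity and requests in flight
curl http://localhost/server-status?auto

# Or refresh it every couple of seconds to watch a spike develop
watch -n 2 'curl -s http://localhost/server-status?auto'
```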
And suddenly, we noticed the load shooting up to 2, 3, 4, 10, 30! The Apache logs were full of user-agent strings like "TwitterBot", "TweetmemeBot", "PostRank", and "PyCURL", all of them accessing URLs from our blog with the same Google Analytics parameters.
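A one-liner like this (assuming Apache's standard combined log format, where the user agent is the sixth double-quoted field) is enough to tally who is hammering you:

```bash
# Count requests per user agent over the last few thousand log lines
tail -5000 /var/log/apache2/access.log \
  | awk -F'"' '{print $6}' | sort | uniq -c | sort -rn | head
```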
The htop output looked something like the image below. Essentially, the server went from a load of almost 0 to 31 in a matter of a few seconds. Even swap memory was exhausted. And the site crumbled like an eggshell shattered by a hammer!
We made a tentative hypothesis that Twitterfeed, a service we use to blast new posts out to Facebook and our Twitter accounts, was the culprit behind the massive load that was bringing the server to its knees.
To test the hypothesis, we rebooted the server once more and set Twitterfeed to post at a given time, and there you go: a flurry of Twitter-related bots started accessing the same URLs.
Below are the user agents that hit our server almost instantly after the Twitterfeed update:
- PostRank/2.0 (postrank.com)
- SocialMedia Bot/1.5;
- Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp
- brainbot/1.0 (digitalbrain@brainstreammedia @digitalbrain
- EarlyEdd http://earlyedd.com – Alpha
- kame-rt (firstname.lastname@example.org
- UnwindFetchor/1.0 (+http://www.gnip.com/)
- JS-Kit URL Resolver, http://js-kit.com/
Moreover, each of these bots was accessing the last 5 newly published blog posts, so we were receiving about 16×5 = 80 requests within a span of 2-3 seconds after each update that went to Twitterfeed. It was almost as if the site was under a DDoS attack. So many bots hitting the server at once was causing the load to shoot up, followed by a massive meltdown (see image below).
We now knew the issue was related to the Twitterfeed update, but we were curious to test whether the real culprit was Twitterfeed or Twitter itself.
So we opened our browsers, logged into our Twitter.com account in 3 tabs, and pasted links to our website into tweets. We simulated updates for 9 new URLs, and there we saw it again... the load went up... 2, 3, 4, 10... We looked at the logs, and most of the same user agents were hitting the server.
We learned (in a very painful way) that real-time updates on Twitter were "inviting" a set of bots to our website almost instantly to visit a set of URLs, bringing the server to its knees. This happens because all these bots visit within a small span of 2-3 seconds, making hundreds of requests, instead of following the more spread-out visit patterns of crawlers like Googlebot or Bingbot.
Now, can this scenario have broader implications? We certainly think so. Given Twitter's power as a real-time broadcast medium, its strength is also a potential vulnerability. With just a few tweets, all pointing at a specific target website, it's possible to direct the bots to hit the tweeted URLs simultaneously and bring down a site.
We have stopped Twitterfeed for now, until we figure out a caching solution for our new URLs. We certainly don't want to stop Twitter updates, as they bring decent traffic to our website. We may implement some robot rules asking the Twitter bot army to behave politely when it visits us.
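As a first, very tentative stab at those robot rules, something like a Crawl-delay directive might help, though it is nonstandard and many of these bots probably ignore it (the docroot path below depends on your setup):

```bash
# Ask crawlers to wait 10 seconds between requests. This is a polite
# request only; Crawl-delay is nonstandard and many bots simply ignore it.
cat > /var/www/robots.txt <<'EOF'
User-agent: *
Crawl-delay: 10
EOF
```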
Please let us know if you have any suggestions to handle such traffic/load scenarios. We’d like to hear your solutions.
Engineering @ TrialX