If you are experiencing performance issues with your site, one possible reason is that your bandwidth is being unnecessarily consumed by bad bots. This article explains what bad bots are, how to identify them, and how to stop them from crawling your website so that your genuine visitors are not affected.

What are bad bots up to?

There are quite a number of bad bots in operation, and the number keeps growing rapidly. Most of them come from hackers trying to find a vulnerability in your code. They may be trying to harvest credit card numbers from an online store, or scraping the text of an article to repost it on some random blog. They may also want to steal the usernames and passwords in your database, hoping people reuse the same credentials elsewhere, which surprisingly they often do. Some may just want to post spam comments on your website.

How to stop bad bots from crawling your website

Bots are supposed to obey the rules in your robots.txt file; bad bots, however, simply ignore them. So blocking them through robots.txt alone is not enough (although I recommend that you still do this step). Another way is to block their IPs in your WAF. That is not the most efficient approach either, because bad bots keep changing their IP addresses. The same IP could also be used by a genuine user if it is part of a shared network, in which case you could block a genuine visitor too.

The recommended way is to check the User Agent of the requester and block on that basis. In some cases even that is not sufficient, because the User Agent can be faked to appear to come from Google or MSN. In those cases we need to carefully study other aspects of the request.

Let’s start with some basic understanding of blocking and allowing robots/crawlers.

Basics about blocking/allowing bots

Using robots.txt

Do not allow any bot (user-agent) to access any part of your site
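In robots.txt:

```
User-agent: *
Disallow: /
```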

Allow any bot (user-agent) to access any part of your site
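In robots.txt (an empty Disallow permits everything):

```
User-agent: *
Disallow:
```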

Do not allow bingbot to access any part of your site
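In robots.txt:

```
User-agent: bingbot
Disallow: /
```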

Allow bingbot to access any part of your site
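In robots.txt:

```
User-agent: bingbot
Disallow:
```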

Allow bingbot to access your site, disallow it to access your wp-admin folder
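In robots.txt:

```
User-agent: bingbot
Disallow: /wp-admin/
```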

Using Meta Tags:

Allow all bots to access your page(s)
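As a meta tag in the page's head section:

```html
<meta name="robots" content="index">
```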

Allow all bots to access your page(s) and follow links on the pages
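As a meta tag:

```html
<meta name="robots" content="index, follow">
```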

Allow all bots to access your page(s) but do not allow them to follow links
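As a meta tag:

```html
<meta name="robots" content="index, nofollow">
```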

Do not allow any bots to access your page(s)
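As a meta tag:

```html
<meta name="robots" content="noindex">
```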

Allow bingbot to access your page(s)
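As a meta tag addressed to Bing's crawler by name:

```html
<meta name="bingbot" content="index, follow">
```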

Do not allow Yahoo! Slurp to access your page(s)
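As a meta tag addressed to Yahoo's crawler by name:

```html
<meta name="slurp" content="noindex">
```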

Allow Yahoo! Slurp to access your page(s) and follow the links to more pages
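As a meta tag:

```html
<meta name="slurp" content="index, follow">
```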

Again, bad bots can ignore these meta tags just as easily as robots.txt. The more reliable solution is to block them through .htaccess.

Stop bad bots through .htaccess:

Generally this method is not required for good bots, but then why would you want to block those anyway?

This method forcefully stops bad bots from getting into your site even after you have warned them off through your robots.txt. Through .htaccess we can block a bot by IP, by User Agent, or both. We will look at both methods. The most recommended one, though, is blocking by User Agent.

A. Blocking IPs
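For example, in .htaccess (the address range below is purely illustrative; substitute the offending IPs from your own logs):

```apache
# Apache 2.2 syntax; on Apache 2.4 use "Require not ip" inside <RequireAll>
Order Allow,Deny
Allow from all
Deny from 203.0.113.0/24
```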

The above rule in your .htaccess file will block a range of IPs.

B. Block by User Agent

This is the most recommended way and a good starting point to stop bad bots.
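Such a block might look like this (the bot names are common examples of aggressive crawlers; extend the list from your own logs):

```apache
RewriteEngine On
# Each condition matches one bot's User Agent string
RewriteCond %{HTTP_USER_AGENT} mj12bot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ahrefsbot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} semrushbot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} blexbot [NC]
# Answer with a 404 to anything that matched
RewriteRule .* - [R=404,L]
```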

I have listed some bad bots in the above code. Each line checks for a particular bot's User Agent.

NC means the match is case-insensitive (No Case), and OR means more conditions follow that should also be taken into account.

The last line sends a 404 error page to the bots listed above.

Another way of achieving the same result is to use the code below. This requires the Apache mod_setenvif module to be enabled.
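A sketch of the same list using SetEnvIfNoCase (again, the bot names are illustrative):

```apache
SetEnvIfNoCase User-Agent "mj12bot" bad_bot
SetEnvIfNoCase User-Agent "ahrefsbot" bad_bot
SetEnvIfNoCase User-Agent "semrushbot" bad_bot
# Deny any request that carried one of the flagged User Agents
Order Allow,Deny
Allow from all
Deny from env=bad_bot
```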

Advanced methods to stop bad bots

The above methods complete our basic protection against bad bots. There is just one additional layer of complication beyond what we have seen so far:

  • What if some XYZ bot pretends to be coming from Google?
  • Not all bots we are unaware of are bad, either. So how do we differentiate between good bots and bad bots?

Some general guidelines to answer the above questions:

Study the crawling pattern of the crawlers and check how aggressive they are.

Don’t assume that the aggressive ones will always be picked up by your WAF. Some may be one step ahead of you: they may stay just below the threshold at which your WAF flags them, they may change IPs too often, or they may be exploiting one of your vulnerable plugins.

Here is an article about fake Google bots. Although it is old, it is still applicable; if anything, things have got worse.

As per the report

For every 24 Googlebot visits a website will also be visited by a Fake Googlebot

Just recently I noticed that my Cloudflare Pro account had started blocking Googlebot IPs. Please see the screenshot below.


I was just about to unblock it, but then I thought of checking further. That is when I learned that fake Googlebots exist. Until then I had been blaming Google for indexing my site too often and too aggressively. Below are the details of the fake Googlebot, obtained through nslookup.

Clearly the above bot is not a genuine Google bot. A lot of bad bots run on Amazon servers; however, that alone is definitely not the deciding factor for whether a bot is good or bad.

One way to check is to run a reverse DNS lookup on the requesting IP. For the fake bot above, the lookup returns domain name pointer ec2-52-74-100-2.ap-southeast-1.compute.amazonaws.com.

while the same lookup on the genuine bot's IP returns domain name pointer crawl-66-249-66-1.googlebot.com.

which tells us that the second one is a genuine Googlebot, as its reverse DNS resolves to googlebot.com. To be thorough, also run a forward lookup on that hostname and confirm it resolves back to the same IP, since the reverse record alone can be set to anything by whoever controls that IP range.
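The reverse-then-forward check can be sketched in Python. This is a minimal illustration, not a drop-in verifier: `is_genuine_googlebot` performs live DNS lookups, while `looks_like_googlebot` is only the hostname suffix check. Both function names are my own, not from any library.

```python
import socket

# Genuine Googlebot reverse-DNS names end in one of these suffixes
GOOGLEBOT_SUFFIXES = (".googlebot.com", ".google.com")

def looks_like_googlebot(hostname: str) -> bool:
    # Pure string check on the PTR hostname (trailing dot stripped)
    return hostname.rstrip(".").endswith(GOOGLEBOT_SUFFIXES)

def is_genuine_googlebot(ip: str) -> bool:
    """Reverse-resolve the IP, check the suffix, then forward-resolve
    the hostname and confirm it maps back to the same IP."""
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)  # reverse (PTR) lookup
    except OSError:
        return False
    if not looks_like_googlebot(hostname):
        return False
    try:
        forward_ips = socket.gethostbyname_ex(hostname)[2]  # forward (A) lookup
    except OSError:
        return False
    return ip in forward_ips
```

Applied to the two hosts above, the suffix check alone already separates them: the amazonaws.com name fails, the googlebot.com name passes.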
