Stop bad bots from crawling your website

If you are experiencing performance issues with your site, one of the reasons could be that your bandwidth is being unnecessarily consumed by bad bots. This article explains what bad bots are, how to identify them and how to stop them from crawling your website so that your genuine visitors are not affected.

What are bad bots up to?

There are quite a number of bad bots operating, and the number keeps increasing rapidly. Most of them come from hackers trying to find a vulnerability in your code. They may be trying to get credit card numbers from an online store, or scraping the text of an article to post it on some random blog. They may also want to steal the usernames and passwords of people in your database, hoping those people reuse the same credentials elsewhere (which, surprisingly often, they do). Some may just want to post spam comments on your website.

How to stop bad bots from crawling your website

Bots are supposed to obey the rules in your robots.txt file; however, bad bots simply ignore them. So just blocking them through robots.txt is not enough (although I recommend you still do this step). Another way to block them is to block their IP in your WAF. That too is not the most efficient way because they keep changing their IP addresses. Also, the same IP could be used by a genuine user if the IP is part of a shared network, in which case you could potentially block a genuine visitor too.

The recommended way to block is to check the User Agent of the requestor and block them on that basis. In some cases even that is not sufficient because the User Agent can also be faked to be coming from Google or MSN. In that case we need to carefully study other aspects of the request.

Let’s start with a basic understanding of blocking or allowing robots/crawlers.

Basics about blocking/allowing bots

Using robots.txt

Do not allow any bot (user-agent) to access any part of your site

Allow any bot (user-agent) to access any part of your site

Do not allow bingbot to access any part of your site

Allow bingbot to access any part of your site

Allow bingbot to access your site, disallow it to access your wp-admin folder
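In robots.txt form, the rules above look like this (a minimal sketch; adjust paths such as /wp-admin/ to your own setup):

    # Do not allow any bot to access any part of your site
    User-agent: *
    Disallow: /

    # Allow any bot to access any part of your site
    User-agent: *
    Disallow:

    # Do not allow bingbot to access any part of your site
    User-agent: bingbot
    Disallow: /

    # Allow bingbot to access any part of your site
    User-agent: bingbot
    Disallow:

    # Allow bingbot to access your site, but not your wp-admin folder
    User-agent: bingbot
    Disallow: /wp-admin/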

Using Meta Tags:

Allow all bots to access your page(s)

Allow all bots to access your page(s) and follow links on the pages

Allow all bots to access your page(s) but do not allow them to follow links

Do not allow any bots to access your page(s)

Allow bingbot to access your page(s)

Do not allow Yahoo! Slurp to access your page(s)

Allow Yahoo! Slurp to access your page(s) and follow the links to more pages
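In meta tag form (placed in the <head> of the page) those rules look roughly like this; the bot-specific meta names bingbot and slurp are the ones those particular crawlers recognise:

    <!-- Allow all bots to access the page -->
    <meta name="robots" content="index">

    <!-- Allow all bots to access the page and follow its links -->
    <meta name="robots" content="index, follow">

    <!-- Allow all bots to access the page but do not follow its links -->
    <meta name="robots" content="index, nofollow">

    <!-- Do not allow any bots to access the page -->
    <meta name="robots" content="noindex">

    <!-- Allow bingbot to access the page -->
    <meta name="bingbot" content="index">

    <!-- Do not allow Yahoo! Slurp to access the page -->
    <meta name="slurp" content="noindex">

    <!-- Allow Yahoo! Slurp to access the page and follow its links -->
    <meta name="slurp" content="index, follow">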

Again, bad bots can easily ignore these rules, just as they ignore robots.txt. The only real solution is to block them through .htaccess.

Stop bad bots through .htaccess

Generally this method is not required for good bots, but then why would you want to block those anyway?

This method forcefully stops bad bots from getting into your site even though you have already warned them through your robots.txt. Through .htaccess we can block a bot by IP, by User Agent, or both. We will look into both methods, although the recommended one is blocking by User Agent.

A. Blocking IPs
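For example, a rule along these lines denies a whole block of addresses (203.0.113.0/24 is only a placeholder range; substitute the range you actually want to block):

    # Apache 2.2 style syntax
    Order Allow,Deny
    Deny from 203.0.113.0/24
    Allow from all

    # Apache 2.4 equivalent
    # <RequireAll>
    #     Require all granted
    #     Require not ip 203.0.113.0/24
    # </RequireAll>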

The above rule in your .htaccess file will block a whole range of IPs.

B. Block by User Agent

This is the most recommended way and a good starting point to stop bad bots.
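A sketch of such a block (the bot names here are common examples rather than a definitive list; add or remove lines as needed):

    RewriteEngine On
    # Each condition matches one bad bot's User Agent
    RewriteCond %{HTTP_USER_AGENT} AhrefsBot [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} MJ12bot [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} SemrushBot [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} Baiduspider [NC]
    # Answer matching requests with a 404
    RewriteRule .* - [R=404,L]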

Some known bad bots are listed in the above code; each RewriteCond line checks for a particular bot's User Agent.

NC makes the match case-insensitive ("no case") and OR combines the condition with the next one, so matching any one of the listed User Agents is enough to trigger the rule.

The last line sends a 404 error page to the bots listed above.

Another way of achieving the same result is the code below. This requires the mod_setenvif Apache module to be enabled.
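A sketch of that approach (again, the bot names are just examples):

    # Flag requests from known bad bots
    SetEnvIfNoCase User-Agent "AhrefsBot" bad_bot
    SetEnvIfNoCase User-Agent "MJ12bot" bad_bot
    SetEnvIfNoCase User-Agent "SemrushBot" bad_bot

    # Deny any request carrying the bad_bot flag (Apache 2.2 syntax)
    Order Allow,Deny
    Allow from all
    Deny from env=bad_bot

    # Apache 2.4 equivalent
    # <RequireAll>
    #     Require all granted
    #     Require not env bad_bot
    # </RequireAll>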

Advanced methods to stop bad bots

The above methods complete our basic protection against bad bots. However, there is an additional layer of complication beyond what we have seen so far.

  • What if some XYZ bot pretends to be coming from Google?
  • Also, not every bot we are unaware of is bad. So how do we differentiate between good and bad bots in order to stop the bad ones?

Some general guidelines to answer the above questions

Study the crawling pattern of the crawlers. Check how aggressive they are.

Don’t assume that the aggressive ones will always be picked up by your WAF. Some may be one step ahead of you: they may not be aggressive enough to get picked up by your WAF, they may change IPs too often, or they may be exploiting one of your vulnerable plugins.

Here is an article about fake Google bots. Although it is old, it is still applicable; if anything, things have gotten worse.

As per the report

For every 24 Googlebot visits a website will also be visited by a Fake Googlebot

Just recently I noticed that my Cloudflare Pro account had started blocking Googlebot IPs. Please see the screenshot below.

[Screenshot: Cloudflare blocking what appeared to be Googlebot IPs]

I was just about to unblock it, but then I thought of checking further. That was when I came to know that fake Googlebots exist. Until then I had been blaming Google for indexing my site too often and too aggressively. Below are the details of the fake Googlebot obtained through nslookup.
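A reverse lookup on one of the blocked addresses resolved to an Amazon EC2 hostname rather than to googlebot.com (the output below is a sketch; the IP is the same one that appears in the host lookup further down):

    nslookup 52.74.100.2

    Server:         8.8.8.8
    Address:        8.8.8.8#53

    Non-authoritative answer:
    2.100.74.52.in-addr.arpa    name = ec2-52-74-100-2.ap-southeast-1.compute.amazonaws.com.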

Clearly the above bot is not a genuine Google bot. A lot of bad bots run on Amazon servers; however, that alone is definitely not the deciding factor for whether a bot is good or bad.

One way to check is to run a reverse DNS lookup on the requesting IP.
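For the suspicious address the lookup can be done with the host utility (the IP used here is the one from the record that follows):

    host 52.74.100.2

which returns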

2.100.74.52.in-addr.arpa domain name pointer ec2-52-74-100-2.ap-southeast-1.compute.amazonaws.com.

while the same lookup run against a genuine Googlebot IP (66.249.66.1) returns

1.66.249.66.in-addr.arpa domain name pointer crawl-66-249-66-1.googlebot.com.

which tells us that the second one is a genuine Googlebot, as it resolves to googlebot.com.

Server load monitoring tools for your WordPress site

Here are some basic tools which will allow you to monitor server load and keep your WordPress site optimised.

Uptime (shell command)
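A sample run looks something like this (the load figures are illustrative; the uptime and user count match the description below):

     18:43:45 up 364 days,  3:12,  2 users,  load average: 0.45, 0.61, 0.58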

The line above is example output from the uptime command. It says the server has been up for 364 days, 2 users are logged in, and the remaining numbers show the average server load: the three values are the load averages for the last 1, 5 and 15 minutes.

If you have 4 CPUs and the load is 2 then your server is using half the CPU capacity.

If you have 2 CPUs and the load is 2 then your server CPU is running at full capacity.

A load above the number of CPUs means that the system is overloaded which reduces performance.

top (shell command)

The top command shows information such as tasks, memory, CPU and swap usage. Here is a sample of the kind of output top produces.
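(The figures below are purely illustrative and trimmed to the summary header.)

    top - 18:43:45 up 364 days,  3:12,  2 users,  load average: 0.45, 0.61, 0.58
    Tasks: 112 total,   1 running, 111 sleeping,   0 stopped,   0 zombie
    %Cpu(s):  3.2 us,  1.1 sy,  0.0 ni, 95.4 id,  0.2 wa,  0.0 hi,  0.1 si,  0.0 st
    KiB Mem :  4046436 total,   311204 free,  1824560 used,  1910672 buff/cache
    KiB Swap:  1048572 total,  1048572 free,        0 used.  1904348 avail Mem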


PHP sys_getloadavg function

The PHP sys_getloadavg() function returns an array containing the 1-, 5- and 15-minute load averages.
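A minimal usage sketch:

    <?php
    // Three samples: load average over the last 1, 5 and 15 minutes.
    $load = sys_getloadavg();
    echo 'Current 1-minute load: ' . $load[0];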

In the above code, $load[0] is the 1-minute server load value.

Based on the above, we could even stop Dashboard access temporarily for your editors if the load rises above a certain limit. Just add the code below to your theme's functions.php.
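A minimal sketch of such a check (the callback name is mine, and you would need to add the high_load_dashboard_access capability to any role that should keep access during high load):

    // Temporarily redirect Dashboard users away when the server load is high.
    add_action( 'admin_init', 'maybe_block_dashboard_on_high_load' );

    function maybe_block_dashboard_on_high_load() {
        $load = sys_getloadavg();

        // 0.8 is the load threshold; adjust it to suit your number of CPUs.
        if ( $load[0] > 0.8 && ! current_user_can( 'high_load_dashboard_access' ) ) {
            wp_redirect( home_url() );
            exit;
        }
    }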

If the server load goes above 0.8, all your editors and admins are redirected to your home page, except users who have the high_load_dashboard_access capability.

So only users having the above capability can access the system to try and monitor what is going on.

This is quite useful when the server is under heavy load and editors who cannot upload or edit content keep refreshing the page, thereby putting extra, unnecessary load on the server.

Monyt App

Install the Monyt app on your mobile. It's quite easy to configure. Depending on your device and server, just add the necessary server monitoring file on the server and provide its URL within the app.

The file on the server runs some server monitoring commands and creates a JSON file, which you can password protect. Simply add this file's URL to the Monyt app.

Points to consider before installing a new WordPress plugin

Installing a new plugin is very easy in WordPress. All you need to do is search for the plugin, select it, install it and activate it. If it does not suit your needs, just deactivate it, delete it and move on.

However, not many people realise what a plugin does in the background once it is activated, and they assume that once a plugin is deleted it is all gone, which is not true in most cases. A deleted plugin usually leaves quite a few traces in your system. Depending on the plugin, these traces can severely affect the performance of your system if you try too many plugins without checking what each one does in the background.

Many plugin developers do not follow WordPress coding standards and do not provide an uninstall routine for the plugin. This means you need to manually clean up all traces of the plugin after it is deactivated and deleted.

I am currently not experiencing any performance issue with my site

Although your site may be performing well at the moment, if you do not keep cleaning your system, a badly developed plugin can soon cause WordPress queries to slow down. Once the system starts slowing down it becomes difficult to clean up, because by then you won't be sure which tables are still in use and which database records are unnecessary; it is quite easy to forget which plugins you tried in the past.

For example, some plugin developers are not very efficient in their coding: they save each setting as a separate row in the wp_options table. Ideally all settings could be stored in a single row as an array. I remember a plugin which created 300 rows in the wp_options table to store its settings, with autoload set to yes, which means those rows get selected on every Dashboard page load.

WordPress loads all autoloaded rows from wp_options on every Dashboard page. If the table grows too big it will definitely slow down the system.
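As a sketch of the difference (the option names here are hypothetical):

    // Inefficient: one autoloaded row in wp_options per setting.
    update_option( 'myplugin_color', 'blue' );
    update_option( 'myplugin_size', 'large' );
    update_option( 'myplugin_mode', 'auto' );

    // Better: all settings kept as one array in a single row.
    update_option( 'myplugin_settings', array(
        'color' => 'blue',
        'size'  => 'large',
        'mode'  => 'auto',
    ) );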

Points to consider before installing a new WordPress plugin

First of all, install, activate and configure the plugin in your test environment or on localhost.

Check the source code both on the Dashboard and on your website: check how many CSS and JS calls the plugin makes, and whether it adds any inline CSS or other extra code.

Some plugin developers do this quite efficiently: they provide a settings screen and ask you on which Post Types you want the plugin to run, which means plugin-related CSS or JS files only appear on those Post Types.

Are the queries optimised: Check if your new plugin executes any queries which can slow down your system. To monitor such slow queries, make use of the Query Monitor plugin. Here are some examples of how to use the Query Monitor plugin to increase the performance of your WordPress site and to detect slow queries hampering your system.

Does it create new tables: Check if the plugin creates any new tables. If yes, make a note of those tables somewhere. Check what data it adds to those tables and whether the tables are really necessary or the plugin developer could have used WordPress's built-in tables. This will give you an idea of whether the developer has thought about optimising the code. Additional tables are OK, but using the built-in tables would be ideal, as WordPress already has optimised queries for those tables.

If you have a multisite, most probably the plugin would create extra tables for all your sites.

What does it add to the wp_options table: Some plugin authors add _transient records to this table. For example, a plugin used for related links would add the related links for each post to this table, so for 20k posts there would be 20k records. These records get selected on every Dashboard page load, which can slow down the Dashboard to a great extent.

The wp_options table is generally used to store plugin settings. Some plugin authors store each setting in a separate row; ideally, all settings should be stored in a single row in array form.

Does the plugin provide an uninstall routine: A good plugin developer provides an uninstall routine which deletes the plugin's tables, capabilities, entries in the wp_options table and so on once the plugin is deleted.
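A minimal uninstall.php sketch (the option and table names are hypothetical; WordPress runs this file when the plugin is deleted from the Plugins screen):

    <?php
    // Bail out if WordPress is not actually uninstalling the plugin.
    if ( ! defined( 'WP_UNINSTALL_PLUGIN' ) ) {
        exit;
    }

    // Remove the plugin's settings row from wp_options.
    delete_option( 'myplugin_settings' );

    // Drop the plugin's custom table.
    global $wpdb;
    $wpdb->query( "DROP TABLE IF EXISTS {$wpdb->prefix}myplugin_data" );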

 

How to tackle WordPress slow queries

Photo credit – elisfanclub

Here are some WordPress slow queries, i.e. queries which take more than 0.05s. The exact figures depend on your WordPress site: how big the database is, which plugins are installed and how the site is configured. However, if you are facing performance issues related to the Dashboard, they are quite likely to be caused by slow Dashboard queries.

Query Monitor is a good plugin to check and analyse your slow queries.

Some WordPress Slow queries

The query below auto-populates the Custom Fields drop-down box on the post edit screen.
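It comes from WordPress's meta_form() function and looks roughly like this (table prefix wp_ assumed):

    SELECT DISTINCT meta_key
    FROM wp_postmeta
    WHERE meta_key NOT BETWEEN '_' AND '_z'
    HAVING meta_key NOT LIKE '\_%'
    ORDER BY meta_key
    LIMIT 30;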

For large tables this query can take a lot of time, two seconds or more. If you do not need custom fields, it is very easy to turn them off using the function below.
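One common approach is to remove the Custom Fields meta box so the query never runs (a sketch; the callback name is mine):

    // Remove the Custom Fields meta box from the post editor so the slow
    // meta_form() query is never executed.
    add_action( 'admin_menu', 'myprefix_remove_custom_fields_metabox' );

    function myprefix_remove_custom_fields_metabox() {
        foreach ( get_post_types( '', 'names' ) as $post_type ) {
            remove_meta_box( 'postcustom', $post_type, 'normal' );
        }
    }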

For more information, read this interesting post on CSS-Tricks.

The query below runs on every Dashboard page, so it is important that your wp_options table is optimised.
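It is the autoloaded-options query (table prefix wp_ assumed):

    SELECT option_name, option_value
    FROM wp_options
    WHERE autoload = 'yes';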

Depending on the plugins you have installed, the wp_options table can grow rapidly. Some plugins use this table to store _transient options, which are cached objects. For example, a plugin called Manual Related Posts stores the related links for each post in a separate row as a _transient option, so if you have 50K posts there will be 50K such rows in this table. The table can also grow rapidly in size because each of those rows can be at least 1 MB in size.

The table structure is also not ideally optimised: for example, the option_id column is defined as bigint, and autoload is a varchar when it could have been a boolean or an enum. Depending on the table size and other configuration, the above query can take as long as 8 seconds to run, which is quite alarming.

A few tips to optimise the wp_options table

  1. Check for _transient entries and, if possible, replace the plugins which create a lot of them. Please note that although the purpose of these entries is caching, crucial queries run against this table on every Dashboard page, which defeats that purpose on large WordPress installations. The table structure does not help the cause either.
  2. If it is not possible to replace the plugins creating a lot of _transient entries, use the Transient Cleaner plugin, which deletes expired transient entries and does the housekeeping for you automatically.
  3. Change the table structure a bit: add the autoload column to the list of indexes and change option_id to int(12), as sketched below.
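A sketch of those tweaks (back up the table first; the index name is arbitrary):

    ALTER TABLE wp_options ADD INDEX autoload_idx (autoload);
    ALTER TABLE wp_options MODIFY option_id INT(12) UNSIGNED NOT NULL AUTO_INCREMENT;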