Stop bad bots from crawling your website


If you are experiencing performance issues with your site, one of the reasons could be that your bandwidth is being used up unnecessarily by bad bots. This article explains what bad bots are, how to identify them and how to stop them crawling your website so that your genuine visitors are not affected.

What are bad bots up to?

There are quite a number of bad bots in operation and the number keeps increasing rapidly. Most of them are run by hackers trying to find a vulnerability in your code. They may be trying to get credit card numbers from an online store, or scraping the text of an article to post it on some random blog. They may also want to steal the usernames and passwords of people in your database, hoping that people use the same credentials elsewhere, which surprisingly they do. Some may just want to post spam comments on your website.

How to stop bad bots from crawling your website

Bots are supposed to obey the rules in your robots.txt file, but bad bots simply ignore them. So blocking them through robots.txt alone is not enough (although I recommend that you still do this step). Another way to block them is to block their IP in your WAF. That is not the most efficient way either, because they keep changing their IP address. The same IP could also be used by a genuine user if the IP is part of a shared network, in which case you could end up blocking a genuine visitor too.

The recommended way to block is to check the User Agent of the requestor and block them on that basis. In some cases even that is not sufficient because the User Agent can also be faked to be coming from Google or MSN. In that case we need to carefully study other aspects of the request.

Let’s start with a basic understanding of blocking and allowing robots/crawlers.

Basics about blocking/allowing bots

Using robots.txt

Do not allow any bot (user-agent) to access any part of your site

Allow any bot (user-agent) to access any part of your site

Do not allow bingbot to access any part of your site

Allow bingbot to access any part of your site

Allow bingbot to access your site, disallow it to access your wp-admin folder
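As a rough sketch of what rules for the cases above could look like, the Python snippet below keeps a few robots.txt bodies as strings and checks them with the standard library's urllib.robotparser. The rule strings, the helper name allowed and the sample paths are my own illustration, not the original listings, and the check only shows how a well-behaved crawler would read the rules.

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt bodies (assumed examples, one per case described above).
BLOCK_ALL = "User-agent: *\nDisallow: /\n"
ALLOW_ALL = "User-agent: *\nDisallow:\n"
BLOCK_BINGBOT = "User-agent: bingbot\nDisallow: /\n"
BINGBOT_NO_ADMIN = "User-agent: bingbot\nDisallow: /wp-admin/\n"


def allowed(robots_txt: str, user_agent: str, path: str) -> bool:
    """Return True if a rule-abiding bot with this user agent may fetch the path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    print(allowed(BLOCK_ALL, "bingbot", "/"))
    print(allowed(BINGBOT_NO_ADMIN, "bingbot", "/wp-admin/options.php"))
```

Remember that this models an obedient crawler only; a bad bot never runs this check at all.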

Using Meta Tags:

Allow all bots to access your page(s)

Allow all bots to access your page(s) and follow links on the pages

Allow all bots to access your page(s) but do not allow them to follow links

Do not allow any bots to access your page(s)

Allow bingbot to access your page(s)

Do not allow Yahoo! Slurp to access your page(s)

Allow Yahoo! Slurp to access your page(s) and follow the links to more pages

Again, bad bots can easily ignore these rules, just as they ignore robots.txt. A more dependable solution is to block them through .htaccess.

Stop bad bots through HTACCESS:

Generally this method is not required for good bots, since they obey your robots.txt, and you would rarely need to block them anyway.

This method forcefully stops bad bots from getting into your site even though you have already warned them through your robots.txt. Through .htaccess we can block a bot by IP, by User Agent, or by both. We will look at both methods; the recommended one is blocking by User Agent.

A. Blocking IPs

The above rule in your .htaccess file will block a range of IP

B. Block by User Agent

This is the most recommended way and a good starting point to stop bad bots.

I have listed some bad bots already in the above code. Each line checks for a particular bot User Agent.

NC means "no case", i.e. the match is case-insensitive, and OR means the condition is combined with the next one using OR, so a match on any one of the listed User Agents is enough.

The last line sends a 404 error page to the bots listed above.
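The matching logic those rules express can be sketched in Python: a case-insensitive test (the NC part) of the User Agent against a list of patterns, where any single match is enough (the OR part). The bot names below are hypothetical placeholders and is_blocked is a name I made up; this is not the .htaccess code itself.

```python
import re

# Hypothetical bad-bot names, for illustration only; use the ones you see in your own logs.
BAD_BOT_PATTERNS = ["examplebadbot", "evilscraper", "spamharvester"]

# One alternation, compiled case-insensitively: any pattern matching blocks the request.
_BAD_BOT_RE = re.compile(
    "|".join(re.escape(name) for name in BAD_BOT_PATTERNS),
    re.IGNORECASE,
)


def is_blocked(user_agent: str) -> bool:
    """Return True if the User-Agent header contains any listed bad-bot name."""
    return bool(_BAD_BOT_RE.search(user_agent or ""))


if __name__ == "__main__":
    print(is_blocked("Mozilla/5.0 (compatible; EvilScraper/2.1)"))
```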

Another way of achieving the same result is to use the code below. This requires the mod_env Apache module to be installed.

Advanced methods to stop bad bots

The above methods complete our basic protection against bad bots. There is just one additional layer of complication on top of what we have seen so far.

  • What if some XYZ bot pretends to be coming from Google?
  • Also, not every bot that we are unaware of is bad. So how do we differentiate between good and bad bots in order to stop the bad ones?

Some general guidelines to answer above questions

Study the crawling pattern of the crawlers. Check how aggressive they are.

Don’t assume that the aggressive ones will always be picked up by your WAF. Some may be one step ahead of you: they may not be aggressive enough to be picked up by your WAF, they may change IPs too often, or they may be exploiting one of your vulnerable plugins.

Here is an article about fake Googlebots. Although it is old it is still applicable; if anything, things have got worse.

As per the report

For every 24 Googlebot visits a website will also be visited by a Fake Googlebot

Just recently I noticed that my Cloudflare Pro account had started blocking Googlebot IPs. Please see the screenshot below.

Stop bad bots

I was just about to unblock it, but then I decided to check further details. That is when I came to know that fake Googlebots exist. Until then I had been blaming Google for indexing my site too often and too aggressively. Below are the details of the fake Googlebot obtained through nslookup.

Clearly the above bot is not a genuine Googlebot. A lot of bad bots run on Amazon servers. However, that is definitely not the deciding factor for whether a bot is good or bad.

One way is to check using below command

2.100.74.52.in-addr.arpa domain name pointer ec2-52-74-100-2.ap-southeast-1.compute.amazonaws.com.

while the one below returns

1.66.249.66.in-addr.arpa domain name pointer crawl-66-249-66-1.googlebot.com.

which tells us that the second one is a Googlebot, as it comes from googlebot.com.
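The same check can be automated. The Python sketch below does the reverse lookup and then a forward lookup on the returned hostname to confirm it maps back to the same IP, since whoever controls an IP can make its reverse record say anything. The function names and the injectable resolver parameters are my own assumptions for illustration; the accepted suffixes are googlebot.com (as in the output above) and google.com.

```python
import socket
from typing import Callable, List

# Hostname suffixes treated as Google-owned in this sketch.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def reverse_lookup(ip: str) -> str:
    """Return the PTR hostname for an IP address (needs network access)."""
    return socket.gethostbyaddr(ip)[0]


def forward_lookup(hostname: str) -> List[str]:
    """Return the IPv4 addresses a hostname resolves to (needs network access)."""
    return socket.gethostbyname_ex(hostname)[2]


def is_genuine_googlebot(
    ip: str,
    reverse: Callable[[str], str] = reverse_lookup,
    forward: Callable[[str], List[str]] = forward_lookup,
) -> bool:
    """Reverse-resolve the IP, check the domain, then confirm with a forward lookup."""
    try:
        hostname = reverse(ip).rstrip(".").lower()
        if not hostname.endswith(GOOGLE_SUFFIXES):
            return False
        return ip in forward(hostname)
    except (OSError, KeyError):
        # Lookup failures (or unknown entries in a test table) count as "not verified".
        return False
```

Calling is_genuine_googlebot("66.249.66.1") uses real DNS; passing your own reverse and forward callables lets you try the logic offline.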

Server load monitoring tools for your wordpress site


Here are some basic tools which will allow you to monitor server load to keep your wordpress site optimised.

Uptime (shell command)

The above is an example of the uptime command's output. It says the server has been up for 364 days, 2 users are logged in, and the remaining numbers show the server's load averages for the last 1, 5 and 15 minutes.

If you have 4 CPUs and the load is 2 then your server is using half the CPU capacity.

If you have 2 CPUs and the load is 2 then your server CPU is running at full capacity.

A load above the number of CPUs means that the system is overloaded which reduces performance.
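The arithmetic behind these three statements is simply load divided by the number of CPUs. Here is a small Python sketch of it; the function names are my own, and current_load_ratio relies on os.getloadavg, which is available on Unix-like systems only.

```python
import os


def load_ratio(load_average: float, cpu_count: int) -> float:
    """Fraction of CPU capacity in use: 1.0 means every CPU is fully busy."""
    if cpu_count < 1:
        raise ValueError("cpu_count must be at least 1")
    return load_average / cpu_count


def is_overloaded(load_average: float, cpu_count: int) -> bool:
    """True when the load average exceeds the number of CPUs."""
    return load_ratio(load_average, cpu_count) > 1.0


def current_load_ratio() -> float:
    """Ratio for the last minute on this machine (Unix-like systems only)."""
    one_minute, _five, _fifteen = os.getloadavg()
    return load_ratio(one_minute, os.cpu_count() or 1)
```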

top (shell command)

top command shows information like tasks, memory, cpu and swap. Here is a sample output of the top command.

Server monitoring

PHP sys_getloadavg function

The sys_getloadavg function returns an array of three load averages, for the last 1, 5 and 15 minutes.

In the above code, $load[0] would be the load average for the last minute.

Based on the above code we could even stop Dashboard access temporarily for all your editors if the load increased to a certain limit. Just add below code to your functions.php

If the server load rises above 0.8, all your editors and admins will be redirected to your home page, except users who have the high_load_dashboard_access capability.

So only users having the above capability can access the system to try and monitor what is going on.

This is quite useful in cases you are experiencing heavy load on the server and editors are not able to upload or edit content and keep refreshing the page thereby causing extra load unnecessarily on the server.

Monyt App

Install the Monyt app on your mobile. It's quite easy to configure. Depending on your device and server, just add the necessary server monitoring file to the server and provide its URL in the app.

The file on the server runs some server monitoring commands and creates a json file which you can password protect. Simply add this file URL to Monyt app.

WordPress vulnerability – Bypass any password protected post

WordPress has just released version 4.5.3, which is mainly a security release fixing a major security issue present in all previous WordPress versions. It is strongly recommended that you update your sites to the latest version immediately.

This vulnerability allows an attacker to gain access to password-protected posts in WordPress. The risk is higher for WordPress installations with open registration.

Wordfence, a popular WordPress security plugin, disclosed this vulnerability to WordPress on 3rd May.

How to prevent WordPress CSRF attack

WordPress CSRF attack happens the same way as it happens on other sites. WordPress provides some inbuilt tools to protect against CSRF. We will see how to make use of these tools while creating our own wordpress plugins.

Wordpress CSRF Attack
Photo credit – 2508581015littleblackcamera

What is CSRF ?

CSRF means Cross-Site Request Forgery. It is a type of attack that occurs when a malicious web site, email, blog, instant message, or program causes a user’s web browser to perform an unwanted action on a trusted site for which the user is currently authenticated.

How does it happen ?

For example, if you have a form on your website and you haven’t protected it against CSRF attacks, a hacker can create a similar form elsewhere and trick one of your users into submitting it. This means the hacker can fill any values into the form. The damage depends on what the form does.

How to prevent CSRF

In short, to prevent CSRF attack all we need to do is to check if the right user is performing the right action on your website.

WordPress CSRF attack and Nonces

WordPress has an inbuilt facility called nonces to prevent such attacks. A nonce is a short code (a mix of letters and numbers) which is generated automatically and sent as a hidden field in the form. On the submit page this value is compared with the expected value, and further action is allowed only if the two match. A nonce has a limited lifetime and changes once that lifetime has expired for that user. Although a hacker could see the value in your page source, it would not be valid for anyone else because it depends on the user and keeps changing.

However, WordPress nonces alone are not enough to prevent CSRF. We also need to check user permissions before executing a certain action.
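To make the idea concrete, here is a minimal Python sketch of a nonce that depends on the user, the action and a time window, in the spirit of the description above. It is not WordPress's actual implementation: the secret key, the 24-hour lifetime and the function names are all assumptions for illustration.

```python
import hashlib
import hmac
import time
from typing import Optional

SECRET_KEY = b"replace-with-a-long-random-secret"  # hypothetical site secret
LIFETIME_SECONDS = 24 * 60 * 60                    # assumed nonce lifetime
HALF_LIFE = LIFETIME_SECONDS // 2                  # nonce changes every half lifetime


def _tick(now: float) -> int:
    """Number of half-lifetime windows elapsed; increases as time passes."""
    return int(now // HALF_LIFE)


def _sign(user_id: int, action: str, tick: int) -> str:
    message = f"{tick}|{action}|{user_id}".encode()
    return hmac.new(SECRET_KEY, message, hashlib.sha256).hexdigest()


def create_nonce(user_id: int, action: str, now: Optional[float] = None) -> str:
    """Value to place in the form's hidden field."""
    now = time.time() if now is None else now
    return _sign(user_id, action, _tick(now))


def verify_nonce(nonce: str, user_id: int, action: str, now: Optional[float] = None) -> bool:
    """Accept nonces from the current or the previous time window only."""
    now = time.time() if now is None else now
    tick = _tick(now)
    return any(
        hmac.compare_digest(nonce, _sign(user_id, action, t))
        for t in (tick, tick - 1)
    )
```

A nonce created for one user fails verification for another user, for a different action, or after the lifetime has passed. The permission check mentioned above is still a separate step.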

CSRF protection on your forms

In the above form the function wp_nonce_field creates a hidden field containing a nonce string.

Below code goes on your submit form action/page

On the submit page the nonce value in the hidden field is validated using the function wp_verify_nonce, and only then does the form get processed.

CSRF protection on your AJAX calls

Prevent WordPress CSRF attacks by protecting your Ajax calls too. jQuery calling an unprotected PHP page can have severe security implications.

Here is how we can apply CSRF protection on Ajax calls

Below code goes on your PHP page

check_ajax_referer verifies the AJAX request to prevent processing external (malicious) requests.

Show custom field validation errors in WordPress Admin

Wordpress Admin Notices

If you are creating your own custom post type in WordPress, you will probably use custom fields to store data related to each post.

For example, if you create a custom post type for events then you would store data such as the event start date, end date, address, etc. in custom fields.

Until all the required custom fields are filled in you do not want to publish the event, so you would want to warn the event editor about it.

The WordPress admin_notices hook makes this very easy.

admin_notices is the hook available to display the messages

add_settings_error – Registers the setting error to be displayed to the user

settings_errors – This function simply displays all the errors line by line

Setup wordpress cron jobs and debug methods


Why would you need to setup WordPress cron jobs ?

Cron jobs in WordPress can be set for following reasons

  1. You have a custom post type for events and you wish to archive all your old events at regular intervals
  2. Archive logs e.g. if you have some plugin which tracks user activity then you may want to archive the logs table at regular intervals
  3. If you have integrated wordpress with some external application e.g. a mailing server then you need to synchronise your mailing lists with the external server
  4. Clearing caches for certain pages

How to setup wordpress cron jobs

wp_schedule_event is the function used to set up wordpress cron jobs

Here’s a sample code to set up a cron job

The above code will execute the function wpi_some_cron_job on daily basis. Some of the other options are hourly, twicedaily

Now the question is where do we add the above code. The answer to this depends whether you want to set this up as part of your own custom plugin or this cron is just some adhoc function which you wish to execute for housekeeping purpose e.g. clearing expired transients in your wp_options table.

Here is the code to add the cron job in the activation hook of your own plugin

Here is the code to add the cron job within your functions.php

If you have added the code to the activation hook of your plugin, the scheduled event needs to be removed when the plugin is deactivated. Here’s how to do that.

Debugging WordPress Cron jobs

A. Try triggering the WordPress cron engine manually by opening below URL in your browser

http://example.com/wp-cron.php?doing_wp_cron

B. Turn on WP_DEBUG on your development environment by adding below line in your wp-config.php file

C. Create a custom field for debugging purposes and have the cron job increment it every time it runs. Then check whether the field gets updated.

Hide or change name of Publish button for Custom Post Type

Let's assume you are creating a Custom Post Type for publishing events. You do not wish to show the publish button to your editors until all the event fields are properly filled and validated.

This can be achieved using below code

 

Selectively exclude pages from being cached in W3 Total Cache


You may want to exclude some dynamic pages from being cached in W3 Total Cache. To exclude a particular page from caching, W3 Total Cache just needs the below line of code on that page before the opening html tag

For this code to appear on such pages all you need to do is to create some custom field e.g. nocache

Just add this custom field to the post/page you want to exclude from caching and set the value to 1

In your header.php add below lines above the html tag

Other w3total cache options

  1. Disable database caching => DONOTCACHEDB
  2. Disable minify => DONOTMINIFY
  3. Disable CDN (Content Delivery Network) => DONOTCDN
  4. Disable Object Caching => DONOTCACHEOBJECT

Sendgrid Contacts API Examples


If you are using Sendgrid to send your Marketing / Promotional emails then using the Sendgrid Contacts API can automate quite a few things for you. Here are some simple API examples to add/remove/edit recipients to the Sendgrid Contacts Database and synchronising the Sendgrid lists with the WordPress mailing lists.

There are 2 ways to synchronise your mailing lists in wordpress with the Sendgrid lists. One is using the daily/weekly cron job and the other is through real time i.e. an on demand system.

Obviously we all would prefer the real time system. However it largely depends on how you handle your subscription process in WordPress.

  • If you have created your own subscription system then you are either storing the contacts in your own table or you are using the WordPress usermeta table. API can be called by making the necessary changes in your system. Synchronisation of the mailing lists in that case can be done in real time.
  • If you are using some plugin which offers hooks on various actions like
    • user subscription,
    • user confirmation/verification,
    • user unsubscription,

    simply use these hooks to provide real time synchronization by adding the necessary code in functions.php.

  • However, if you are using some WordPress plugin which does not offer hooks as mentioned above then the only option is to use a cron job.
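For the cron job route, the core of each run is working out which contacts differ between the two lists. A minimal Python sketch of that comparison follows; the function names are hypothetical, and fetching the two lists and calling the Sendgrid API with the result are deliberately left out.

```python
from typing import Iterable, Set, Tuple


def normalise(emails: Iterable[str]) -> Set[str]:
    """Lower-case and trim addresses so the same contact compares equal."""
    return {email.strip().lower() for email in emails if email and email.strip()}


def plan_sync(
    wordpress_emails: Iterable[str],
    sendgrid_emails: Iterable[str],
) -> Tuple[Set[str], Set[str]]:
    """Return (to_add, to_remove) so the Sendgrid list ends up matching WordPress."""
    local = normalise(wordpress_emails)
    remote = normalise(sendgrid_emails)
    to_add = local - remote      # subscribed in WordPress, missing in Sendgrid
    to_remove = remote - local   # still in Sendgrid, no longer in WordPress
    return to_add, to_remove
```

Treating the WordPress list as the source of truth is a design choice of this sketch; reverse the roles if unsubscribes are handled on the Sendgrid side.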

Here are some Sendgrid Contacts API PHP examples. They show how to synchronize recipients on your existing mailing list with Sendgrid.

Here is the official documentation on Sendgrid API

Add New Recipient – Sendgrid Contacts API

The above code adds a new recipient to the Sendgrid contacts database. It does not add the recipient if it already exists, but in either case it returns the Recipient ID. The above code does not assign any lists to the contact. To assign a Sendgrid list to the contact we need to know the recipient ID, which Sendgrid returns after adding a new contact to the database.

Through the above code we get the recipient ID in the $receipient_id variable. You can store this value in your WordPress table, as you will need it to perform Sendgrid tasks for the recipient whenever the contact gets edited within WordPress. If you do not store it, you will need to execute the above code each time you need the recipient ID for a contact.

Add Recipient to a particular list – Sendgrid Contacts API

Above function requires the ID of the Sendgrid List (SENDGRID_LIST_ID) to which the recipient needs to be added.

You can get the Sendgrid List ID from the Sendgrid interface. Just click on the list and you will notice the List ID in the URL

If you have stored the recipient ID received from Sendgrid in your own WordPress table then you can read it from there; otherwise you may need to run the Add New Recipient function again to get the recipient ID.

Remove/Delete recipient from a list – Sendgrid Contacts API

This function requires the Sendgrid List ID and the Recipient ID as input. It removes the recipient from a particular list. It does not delete the recipient from the Sendgrid Contacts Database.

Offsite data storage for Disaster Recovery

Photo credit – williamhook2631871046

Offsite data storage for your website means copying your database, code, media and other files to a remote server so that in case of any disaster the server can be rebuilt using the data available on the remote server.

Copying your database and code backups and other files like images, etc to a remote/offsite server can be done for various reasons. It is mostly done for 2 reasons.

  1. Preparing for Disaster Recovery, where the server can be rebuilt using the data from offsite data storage
  2. Redundancy (when the live server stops due to some reason, the remote server can take over)

Although the steps below can work in both cases, this particular article is written from the viewpoint of understanding how to set up offsite data storage for Disaster Recovery.

For preparing disaster recovery the remote server should ideally follow below criteria

Backup server requirements

  1. The hosting provider should be different to your live/production server
  2. The hosting server should be in a different location to your live/production server
  3. Bandwidth may not be a crucial factor but Disk space is important so you need to plan a proper size server after studying your data and storage requirements

How to handle different file types for offsite data storage

1. Database backups

There are a number of ways the database can be copied/replicated to the remote server

  1. Real time: Techniques like MySQL replication help to achieve real time data synchronization between the live and remote servers. Any updates performed on the live server are replicated to the remote/slave server. Here is the documentation on how to set up MySQL replication. For this method it is important that your database is accessible over the network.
  2. mysqldump: This is a very simple way of copying the live server database to the remote server. Read how to create a script to copy the database from the live server to the remote server automatically. For this method your database does not need to be accessible over the network, but the changes will not be real time. How current the copy is depends on the cron job you set up for it.
  3. Backup Utility: There are various 3rd party utilities. Read how Sypex Dumper (SXD) can create automatic backups for you which you can transfer to the remote server automatically on a daily basis. The database can then be imported through the SXD installation on the remote server.

2. Application/code files and images

The best way to transfer code files and images is using the rsync (remote sync) command. The below rsync command runs on the origin/live server to copy the incremental changes to the remote server. This script can be set up as a cron job to run on a daily basis.
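If you prefer to drive rsync from a script, a minimal Python sketch could look like the following. The paths, the remote account and the function names are hypothetical, and the flags are only a common starting point (-a for archive mode, -z for compression), not necessarily the ones used in the original command.

```python
import subprocess
from typing import List


def build_rsync_command(source_dir: str, remote: str, remote_dir: str) -> List[str]:
    """Build an rsync call that copies the contents of source_dir to remote:remote_dir."""
    # A trailing slash makes rsync copy the directory's contents, not the directory itself.
    source = source_dir.rstrip("/") + "/"
    return ["rsync", "-az", source, f"{remote}:{remote_dir}"]


def push_files(source_dir: str, remote: str, remote_dir: str) -> int:
    """Run rsync and return its exit code (0 means success)."""
    command = build_rsync_command(source_dir, remote, remote_dir)
    return subprocess.run(command, check=False).returncode


if __name__ == "__main__":
    # Hypothetical values; replace with your own web root and backup host.
    print(build_rsync_command("/var/www/html", "backup@backup.example.com", "/srv/backups/site/"))
```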

3. Housekeeping on the remote server

If you are transferring database to the remote server on daily basis, soon your remote server disk might get full. It is important to set up some housekeeping script on the remote server.

The above script keeps the last 10 days of database backup files and deletes the rest. Again, a cron job can be set up on the remote server to perform this automatically on a daily basis.
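The same retention rule can be sketched in Python as below. The *.sql.gz file pattern, the directory path and the function name are assumptions for illustration; the age of a file is taken from its modification time.

```python
import time
from pathlib import Path
from typing import List, Optional


def prune_old_backups(
    backup_dir: str,
    keep_days: int = 10,
    pattern: str = "*.sql.gz",   # assumed naming of the database dump files
    now: Optional[float] = None,
) -> List[Path]:
    """Delete backup files older than keep_days and return the paths removed."""
    now = time.time() if now is None else now
    cutoff = now - keep_days * 24 * 60 * 60
    removed = []
    for path in sorted(Path(backup_dir).glob(pattern)):
        if path.is_file() and path.stat().st_mtime < cutoff:
            path.unlink()
            removed.append(path)
    return removed


if __name__ == "__main__":
    # Hypothetical backup location on the remote server.
    for deleted in prune_old_backups("/srv/backups/db"):
        print(f"deleted {deleted}")
```

Try it on a copy of your backup folder first, since deleted files cannot be recovered.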
