Deathwing00's Space WebLog of Ioannis Aslanidis (aka Deathwing Zero Zero)

2 Dec 2011

Check that the VPN is up and running with Nagios

A problem I once had to deal with was using Nagios to know when the VPN stopped working. The trouble with relying on simple private network ping checks is that I could not easily tell at first glance whether the problem lay with the VPN or with the whole network.

So I wrote this little script, which takes three arguments and can tell whether it is actually the VPN that has a problem. It also checks for packet loss and can detect when the whole public network goes down.

The three arguments this check receives are the following:

  1. Public IP Address: A public IP address to ping in order to know whether the network is working properly.
  2. Private (VPN) IP Address: A private network IP address to ping in order to know whether the VPN is working properly.
  3. Number of pings: Total number of pings to send in order to know more about packet loss.

The check issues a CRITICAL only if there is 100% ICMP loss while pinging the private network and the public network is suffering less than 100% ICMP loss.
The check issues a WARNING if either the public or the private or both IP addresses suffer any kind of packet loss. It also issues a WARNING if the public network suffers 100% ICMP loss.
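
The rules above can be sketched as follows. This is only an illustration in Python, not the check_vpn script itself: the function names are made up for the example, and the ping parsing assumes the Linux iputils summary line ("N% packet loss").

```python
import re
import subprocess

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2


def packet_loss(host, count):
    """Ping `host` `count` times and return the reported loss in percent.

    Assumes the iputils summary format, e.g. "0% packet loss".
    """
    result = subprocess.run(
        ["ping", "-c", str(count), host],
        capture_output=True, text=True,
    )
    match = re.search(r"(\d+(?:\.\d+)?)% packet loss", result.stdout)
    # No summary line (for example, the host name did not resolve):
    # treat it as total loss.
    return float(match.group(1)) if match else 100.0


def vpn_status(public_loss, private_loss):
    """Map the two loss percentages to a Nagios state and message."""
    if private_loss >= 100 and public_loss < 100:
        # Public network reachable, private one unreachable: the VPN is down.
        return CRITICAL, "CRITICAL - VPN down (public loss %.0f%%)" % public_loss
    if public_loss > 0 or private_loss > 0:
        # Any loss on either side, including a fully dead public network.
        return WARNING, "WARNING - packet loss: public %.0f%%, private %.0f%%" % (
            public_loss, private_loss)
    return OK, "OK - no packet loss on public or private address"
```

A wrapper would call packet_loss() once for each address, pass both results to vpn_status(), print the message and exit with the returned code.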

You can download the full source code for this check at the following link: check_vpn

20 Mar 2010

Check that a physical link is up with the proper speed

This check is great for detecting when a network cable, for whatever reason, deteriorates and stops providing the desired uplink speed. It works on any system that has ethtool installed.

This particular check has helped me as a sysadmin to detect poor-quality RJ-45 Cat 5e cables that, after being reused many times, deteriorate and no longer deliver 1 Gbps. I have also been able to detect network card failures and malfunctioning switch ports.
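
To give an idea of how such a check can work, here is a rough Python sketch that reads the "Speed:" and "Link detected:" lines printed by ethtool. It is not the original check: the function names, and the choice of CRITICAL for a missing link and WARNING for a lower speed, are assumptions made for the example.

```python
import re
import subprocess


def ethtool_output(interface):
    """Return the text printed by `ethtool <interface>`."""
    result = subprocess.run(
        ["ethtool", interface], capture_output=True, text=True)
    return result.stdout


def parse_ethtool(output):
    """Extract (link_up, speed_mbps) from ethtool's text output.

    speed_mbps is None when ethtool reports no numeric speed,
    for example "Speed: Unknown!" on a link that is down.
    """
    link = re.search(r"Link detected:\s*(yes|no)", output)
    speed = re.search(r"Speed:\s*(\d+)Mb/s", output)
    link_up = bool(link) and link.group(1) == "yes"
    speed_mbps = int(speed.group(1)) if speed else None
    return link_up, speed_mbps


def check_link(output, expected_mbps):
    """Return a (Nagios exit code, message) pair; severities are assumed."""
    link_up, speed = parse_ethtool(output)
    if not link_up:
        return 2, "CRITICAL - no link detected"
    if speed is None or speed < expected_mbps:
        return 1, "WARNING - link speed %s Mb/s, expected %d Mb/s" % (
            speed, expected_mbps)
    return 0, "OK - link up at %d Mb/s" % speed
```

For a gigabit port, check_link(ethtool_output("eth0"), 1000) would warn as soon as the link negotiates down to 100 Mb/s, which is the typical symptom of a worn cable.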

7 Mar 2010

Check that an FTP account is fully working

This script uses lftp, a sophisticated FTP/HTTP client, to check not only that a given FTP account is accessible, but also that it can list files and directories, get and put files, and delete files. This simple script is fast, easy to configure, flexible and easy to extend.

Sometimes, things like SELinux, a failed network mount point or wrong permissions cause an FTP account to not work properly. With this check, you will be able to detect it immediately.
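
The same list, put, get and delete round trip can be sketched in Python. Note that this example uses the standard ftplib module instead of lftp, so it is not the original script; the function names and the probe file name are invented for the illustration.

```python
import io
from ftplib import FTP, all_errors


def check_ftp_account(ftp, probe_name="nagios_probe.txt"):
    """Run list, put, get and delete on a logged-in ftplib.FTP object.

    Returns a (Nagios exit code, message) pair.
    """
    payload = b"nagios ftp probe\n"
    try:
        listing = []
        ftp.retrlines("LIST", listing.append)                       # list
        ftp.storbinary("STOR " + probe_name, io.BytesIO(payload))   # put
        received = io.BytesIO()
        ftp.retrbinary("RETR " + probe_name, received.write)        # get
        ftp.delete(probe_name)                                      # delete
    except all_errors as exc:
        return 2, "CRITICAL - FTP operation failed: %s" % exc
    if received.getvalue() != payload:
        return 2, "CRITICAL - downloaded file differs from uploaded file"
    return 0, "OK - list, put, get and delete all succeeded"


def connect_and_check(host, user, password, timeout=10):
    """Log in to `host` and run the round trip; connection errors are CRITICAL."""
    try:
        with FTP(host, timeout=timeout) as ftp:
            ftp.login(user, password)
            return check_ftp_account(ftp)
    except all_errors as exc:
        return 2, "CRITICAL - cannot log in: %s" % exc
```

Because every step touches the real account, a wrong permission, an SELinux denial or a broken mount behind the FTP root shows up as a failed operation rather than as a successful login.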

7 Mar 2010

Check that any network filesystem partition is correctly mounted

A colleague of mine, Thomas Blanchin, has improved my Nagios check for mounted glusterfs partitions so that it works with any network file system and generates the proper output without much trouble.

17 Feb 2010

Check that a glusterfs partition is mounted

When using glusterfs in a production system, it is mandatory to properly monitor that the partition is mounted and performing well, especially in heavily loaded environments.

I have created a nagios plugin in bash that monitors a glusterfs mounted partition and detects whether the partition gets unmounted, responds slowly or gets disconnected from the server (causing reading processes to die in an uninterruptible sleep state, which will force you to restart the system in order to get rid of them).
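
The plugin itself is written in bash. Purely as an illustration of the three conditions (unmounted, slow, not responding), here is a Python sketch. The function names, the use of /proc/mounts and stat, and the severity mapping are all assumptions of this example, not a description of the plugin.

```python
import subprocess
import time


def is_mounted(mount_point, mounts_file="/proc/mounts"):
    """Look for the mount point in /proc/mounts without touching the file system.

    Note: /proc/mounts escapes spaces in paths, which this sketch ignores.
    """
    with open(mounts_file) as handle:
        for line in handle:
            fields = line.split()
            if len(fields) >= 2 and fields[1] == mount_point:
                return True
    return False


def probe_seconds(mount_point, timeout):
    """Time `stat` on the mount point in a child process.

    Returns None when the child does not finish within `timeout` seconds,
    so the check itself never blocks on a hung mount.
    """
    start = time.monotonic()
    proc = subprocess.Popen(
        ["stat", mount_point],
        stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    try:
        proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        # Best effort only: a child in uninterruptible sleep may not die.
        proc.kill()
        return None
    return time.monotonic() - start


def classify(mounted, elapsed, warn_seconds):
    """Map the observations to a (Nagios exit code, message) pair."""
    if not mounted:
        return 2, "CRITICAL - partition is not mounted"
    if elapsed is None:
        return 2, "CRITICAL - partition is not responding"
    if elapsed > warn_seconds:
        return 1, "WARNING - partition responded in %.2fs" % elapsed
    return 0, "OK - partition responded in %.2fs" % elapsed
```

Running the probe in a separate process matters here: if the check touched a disconnected mount directly, it could end up in the same uninterruptible sleep it is supposed to detect.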

17 Feb 2010

Check the percentage of CPU consumed by processes with the same name during a certain interval

Many Nagios scripts use ps to compute the percentage of CPU consumed by a process. Although at first glance this might seem a good approach, if you read the documentation carefully, you will notice this:

CPU usage is currently expressed as the percentage of time spent running during the entire lifetime of a process. This is not ideal, and it does not conform to the standards that ps otherwise conforms to. CPU usage is unlikely to add up to exactly 100%.

This means that ps is useless if you need to know whether a certain process is consuming a lot of CPU during a given interval. For instance, imagine that you want to detect whether a given process has hung and is consuming lots of CPU; with ps, a long-running process that only recently started spinning still shows a low lifetime average, so you will be unable to detect it.

To work around this and provide a proper monitoring solution for this type of problem, I have written a script in Python that calls top. This command reports the percentage of CPU used during a given interval, not over the whole lifetime of the process.
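
The general idea can be sketched like this. It is a simplified illustration rather than the actual script: it assumes procps top in batch mode with the default column layout (a header line containing PID, %CPU and COMMAND), and the function names are made up.

```python
import os
import subprocess


def sample_top(interval_seconds):
    """Run top in batch mode for two iterations, `interval_seconds` apart.

    The second iteration reports CPU usage over that interval.
    LC_ALL=C keeps the decimal separator a dot.
    """
    env = dict(os.environ, LC_ALL="C")
    result = subprocess.run(
        ["top", "-b", "-n", "2", "-d", str(interval_seconds)],
        capture_output=True, text=True, env=env)
    return result.stdout


def cpu_by_name(top_output, name):
    """Sum %CPU of all processes called `name` in the last iteration."""
    total = 0.0
    cpu_col = cmd_col = None
    for line in top_output.splitlines():
        fields = line.split()
        if "PID" in fields and "%CPU" in fields and "COMMAND" in fields:
            # A header line starts a new iteration: drop the previous sum.
            cpu_col = fields.index("%CPU")
            cmd_col = fields.index("COMMAND")
            total = 0.0
            continue
        if cpu_col is None or len(fields) <= cmd_col:
            continue
        if fields[cmd_col] == name:
            try:
                total += float(fields[cpu_col])
            except ValueError:
                # Not a process line after all; skip it.
                continue
    return total
```

A check would then compare cpu_by_name(sample_top(5), "httpd") against its warning and critical thresholds. Only the last iteration is used because the first one printed by top does not cover the chosen interval.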