Deathwing00's Space WebLog of Ioannis Aslanidis (aka Deathwing Zero Zero)

20Mar/107

Check that a physical link is up with the proper speed

This check detects when a network cable deteriorates, for whatever reason, and stops providing the desired uplink speed. It works on any system that has ethtool installed.

As a sysadmin, this check has helped me detect poor-quality cables that deteriorate after being reused many times and no longer deliver 1Gbps over RJ-45 CAT 5E. It has also let me detect network card failures and malfunctioning switch ports.
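To give an idea of how such a check can work, here is a minimal sketch in Python. It is not the actual plugin: the function names are hypothetical, and it assumes that `ethtool <interface>` prints lines of the form `Speed: 1000Mb/s` and `Link detected: yes`. The return values follow the usual Nagios exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).

```python
import re
import subprocess

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def parse_ethtool(output):
    """Return (speed_mbps or None, link_up) from the text printed by ethtool."""
    speed = re.search(r"Speed:\s*(\d+)Mb/s", output)
    link = re.search(r"Link detected:\s*(yes|no)", output)
    speed_mbps = int(speed.group(1)) if speed else None
    link_up = link is not None and link.group(1) == "yes"
    return speed_mbps, link_up


def check_link(output, expected_mbps):
    """Map ethtool output to a (Nagios exit code, message) pair."""
    speed, link_up = parse_ethtool(output)
    if not link_up:
        return CRITICAL, "CRITICAL - link is down"
    if speed is None:
        return UNKNOWN, "UNKNOWN - could not read the link speed"
    if speed < expected_mbps:
        return CRITICAL, f"CRITICAL - link at {speed}Mb/s, expected {expected_mbps}Mb/s"
    return OK, f"OK - link up at {speed}Mb/s"


def run_check(interface, expected_mbps):
    """Run ethtool on the given interface (hypothetical entry point)."""
    result = subprocess.run(["ethtool", interface], capture_output=True, text=True)
    return check_link(result.stdout, expected_mbps)
```

A wrapper script would call something like `run_check("eth0", 1000)`, print the message and exit with the returned code.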

7Mar/100

Check that an FTP account is fully working

This script uses lftp, a sophisticated FTP/HTTP client, to check not only that a given FTP account is accessible, but also that it can list files and directories, get and put files, and delete files. The script is fast, easy to configure, flexible and easy to extend.

Sometimes, things like SELinux, a failed network mount point or wrong permissions cause an FTP account to not work properly. With this check, you will be able to detect it immediately.
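The list, put, get and delete sequence can be sketched roughly as follows. This is an illustration in Python rather than the script itself: the function names, the remote file name and the timeout are assumptions. It relies on the lftp settings `cmd:fail-exit` (abort on the first failing command) and `xfer:clobber` (allow overwriting the downloaded copy).

```python
import os
import subprocess
import tempfile


def build_lftp_script(local_file, remote_dir, remote_name="nagios_ftp_check.tmp"):
    """Build the lftp command sequence: list, put, get, delete.

    Paths containing spaces are not handled in this sketch.
    """
    fetched = local_file + ".fetched"
    commands = [
        "set cmd:fail-exit yes",   # stop at the first command that fails
        "set xfer:clobber yes",    # allow get to overwrite an old local copy
        f"cd {remote_dir}",
        "ls",
        f"put {local_file} -o {remote_name}",
        f"get {remote_name} -o {fetched}",
        f"rm {remote_name}",
        "bye",
    ]
    return "; ".join(commands)


def check_ftp(host, user, password, remote_dir, timeout=30):
    """Return a (Nagios exit code, message) pair for the FTP account."""
    with tempfile.TemporaryDirectory() as workdir:
        local_file = os.path.join(workdir, "probe.txt")
        with open(local_file, "w") as handle:
            handle.write("nagios ftp probe\n")
        script = build_lftp_script(local_file, remote_dir)
        try:
            result = subprocess.run(
                ["lftp", "-u", f"{user},{password}", "-e", script, host],
                capture_output=True, text=True, timeout=timeout,
            )
        except FileNotFoundError:
            return 3, "UNKNOWN - lftp is not installed"
        except subprocess.TimeoutExpired:
            return 2, f"CRITICAL - FTP check timed out after {timeout}s"
    if result.returncode != 0:
        return 2, "CRITICAL - " + (result.stderr.strip() or "lftp failed")
    return 0, "OK - list, put, get and delete all succeeded"
```

Because every step runs in a single lftp session with `cmd:fail-exit` enabled, a permission or SELinux problem on any one operation makes the whole check fail.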

7Mar/100

Check that any network filesystem partition is correctly mounted

A colleague of mine, Thomas Blanchin, has improved my Nagios check for mounted glusterfs partitions so that it works with any network file system. It generates the proper output for each of them without much trouble.
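One way to check that a network file system is mounted is to look the mount point up in /proc/mounts. The sketch below is only an illustration of that idea, not Thomas's plugin; the function names and the set of file system types treated as "network" are assumptions that you would adapt to your environment.

```python
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

# Assumed list of network file system types as they appear in /proc/mounts.
NETWORK_FS_TYPES = {"nfs", "nfs4", "cifs", "fuse.glusterfs"}


def find_mount(mounts_text, mount_point):
    """Return (device, fstype) for mount_point, or None if it is not mounted.

    Each /proc/mounts line is: device mount_point fstype options dump pass.
    The last matching line wins, since later mounts shadow earlier ones.
    """
    found = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[1] == mount_point:
            found = (fields[0], fields[2])
    return found


def check_mounted(mounts_text, mount_point, allowed_types=NETWORK_FS_TYPES):
    """Return a (Nagios exit code, message) pair for the mount point."""
    mount = find_mount(mounts_text, mount_point)
    if mount is None:
        return CRITICAL, f"CRITICAL - {mount_point} is not mounted"
    device, fstype = mount
    if fstype not in allowed_types:
        return CRITICAL, f"CRITICAL - {mount_point} is mounted as {fstype}, not a network file system"
    return OK, f"OK - {mount_point} mounted from {device} ({fstype})"


def run_check(mount_point):
    """Hypothetical entry point that reads the live mount table."""
    with open("/proc/mounts") as handle:
        return check_mounted(handle.read(), mount_point)
```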

17Feb/100

Check that a glusterfs partition is mounted

When using glusterfs in a production system, it is mandatory to monitor that the partition is mounted and performing well, especially in heavily loaded environments.

I have created a nagios plugin in bash that monitors a glusterfs mounted partition and detects whether the partition gets unmounted, responds slowly or gets disconnected from the server (causing reading processes to die in an uninterruptible sleep state, which will force you to restart the system in order to get rid of them).
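The plugin itself is written in bash. Purely as an illustration of the "responds slowly or hangs" part, here is a sketch in Python with hypothetical names and thresholds. The probe runs in a child process so that, if the mount is hung and the child ends up in uninterruptible sleep, the check itself can still give up and report the problem instead of blocking.

```python
import subprocess
import time

OK, WARNING, CRITICAL = 0, 1, 2


def probe(command, warn_seconds, timeout_seconds):
    """Run a probe command against the mount and time it.

    Returns a (Nagios exit code, message) pair.
    """
    start = time.monotonic()
    proc = subprocess.Popen(
        command, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    try:
        returncode = proc.wait(timeout=timeout_seconds)
    except subprocess.TimeoutExpired:
        # Best effort only: a child stuck in uninterruptible sleep cannot be
        # killed, so we deliberately do not wait for it again.
        proc.kill()
        return CRITICAL, f"CRITICAL - no response within {timeout_seconds}s"
    elapsed = time.monotonic() - start
    if returncode != 0:
        return CRITICAL, f"CRITICAL - probe failed with exit code {returncode}"
    if elapsed > warn_seconds:
        return WARNING, f"WARNING - slow response ({elapsed:.2f}s)"
    return OK, f"OK - responded in {elapsed:.2f}s"
```

For a partition mounted on, say, /mnt/gluster (an example path), the call could be `probe(["stat", "/mnt/gluster"], warn_seconds=1.0, timeout_seconds=5.0)`.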

17Feb/100

Check the percentage of CPU consumed by processes with the same name during a certain interval

Many Nagios scripts use ps to compute the percentage of CPU consumed by a process. Although at first glance this might seem a good approach, if you read the ps documentation carefully, you will notice this:

CPU usage is currently expressed as the percentage of time spent running during the entire lifetime of a process. This is not ideal, and it does not conform to the standards that ps otherwise conforms to. CPU usage is unlikely to add up to exactly 100%.

This means that ps is useless if you need to know whether a certain process is consuming a lot of CPU during a given interval. For instance, imagine that you want to detect whether a given process has hung and is consuming lots of CPU; a long-running process that only recently started spinning will still show a low lifetime average, so ps will not reveal it.

In order to work around this and provide a proper monitoring solution for this type of problem, I have written a script in Python that calls top. Unlike ps, top reports the percentage of CPU used during a given interval, not over the whole lifetime of the process.
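The approach can be sketched like this. It is not the actual script: the function names are hypothetical, and it assumes the default column layout of procps top in batch mode, with a header line starting with PID and containing %CPU and COMMAND. top is asked for two iterations and only the last one is used, since that is the one covering the requested interval.

```python
import subprocess


def cpu_percent_by_name(top_output, process_name):
    """Sum %CPU over all processes called process_name in the last iteration."""
    lines = top_output.splitlines()
    headers = [i for i, line in enumerate(lines) if line.split()[:1] == ["PID"]]
    if not headers:
        raise ValueError("no process table found in top output")
    start = headers[-1]                      # header of the last iteration
    columns = lines[start].split()
    cpu_col = columns.index("%CPU")
    cmd_col = columns.index("COMMAND")
    total = 0.0
    for line in lines[start + 1:]:
        fields = line.split()
        if len(fields) <= cmd_col:
            continue                         # blank or truncated line
        if fields[cmd_col] == process_name:
            # Some locales print a decimal comma.
            total += float(fields[cpu_col].replace(",", "."))
    return total


def measure(process_name, interval_seconds=5):
    """Hypothetical entry point: run top in batch mode for two iterations."""
    result = subprocess.run(
        ["top", "-b", "-n", "2", "-d", str(interval_seconds)],
        capture_output=True, text=True,
    )
    return cpu_percent_by_name(result.stdout, process_name)
```

The total can exceed 100% when several processes share the name or when the machine has more than one core, so the warning and critical thresholds need to be chosen with that in mind.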