Nov 2009

Nagios Meta Check Script in Perl Part 1: Generic and Core Functions

It has been quite some time since my work environment (at least as of this writing) and my hobbyist/home coding and administration life had much crossover. For once I have found something I wrote at my job that translates directly into something other admins might find useful: a script that performs and reports multiple checks at once. This first part of the series looks at the motivation, the helper functions and some core generic functions of the script.

Background

I provision a lot of systems: on average almost one per week, although the actual frequency varies. Some of these systems are virtual machines, some are cloned virtual machines, some are physical servers and, in rare instances, some are just devices of some type (or really dumb servers). Part of the post-installation setup is adding systems to the appropriate Nagios monitors. After doing this for over a year I saw a pattern emerge. Every system needed the same set of common checks:

  • Check the system time against a known good source.
  • Check the load average.
  • Check local disk usage.

Of course, depending upon one's environment, that list may need more or fewer checks.

The Solution(s)

There are many solutions to this sort of problem. One could, for instance, keep a config file for every host, drop in the new values and have all of the config files sourced from one directory. The only quick solution for my configuration at the time was to wrap multiple checks into one. I looked at how I might do so within Nagios and determined I was too lazy to figure out whether some sort of dependency relationship could be used. Since I am so lazy I looked for a plugin to do this, but the one I found needed an agent, and again, being lazy, I did not want to install agents unless there was an absolute need. The answer (as usual) became obvious: write a wrapper.

The script presented here is a first draft and there are known bugs. The idea behind this text, however, is to present an approach that other admins can adopt to formulate their own similar script and be more efficient.

Starting Bits

First up are some signals to trap and some variables. I commented the code heavily to avoid having to explain the details in the text:

# Signals we are interested in dealing with, the right operand is the 
# subroutine which handles the given interrupt type
$SIG{'INT' } = 'interrupt';
$SIG{'HUP' } = 'interrupt';
$SIG{'ABRT'} = 'interrupt';
$SIG{'QUIT'} = 'interrupt';
$SIG{'TRAP'} = 'interrupt';
# Note: SIGSTOP (like SIGKILL) cannot be caught, so it is not trapped here

# Globals
my $USER1="/usr/local/nagios/libexec"; # Be consistent wrt Nagios
my $CHECK="HEALTH"; # the name of the check; feel free to change
my $OUTFILE = "/var/tmp/healthcheck.tmp"; # an outfile for later use

# Where we store cherry picked results; init these to a space in case they
# are not all collected
my @LOAD_VALUES = " ";
my @SYSTIME_VALUE = " ";
my @ROOTDISK_VALUE = " ";
# Default values for LOAD, ROOTDISK Usage
my $DEF_LOAD_WARN = "4,2,2";
my $DEF_LOAD_CRIT = "5,4,3";
my $DEF_DISK_WARN = 95;
my $DEF_DISK_CRIT = 98;
my $DEF_SNMP_COMMUNITY = "public";

my $STATUS = 0; # A status var to be returned to nagios

# Flags
my $DNS = 1;  # do check that this host has a DNS entry
my $PING = 0; # don't preping by default since nagios does, switch to 1 if
              # you want to preping before bothering with the rest

# Brain dead interrupt handler
sub interrupt { # usage: interrupt 'sig'
    my ($sig) = @_;
    die "Caught SIG$sig, exiting\n";
}


# Generic sub: Load a file into an array and send the array back
sub load_file {
    my ($file) = @_;

    # three-argument open with a lexical handle, read-only
    open(my $fh, '<', $file) or die "Unable to open file $file: $!\n";
    my @flist = <$fh>;
    close($fh);

    return @flist;
}

So far so good: we set up our LOAD and DISK parameters in addition to arrays to capture returned results. We are relying upon SNMP checks for these, but note the script could be modified to use SSH etc.
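Since the thresholds are plain strings, a typo in them could go unnoticed until a check misbehaves. The following is a minimal sketch of a sanity check, not part of the original script: `thresholds_sane` is a hypothetical helper, and it assumes each warning value is meant to be lower than the critical value in the same position.

```perl
use strict;
use warnings;

# Hypothetical helper: return 1 if every warning value is below its
# matching critical value, 0 otherwise. Both arguments are
# comma-separated strings such as "4,2,2" (a single number also works).
sub thresholds_sane {
    my ($warn, $crit) = @_;
    my @w = split /,/, $warn;
    my @c = split /,/, $crit;

    # both lists need the same, non-zero number of fields
    return 0 unless @w && @w == @c;

    for my $i (0 .. $#w) {
        return 0 unless $w[$i] < $c[$i];
    }
    return 1;
}

# The load defaults from the globals above pass the check
print thresholds_sane("4,2,2", "5,4,3") ? "ok\n" : "bad thresholds\n";
```

The same helper works for the disk percentages, since a single value is just a list with one field.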

Core Generic Functions

Now it is time to move on to the functions that do the work. First up is a generic return parser that constructs what will be sent back to Nagios:

# Handle results status and print a final message with values of collated data
sub check_exit { # usage: check_exit("message string",RETVAL)
    my ($msg,$ret) = @_;

    # determine our status and exit appropriately
    if ($ret >= 3) {
        print "$CHECK UNKNOWN: $msg ";
    } elsif ($ret == 2) {
        print "$CHECK CRIT: $msg ";
    } elsif ($ret == 1) {
        print "$CHECK WARN: $msg ";
    } elsif ($ret == 0) {
        print "$CHECK OK: $msg ";
    } else{
        print "$CHECK UNKNOWN STATE: $msg ";
    }

    # print what we collected - note if one fails we do not collect the rest
    chomp (@SYSTIME_VALUE);
    chomp (@LOAD_VALUES);
    chomp (@ROOTDISK_VALUE);
    print("@SYSTIME_VALUE, System Load @LOAD_VALUES, Rootdisk @ROOTDISK_VALUE\n");
    unlink($OUTFILE); # delete the temp file for good
    exit ($ret);      # exit appropriately so nagios knows what to do
}

The exit function uses the return number to determine the warning level (if any) and passes along an optional message string. Note that regardless of which check failed, the function prints all of the data collected up to that point. The idea (when it was written) was that related values help with diagnosis: if disk is low it might be causing a high load, etc. The next function greps for CRITICAL or WARN in the output file; this is because the script actually calls other Nagios checks, which leave their status string in the output file:

# Check the outfile in some cases for a SNMP warn or critical
# send back the appropriate signal for nagios 
sub check_outfile { # usage: check_outfile
    my @critical = `grep CRITICAL $OUTFILE`;
    if (@critical) {
        return 2;
    }

    my @warn = `grep WARN $OUTFILE`;
    if (@warn) {
        return 1;
    }

    return 0;
}
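The same status logic can be expressed without shelling out to grep. The following is a minimal sketch rather than the script's actual code: `scan_status` and `status_label` are hypothetical names, and it assumes the wrapped plugins write the words CRITICAL or WARN into their output as described above.

```perl
use strict;
use warnings;

# Hypothetical helper: scan plugin output lines and return the worst
# Nagios status found (2 = critical, 1 = warning, 0 = ok).
sub scan_status {
    my (@lines) = @_;
    my $worst = 0;
    for my $line (@lines) {
        return 2 if $line =~ /CRITICAL/;   # nothing is worse, stop early
        $worst = 1 if $line =~ /WARN/;
    }
    return $worst;
}

# Hypothetical helper: map a numeric status to the label that
# check_exit prints.
sub status_label {
    my ($ret) = @_;
    return 'UNKNOWN STATE' if $ret < 0;
    my @labels = ('OK', 'WARN', 'CRIT');
    return $ret < @labels ? $labels[$ret] : 'UNKNOWN';
}

# Made-up sample output lines, for illustration only
my @output = ("LOAD OK - load average: 0.10\n", "DISK WARNING - / at 96%\n");
print status_label(scan_status(@output)), "\n";   # prints "WARN"
```

Keeping the scan in Perl avoids two extra processes per check, and the lines could come from load_file($OUTFILE) instead of the sample array.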

Last, at least for part one, is the usage message. Note that not all of the capabilities have been scripted yet, so this is (kind of) a look ahead at the next text:

# ye olde usage message
sub usage {
    print "Usage: $0 [-u] | -H  [ -lw  -lc  -dw  -dc ]\n";
    print "Usage: $0 [--nodns] [--noping] [--snmp \"community [user] [pass]\"]\n";
    print "Options:\n";
    print " -H       Check system called  (required)\n";
    print " -lw     Set load warning values\n";
    print "                Default: $DEF_LOAD_WARN\n";
    print " -lc     Set load critical values\n";
    print "                Default: $DEF_LOAD_CRIT\n";
    print " -dw     Set rootdisk warning percent\n";
    print "                Default: $DEF_DISK_WARN\n";
    print " -dc     Set rootdisk critical percent\n";
    print "                Default: $DEF_DISK_CRIT\n";
    print " --nodns        Do not check for DNS resolution\n";
    print " --noping       Do not preping to make sure the host is up\n";
    print "                Note: this will improve performance\n";
    print " --snmp   Set SNMP community name\n";
    print "                Default: $DEF_SNMP_COMMUNITY\n";
    print " -u             Print usage message and exit\n";
}
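One way the options in the usage message might be parsed is with the core Getopt::Long module. This is only a sketch under assumptions and not necessarily how the script itself handles its arguments: `parse_args`, the hash key names and the host name `web01` are all hypothetical, and the defaults are copied from the globals shown earlier.

```perl
use strict;
use warnings;
use Getopt::Long qw(GetOptionsFromArray);

# Hypothetical parser for the options listed in the usage message.
# Returns a hash reference of settings, or nothing on a parse error.
sub parse_args {
    my (@argv) = @_;
    my %opt = (
        load_warn => "4,2,2",    # defaults copied from the globals
        load_crit => "5,4,3",
        disk_warn => 95,
        disk_crit => 98,
        snmp      => "public",
        dns       => 1,
        ping      => 0,
    );
    GetOptionsFromArray(
        \@argv,
        'H=s'    => \$opt{host},
        'lw=s'   => \$opt{load_warn},
        'lc=s'   => \$opt{load_crit},
        'dw=i'   => \$opt{disk_warn},
        'dc=i'   => \$opt{disk_crit},
        'snmp=s' => \$opt{snmp},
        'dns!'   => \$opt{dns},     # "!" makes --nodns set this to 0
        'ping!'  => \$opt{ping},    # likewise for --noping
        'u'      => \$opt{usage},
    ) or return;
    return \%opt;
}

my $opt = parse_args(qw(-H web01 -lw 3,2,1 --nodns));
print "$opt->{host} $opt->{load_warn} dns=$opt->{dns}\n";   # prints "web01 3,2,1 dns=0"
```

Getopt::Long accepts long option names behind a single dash by default, which is why -lw and -dw work here without extra configuration.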

Summary

The first part of the script is done. Covered so far: the helpers, some simple core generic functions and of course the usage message. The next text covers some internal (optional) pre-checks, the calls to the actual checks themselves and finally the main part of the script.