Nov 2008

Writing Node Diagnostics

Managing and monitoring medium-sized installations of 300-400 Unix systems is by no means an easy task. A number of open source tools, some heavyweight and some relatively lightweight, can perform process, service and network monitoring; Hobbit and Nagios are two examples. Sometimes a tool like Hobbit is more than is needed, or it requires additional checks to perform the diagnostics a site wants. This text covers two aspects of writing and running diagnostics:

  • A single callout script from a central system (such as a Hobbit or Nagios server).
  • Differing methods of delivering and executing diagnostics.

Test Cases

In the example three test cases are used:

  • Check whether a shared user filesystem is readable and writable globally (0777); if it is not, correct it.
  • Make sure DNS resolution works.
  • See if all NFS mounts that are supposed to be mounted are in fact mounted, or if one is mounted multiple times.

For the time being, the test cases are written as single functions that return a result number. Usually this can just be $?, but as shown later this may not always work.

Permissions Check

chkperms()
{
        local dirs2chk="/usr/local/global_scratch"
        local dir perms

        for dir in $dirs2chk ; do
                # The first "Access" line of stat output holds the mode,
                # for example (0777/drwxrwxrwx)
                perms=$(stat "$dir" | grep Access | head -n 1 |
                        awk '{ print $2 }' | grep 777)

                if [ $? -gt 0 ]; then
                        echo "filesystem: Permissions not set correctly on ${dir}"
                        chmod 0777 "$dir" ||
                                echo "Error! Cannot set 0777 on ${dir}"
                fi
        done

        return 0
}
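A variation on the check above, offered only as a sketch: where GNU coreutils stat is available, its -c '%a' format prints the octal mode directly, which avoids parsing the human-readable output. The helper name perms_are is hypothetical and is not part of the original scripts.

```shell
#!/bin/bash
# Sketch only: assumes GNU stat (coreutils), where -c '%a' prints the octal mode.
# perms_are is a hypothetical helper, not part of the original scripts.
perms_are()
{
        local want="$1" dir="$2" mode

        mode=$(stat -c '%a' "$dir") || return 2   # stat itself failed
        [ "$mode" = "$want" ]                      # 0 if the mode matches exactly
}

# Possible use (not run here):
#   perms_are 777 /usr/local/global_scratch || echo "needs fixing"
```

Unlike the grep-based test, an exact comparison would treat a mode such as 1777 (sticky bit set) as a mismatch, which may or may not be what a site wants.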

Checking DNS

chkdns()
{
        local test_entry="www.foo.com"

        host $test_entry

        if [ "$?" -gt 0 ]; then
                return 1
        fi

        return 0
}

One might wonder why the function does not simply return $?. The reason is simple: if the caller is remote, $? may be overwritten on the calling system before it is examined.
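The same effect can be seen on a single system: $? always reflects the most recent command, so any command that runs after the test replaces it. The following is a small illustration, not taken from the original scripts; the function name check is hypothetical.

```shell
#!/bin/bash
# Illustration: $? reflects the most recent command, so it has to be
# captured immediately if it is needed later.
check()
{
        false                   # stands in for a failing test command
        local rc=$?             # capture the status right away
        echo "check finished"   # this echo resets $? to 0
        return $rc              # return the saved status, not the echo's
}

check || echo "check failed with status $?"
```

Returning an explicit value, as chkdns does, sidesteps the problem in the same way.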

Checking NFS

chknfs()
{
        nfs_fstab=$(grep "^[^#].*nfs" /etc/fstab |awk '{print $2}' 2>/dev/null)
        nfs_mount=$(mount | grep nfs | awk '{print $3}' 2>/dev/null)

        for i in $nfs_fstab; do
                matches=0
                for j in $nfs_mount; do
                        if [ "$i" = "$j" ]; then
                                matches=$(($matches+1))
                        fi
                done

                if [ "$matches" -eq 0 ]; then
                        echo "nfs: ${i} not mounted"
                        return 1
                elif  [ "$matches" -gt 1 ]; then
                        echo "nfs: ${i} is mounted multiple times"
                        return 1
                fi
        done

        return 0
}
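The inner loop above counts how often a mount point appears in the mount list. As a sketch of an alternative, grep can do the counting; the helper name count_mounts and the sample data are hypothetical.

```shell
#!/bin/bash
# Sketch: count how many times a mount point appears in a list of mount
# points read from standard input. count_mounts is a hypothetical helper.
count_mounts()
{
        # -x: whole-line match, -F: fixed string, -c: print the match count.
        # grep exits non-zero when the count is 0, so the status is discarded.
        grep -c -x -F -- "$1" || true
}

# Sample data standing in for the third column of `mount` output
sample="/home
/data
/home"

printf '%s\n' "$sample" | count_mounts /home
printf '%s\n' "$sample" | count_mounts /scratch
```

A count of 0 corresponds to "not mounted" and a count above 1 to "mounted multiple times", matching the two error branches in chknfs.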

Test Delivery

Several methods of delivering and executing the tests exist:

  1. Wrap all of the commands in the functions in SSH.
  2. Mount a common directory to all of the systems using NFS or similar and execute.
  3. Install the scripts in each system then execute.
  4. Deliver the scripts via scp, execute, then remove them.

Option one is the most work and should be avoided; it also generates network traffic for every command. Options two and three have a common point: the scripts are already on a filesystem the host can read. To make things interesting, method four will be used, with the understanding that the only real difference between method four and methods two and three is the delivery. The quickest delivery method is to have the files either packed up in one file or collected in one directory. Since a single connection is used either way, both approaches work.
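The "packed up in one file" approach might look like the following sketch. It assumes tar is available on both ends; the helper names pack_diags and unpack_diags, the host name and the /tmp/diags.tar path are hypothetical.

```shell
#!/bin/bash
# Sketch: pack a diagnostics directory into a single archive for delivery.
# pack_diags and unpack_diags are hypothetical helper names.
pack_diags()
{
        # $1 = directory holding the diagnostics, $2 = archive to create
        tar -C "$1" -cf "$2" .
}

unpack_diags()
{
        # $1 = archive, $2 = directory to unpack into (created if missing)
        mkdir -p "$2" && tar -C "$2" -xf "$1"
}

# Possible use (not run here): ship the archive with one scp, unpack remotely
#   pack_diags /var/adm/remote-diags /tmp/diags.tar
#   scp /tmp/diags.tar somehost:/var/tmp/diags.tar
```

The rest of this text copies the directory itself instead, which needs no unpacking step on the remote host.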

Callout Wrapper

The calling script needs to either contain all of the tests or source the test files, each of which then calls its own function. Sourcing works well; however, each test file has to end with a call to its own function, and to make the files easy to read in, a common extension is used (although any simple pattern would do). For example, the DNS test:

#!/bin/bash
# dns.diag

chkdns()
{
        local test_entry="www.foo.com"

        host $test_entry

        if [ "$?" -gt 0 ]; then
                return 1
        fi

        return 0
}

chkdns

The callout script can reside in the same directory as long as it does not have the .diag extension. All of the tests can now be called with a simple loop:

...
cd path_to_test_scripts
for test_script in *.diag ; do
        ./$test_script
done
...

Now it is time to flesh out the calling script. Remember that this script runs on the system being checked. There is still one more script to write: the one that actually copies over the diagnostics.

#!/bin/sh
# rundiags.sh - Run local diagnostic checks
PROGRAM=${0##*/}
TOPPID=$$
DIAGS=$1 # We pass in the location of the directory as a variable

printf '%s: ' "$(hostname)"

cd "$DIAGS" || exit 1
for d in *.diag; do
        chmod 0755 "$DIAGS/$d"
        "$DIAGS/$d"
        if [ $? -gt 0 ]; then
                echo "Error found on ${d}"
        fi
done

echo " "

exit 0
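As written, rundiags.sh always exits 0, so a caller cannot tell from the exit status whether any check failed. One possible variation, sketched here with a hypothetical function name run_diags, counts the failures and reflects them in its return status. It runs each file through sh so that the execute bit is not required, which is an assumption that differs from the chmod-and-execute approach above.

```shell
#!/bin/sh
# Sketch: run every *.diag file in a directory and report which ones failed.
# run_diags is a hypothetical function, not part of the original scripts.
run_diags()
{
        failed=0
        for d in "$1"/*.diag; do
                [ -f "$d" ] || continue      # the glob matched nothing
                if ! sh "$d"; then
                        echo "Error found on ${d##*/}"
                        failed=$((failed + 1))
                fi
        done
        [ "$failed" -eq 0 ]                  # status 0 only if every check passed
}

# Possible use (not run here):
#   run_diags /var/tmp/diags || echo "at least one diagnostic failed"
```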

Delivery and Remote Execution Script

The delivery script has to do the lion's share of work. It needs to do the following:

  1. Make sure the host is accessible.
  2. Copy up the diagnostic scripts directory.
  3. Remotely execute rundiags.sh.
  4. Remove the diagnostic scripts directory from the host.

#!/bin/bash
# remote_diags.sh - Deliver, execute and remove node diagnosis script(s)
# usage           - remote_diags "host1 host2 host3 ..."

trap interrupt INT

PROGRAM=${0##*/}
TOPPID=$$
DIAGS_LOCAL="/var/adm/remote-diags" # Where we have dns.diag, nfs.diag, 
                                    # rundiags.sh and perms.diag
DIAGS_REMOTE="/var/tmp/diags"       # Temporary spot on remote host

bomb()
{

        cat >&2 <<ERRORMESSAGE

ERROR: $@
*** ${PROGRAM} aborted ***
ERRORMESSAGE
    kill ${TOPPID}      # in case we were invoked from a subshell
    exit 1
}

interrupt()
{
        echo "Trapped keyboard interrupt - exiting."
        exit 0
}       

isalive()
{
        host=$1

        # Return 0 if we can log in and run a trivial command, 1 otherwise
        if ! ssh "$host" exit; then
                echo "cannot connect to ${host} - skipping"
                return 1
        fi

        return 0
}

for remote_host in ${1} ; do
        isalive "$remote_host" || continue
        scp -r "$DIAGS_LOCAL" "$remote_host:$DIAGS_REMOTE" ||
                bomb "Could not scp ${DIAGS_LOCAL} to ${remote_host}:${DIAGS_REMOTE}"
        ssh "$remote_host" "sh $DIAGS_REMOTE/rundiags.sh $DIAGS_REMOTE" ||
                bomb "Could not remote execute ${DIAGS_REMOTE}/rundiags.sh"
        ssh "$remote_host" "rm -rf $DIAGS_REMOTE" ||
                bomb "Could not rm ${DIAGS_REMOTE}"
done

exit 0
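When the script runs unattended against a few hundred hosts, a reachability check that waits for a password prompt or a dead host can stall the whole run. The sketch below assumes OpenSSH, whose BatchMode and ConnectTimeout options disable prompting and bound the connection attempt. The SSH_CMD variable is a hypothetical override added only so the function can be tried without a real host, and the 5 second timeout is an arbitrary choice.

```shell
#!/bin/bash
# Sketch: a reachability check that cannot hang on a password prompt.
# Assumes OpenSSH; BatchMode and ConnectTimeout are ssh_config options.
# SSH_CMD is a hypothetical override that makes the function easy to try out.
SSH_CMD=${SSH_CMD:-ssh}

isalive()
{
        local host="$1"

        if ! "$SSH_CMD" -o BatchMode=yes -o ConnectTimeout=5 "$host" exit; then
                echo "cannot connect to ${host} - skipping" >&2
                return 1
        fi
        return 0
}

# Possible use inside the host loop (not run here):
#   isalive "$remote_host" || continue
```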

Final Versions

Below is a full revised listing of the example tests, the calling script and the remote execution script, for convenience:

Diagnostic Scripts in /var/adm/remote-diags

#!/bin/bash
# chkperms.diag
chkperms()
{
    local dirs2chk="/usr/local/global_scratch"
    local dir perms

    for dir in $dirs2chk ; do
        # The first "Access" line of stat output holds the mode,
        # for example (0777/drwxrwxrwx)
        perms=$(stat "$dir" | grep Access | head -n 1 |
            awk '{ print $2 }' | grep 777)

        if [ $? -gt 0 ]; then
            echo "filesystem: Permissions not set correctly on ${dir}"
            chmod 0777 "$dir" ||
                echo "Error! Cannot set 0777 on ${dir}"
        fi
    done

    return 0
}

chkperms

#!/bin/bash
# chkdns.diag
chkdns()
{
    local test_entry="www.foo.com"

    host $test_entry

    if [ "$?" -gt 0 ]; then
        return 1
    fi

    return 0
}

chkdns

#!/bin/bash
# chknfs.diag
chknfs()
{
    nfs_fstab=$(grep "^[^#].*nfs" /etc/fstab |awk '{print $2}' 2>/dev/null)
    nfs_mount=$(mount | grep nfs | awk '{print $3}' 2>/dev/null)

    for i in $nfs_fstab; do
        matches=0
        for j in $nfs_mount; do
            if [ "$i" = "$j" ]; then
                matches=$(($matches+1))
            fi
        done

        if [ "$matches" -eq 0 ]; then
            echo "nfs: ${i} not mounted"
            return 1
        elif [ "$matches" -gt 1 ]; then
            echo "nfs: ${i} is mounted multiple times"
            return 1
        fi
    done

    return 0
}

chknfs

#!/bin/sh
# rundiags.sh - Run local diagnostic checks
PROGRAM=${0##*/}
TOPPID=$$
DIAGS=$1 # We pass in the location of the directory as a variable

printf '%s: ' "$(hostname)"

cd "$DIAGS" || exit 1
for d in *.diag; do
    chmod 0755 "$DIAGS/$d"
    "$DIAGS/$d"
    if [ $? -gt 0 ]; then
        echo "Error found on ${d}"
    fi
done

echo " "

exit 0

The Script to be Run from the Monitoring Host

#!/bin/bash
# remote_diags.sh - Deliver, execute and remove node diagnosis script(s)
# usage           - remote_diags "host1 host2 host3 ..."

trap interrupt INT

PROGRAM=${0##*/}
TOPPID=$$
DIAGS_LOCAL="/var/adm/remote-diags" # Where we have dns.diag, nfs.diag,
                                    # rundiags.sh and perms.diag
DIAGS_REMOTE="/var/tmp/diags"       # Temporary spot on remote host

bomb()
{

    cat >&2 <<ERRORMESSAGE

ERROR: $@
*** ${PROGRAM} aborted ***
ERRORMESSAGE
    kill ${TOPPID}      # in case we were invoked from a subshell
    exit 1
}

interrupt()
{
    echo "Trapped keyboard interrupt - exiting."
    exit 0
}

isalive()
{
    host=$1

    # Return 0 if we can log in and run a trivial command, 1 otherwise
    if ! ssh "$host" exit; then
        echo "cannot connect to ${host} - skipping"
        return 1
    fi

    return 0
}

for remote_host in ${1} ; do
    isalive "$remote_host" || continue
    scp -r "$DIAGS_LOCAL" "$remote_host:$DIAGS_REMOTE" ||
        bomb "Could not scp ${DIAGS_LOCAL} to ${remote_host}:${DIAGS_REMOTE}"
    ssh "$remote_host" "sh $DIAGS_REMOTE/rundiags.sh $DIAGS_REMOTE" ||
        bomb "Could not remote execute ${DIAGS_REMOTE}/rundiags.sh"
    ssh "$remote_host" "rm -rf $DIAGS_REMOTE" ||
        bomb "Could not rm ${DIAGS_REMOTE}"
done

exit 0

Summary

The examples are rudimentary at best; however, they should serve as a good starting point for setting up server checks that may not be easy to implement with an off-the-shelf product. It is worth noting that all of the examples can easily be integrated into products like Hobbit.