Nov 2008
Managing and monitoring medium-sized installations of 3-400 Unix systems is by no means an easy task. A number of Open Source tools, some heavy handed and some relatively lightweight, can perform process, service and network monitoring; Hobbit and Nagios are two examples. Sometimes tools like Hobbit are either not needed or require additional test cases to perform diagnostics. In this text two aspects of writing and running diagnostics are covered:
In the example three test cases are used:
The test cases, for the time being, are written as single functions that
return a result number (usually this can just be $?, but as will be seen
later this may not always work).
chkperms()
{
    local dirs2chk="/usr/local/global_scratch"
    local perms=""

    for dir in $dirs2chk ; do
        # With GNU stat the first "Access" line holds the mode, e.g. (0777/drwxrwxrwx)
        perms=$(stat "$dir" | grep Access | head -n 1 |
            awk '{ print $2}' | grep 777)
        if [ $? -gt 0 ]; then
            echo "filesystem: Permissions not set correctly on ${dir}"
            chmod 0777 "$dir" ||
                echo "Error! Cannot set 0777 on ${dir}"
        fi
    done

    return 0
}
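Parsing the human-readable output of stat is somewhat fragile, and grep 777 also matches modes such as 1777. A minimal sketch of a stricter check follows. It assumes GNU coreutils stat, where -c '%a' prints the octal mode (BSD stat uses different flags), and the function name has_mode is hypothetical:

```shell
#!/bin/bash
# has_mode DIR MODE - succeed only if DIR has exactly the octal mode MODE.
# Assumes GNU stat; has_mode is a hypothetical helper, not part of the original.
has_mode()
{
    local dir=$1 want=$2 have
    have=$(stat -c '%a' "$dir" 2>/dev/null) || return 2   # stat itself failed
    [ "$have" = "$want" ]
}

# Demonstration on a scratch directory
scratch=$(mktemp -d)
chmod 0755 "$scratch"
has_mode "$scratch" 777 || echo "mode is not 777"
chmod 0777 "$scratch"
has_mode "$scratch" 777 && echo "mode is 777"
rmdir "$scratch"
```

Comparing the whole mode string rather than searching for a substring means a directory with the sticky bit set (1777) is reported as different from 0777.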
chkdns()
{
    local test_entry="www.foo.com"

    host "$test_entry"
    if [ "$?" -gt 0 ]; then
        return 1
    fi

    return 0
}
One might wonder why the function does not simply return $?. The reason is
simple: if the caller is remote, $? may be overwritten at the calling system.
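The same kind of overwriting can be shown on a single machine: any command that runs between the test and the return resets $?. A minimal sketch, using the hypothetical functions fails, naive and explicit, which are not part of the diagnostics:

```shell
#!/bin/bash
# fails, naive and explicit are hypothetical illustrations.
fails() { return 3; }

naive()
{
    fails
    echo "checked" >/dev/null   # any command in between resets $?
    return $?                   # returns the status of echo (0), not 3
}

explicit()
{
    fails
    local rc=$?                 # capture the status immediately
    echo "checked" >/dev/null
    return "$rc"
}

naive && echo "naive reports success"
explicit || echo "explicit reports failure ($?)"
```

Returning a fixed value, as chkdns does, or capturing $? straight after the command both keep later commands from changing the result.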
chknfs()
{
    local nfs_fstab nfs_mount matches i j

    # Mount points of the NFS entries in fstab, skipping commented-out lines
    nfs_fstab=$(grep "^[^#].*nfs" /etc/fstab | awk '{print $2}' 2>/dev/null)
    # Mount points of the NFS filesystems that are currently mounted
    nfs_mount=$(mount | grep nfs | awk '{print $3}' 2>/dev/null)

    for i in $nfs_fstab; do
        matches=0
        for j in $nfs_mount; do
            if [ "$i" = "$j" ]; then
                matches=$((matches+1))
            fi
        done
        if [ "$matches" -eq 0 ]; then
            echo "nfs: ${i} not mounted"
            return 1
        elif [ "$matches" -gt 1 ]; then
            echo "nfs: ${i} is mounted multiple times"
            return 1
        fi
    done

    return 0
}
Several methods of delivering and executing the tests exist:
Option one is the most work and hence should be avoided. Option one also generates traffic. Options two and three actually have a common point: the scripts are on the local filesystem. To make things interesting, method four will be used, with the understanding that the only real difference between method four and methods two and three is the delivery. The quickest delivery method is to have the files either packed up in one file or in a directory. Since only a single connection is used, either method works.
The calling script will need to either contain all of the tests or source the test files, which then call themselves. Sourcing works well; however, each test file has to end with a call to its own function, and to make the files easy to read in, an extension is used (although any simple pattern could be used). For example, the DNS test:
#!/bin/bash
# dns.diag
chkdns()
{
    local test_entry="www.foo.com"

    host "$test_entry"
    if [ "$?" -gt 0 ]; then
        return 1
    fi

    return 0
}

chkdns
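If the same file should work both when it is sourced and when it is executed directly, one possible variation is to make the self-referencing call only when the file is the program being run. The sketch below assumes bash (BASH_SOURCE is not available in plain sh); the function chkexample and the temporary file are hypothetical stand-ins for a real test:

```shell
#!/bin/bash
# Write a hypothetical test file that guards its self-referencing call.
diag=$(mktemp)
cat > "$diag" <<'EOF'
#!/bin/bash
chkexample()
{
    echo "chkexample ran"
    return 0
}
# Run the test only when the file is executed, not when it is sourced
if [ "${BASH_SOURCE[0]}" = "$0" ]; then
    chkexample
fi
EOF

executed=$(bash "$diag")                  # executed: the guard fires
sourced=$(bash -c '. "$1"' _ "$diag")     # sourced: only the function is defined
echo "executed: '${executed}'"
echo "sourced:  '${sourced}'"
rm -f "$diag"
```

When executed, the file prints "chkexample ran"; when sourced, it prints nothing and leaves it to the caller to decide when to invoke chkexample.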
The callout script can reside in the same directory as long as it does not have the .diag extension. Now all of the tests can be called by simply doing:
...
cd path_to_test_scripts
for test_script in *.diag ; do
    ./$test_script
done
...
Now it is time to flesh out the calling script. Remember this script will be on the local system. There is still one more script to write; the script that actually copies over the diagnostics.
#!/bin/sh
# rundiags.sh - Run local diagnostic checks
PROGRAM=${0##*/}
TOPPID=$$
DIAGS=$1 # We pass in the location of the directory as a variable

printf '%s: ' "`hostname`"
cd "$DIAGS" || exit 1
for d in *.diag; do
    chmod 0755 "$DIAGS/$d"
    "$DIAGS/$d"
    if [ $? -gt 0 ]; then
        echo "Error found on ${d}"
    fi
done
echo " "

exit 0
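rundiags.sh always exits 0, so a caller cannot tell from the exit status alone whether any test failed. One possible variation, sketched here with a hypothetical function run_diags and two throwaway test files, counts the failures and hands the count back to the caller:

```shell
#!/bin/bash
# run_diags DIR - run every *.diag in DIR; return the number of failures.
# run_diags and the two sample tests below are hypothetical illustrations.
run_diags()
{
    local dir=$1 d failures=0
    for d in "$dir"/*.diag; do
        [ -e "$d" ] || continue          # no *.diag files: glob stayed literal
        if ! bash "$d"; then
            echo "Error found on ${d##*/}"
            failures=$((failures + 1))
        fi
    done
    return "$failures"
}

# Demonstration with one passing and one failing test
demo=$(mktemp -d)
printf 'exit 0\n' > "$demo/pass.diag"
printf 'exit 1\n' > "$demo/fail.diag"
failed=0
run_diags "$demo" || failed=$?
echo "${failed} test(s) failed"
rm -rf "$demo"
```

Because an exit status only holds values up to 255, this is suitable for a handful of tests per host; a larger suite would need to report the count on standard output instead.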
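The delivery script has to do the lion's share of work. For each host it needs to check that the host is reachable, copy the diagnostics over, execute rundiags.sh, and remove the diagnostics afterwards.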
#!/bin/bash
# remote_diags.sh - Deliver, execute and remove node diagnosis script(s)
# usage - remote_diags "host1 host2 host3 ..."
trap interrupt INT
PROGRAM=${0##*/}
TOPPID=$$
DIAGS_LOCAL="/var/adm/remote-diags" # Where we have dns.diag, nfs.diag,
                                    # rundiags.sh and perms.diag
DIAGS_REMOTE="/var/tmp/diags"       # Temporary spot on remote host

bomb()
{
    cat >&2 <<ERRORMESSAGE
ERROR: $@
*** ${PROGRAM} aborted ***
ERRORMESSAGE
    kill ${TOPPID} # in case we were invoked from a subshell
    exit 1
}

interrupt()
{
    echo "Trapped keyboard interrupt - exiting."
    exit 0
}

isalive()
{
    host=$1
    ssh "$host" exit
    if [ "$?" -gt 0 ]; then
        echo "cannot connect to ${host} - skipping"
        return 1
    fi
    return 0
}

for remote_host in ${1} ; do
    isalive "$remote_host" || continue
    scp -r "$DIAGS_LOCAL" "$remote_host:$DIAGS_REMOTE" ||
        bomb "Could not scp ${DIAGS_LOCAL} to ${remote_host}:${DIAGS_REMOTE}"
    # Run the calling script, passing it the directory it was copied to
    ssh "$remote_host" "sh $DIAGS_REMOTE/rundiags.sh $DIAGS_REMOTE" ||
        bomb "Could not remote execute ${DIAGS_REMOTE}/rundiags.sh"
    ssh "$remote_host" "rm -rf $DIAGS_REMOTE" ||
        bomb "Could not rm ${DIAGS_REMOTE}"
done

exit 0
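Before pointing a script like this at a few hundred hosts, it can help to print the commands it would run. A minimal dry-run sketch follows. The variable DRY_RUN and the functions run and diag_host are hypothetical additions, the remote command line is an assumption about how rundiags.sh is invoked, and with the default setting the scp and ssh commands are only echoed, never executed:

```shell
#!/bin/bash
DIAGS_LOCAL="/var/adm/remote-diags"
DIAGS_REMOTE="/var/tmp/diags"
DRY_RUN=${DRY_RUN:-yes}    # hypothetical switch; set to "no" to really run

# run CMD... - print the command in dry-run mode, otherwise execute it
run()
{
    if [ "$DRY_RUN" = "yes" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# diag_host HOST - deliver, execute and remove the diagnostics on one host
diag_host()
{
    local host=$1
    run scp -r "$DIAGS_LOCAL" "${host}:${DIAGS_REMOTE}" || return 1
    run ssh "$host" "sh ${DIAGS_REMOTE}/rundiags.sh ${DIAGS_REMOTE}" || return 1
    run ssh "$host" "rm -rf ${DIAGS_REMOTE}"
}

for h in host1 host2; do
    diag_host "$h"
done
```

Routing every remote command through one wrapper keeps the dry-run and the real run on the same code path, so what is printed is what would be executed.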
Below is a full revised listing of the example tests, the calling script and
the remote execution script for convenience:
/var/adm/remote-diags
#!/bin/bash
# chkperms.diag
chkperms()
{
    local dirs2chk="/usr/local/global_scratch"
    local perms=""

    for dir in $dirs2chk ; do
        # With GNU stat the first "Access" line holds the mode, e.g. (0777/drwxrwxrwx)
        perms=$(stat "$dir" | grep Access | head -n 1 |
            awk '{ print $2}' | grep 777)
        if [ $? -gt 0 ]; then
            echo "filesystem: Permissions not set correctly on ${dir}"
            chmod 0777 "$dir" ||
                echo "Error! Cannot set 0777 on ${dir}"
        fi
    done

    return 0
}

chkperms
#!/bin/bash
# chkdns.diag
chkdns()
{
    local test_entry="www.foo.com"

    host "$test_entry"
    if [ "$?" -gt 0 ]; then
        return 1
    fi

    return 0
}

chkdns
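#!/bin/bash
# chknfs.diag
chknfs()
{
    local nfs_fstab nfs_mount matches i j

    # Mount points of the NFS entries in fstab, skipping commented-out lines
    nfs_fstab=$(grep "^[^#].*nfs" /etc/fstab | awk '{print $2}' 2>/dev/null)
    # Mount points of the NFS filesystems that are currently mounted
    nfs_mount=$(mount | grep nfs | awk '{print $3}' 2>/dev/null)

    for i in $nfs_fstab; do
        matches=0
        for j in $nfs_mount; do
            if [ "$i" = "$j" ]; then
                matches=$((matches+1))
            fi
        done
        if [ "$matches" -eq 0 ]; then
            echo "nfs: ${i} not mounted"
            return 1
        fi
    done

    return 0
}

chknfs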
#!/bin/sh
# rundiags.sh - Run local diagnostic checks
PROGRAM=${0##*/}
TOPPID=$$
DIAGS=$1 # We pass in the location of the directory as a variable

printf '%s: ' "`hostname`"
cd "$DIAGS" || exit 1
for d in *.diag; do
    chmod 0755 "$DIAGS/$d"
    "$DIAGS/$d"
    if [ $? -gt 0 ]; then
        echo "Error found on ${d}"
    fi
done
echo " "

exit 0
#!/bin/bash
# remote_diags.sh - Deliver, execute and remove node diagnosis script(s)
# usage - remote_diags "host1 host2 host3 ..."
trap interrupt INT
PROGRAM=${0##*/}
TOPPID=$$
DIAGS_LOCAL="/var/adm/remote-diags" # Where we have dns.diag, nfs.diag,
                                    # rundiags.sh and perms.diag
DIAGS_REMOTE="/var/tmp/diags"       # Temporary spot on remote host

bomb()
{
    cat >&2 <<ERRORMESSAGE
ERROR: $@
*** ${PROGRAM} aborted ***
ERRORMESSAGE
    kill ${TOPPID} # in case we were invoked from a subshell
    exit 1
}

interrupt()
{
    echo "Trapped keyboard interrupt - exiting."
    exit 0
}

isalive()
{
    host=$1
    ssh "$host" exit
    if [ "$?" -gt 0 ]; then
        echo "cannot connect to ${host} - skipping"
        return 1
    fi
    return 0
}

for remote_host in ${1} ; do
    isalive "$remote_host" || continue
    scp -r "$DIAGS_LOCAL" "$remote_host:$DIAGS_REMOTE" ||
        bomb "Could not scp ${DIAGS_LOCAL} to ${remote_host}:${DIAGS_REMOTE}"
    # Run the calling script, passing it the directory it was copied to
    ssh "$remote_host" "sh $DIAGS_REMOTE/rundiags.sh $DIAGS_REMOTE" ||
        bomb "Could not remote execute ${DIAGS_REMOTE}/rundiags.sh"
    ssh "$remote_host" "rm -rf $DIAGS_REMOTE" ||
        bomb "Could not rm ${DIAGS_REMOTE}"
done

exit 0
The examples are rudimentary at best; however, they should serve as a good starting point for setting up server check processes that may not be easily accessible using an off-the-shelf product. It is worth noting that all of the examples can easily be integrated into products like Hobbit.