Dec 2014

Using the timeout Command

Ever had an automated secure copy hang on you? Or, better yet, how about a cron job that runs a secure copy once an hour and hangs every single time, while you happen to be out of the office for a couple of days? Sure, that never happens... well, it did, and the fix was, thankfully, relatively simple.

The Setup

Before getting into why I am not using something like puppet, let me be real clear: I don't need to. The use case is a cluster that network installs its nodes using a common kickstart and a (rather small, actually) post script which sets up everything each node needs. All of that being said, the only files that need to be kept in sync are the pwdb files, although this example will work for any files. Also, why not use a parallel copy? Because the version of pdcp that we have has no verbose mode... and we wants verbosity.

The Original Script

#!/bin/bash

source /etc/profile

# This shlib contains a function we get a list of nodes from
if [ -f ./sgc-shlib ]; then
  source ./sgc-shlib
else
  source /usr/local/sbin/sgc-shlib
fi

# use get_sgc_exec_hosts() for an ordered list
for n in `get_sgc_exec_hosts`; do
  for f in /etc/passwd /etc/shadow /etc/group /etc/gshadow ; do
    scp $f $n:$f
  done
done

Probably pretty easy to see where the glaring issue is: nothing limits how long a single scp is allowed to hang. So why did it take so long to notice? Oddly, on the previous cluster the regular ssh timeout was sufficient. On this one, for some reason, it was taking a lot longer. Plus, a wedged node can leave the connection half open, which ... is just no fun at all.

Another problem with this script is that it is not very verbose at all. So at first I slapped in an echo before the scp command which turned out to be a really stupid idea, something like this:

echo "copying $f to $n:$f"

Which didn't tell me squat and in fact lied: the message printed before scp even ran, whether or not the copy then worked.

Iterations

So pass one was simply adding the timeout command with no other changes:

    timeout 5 scp $f $n:$f

Well... that kind of helped, but it was not very explicit or verbose. So I did a little reading and, like any command, timeout has return values: it passes along the exit status of the command it ran (so 0 for okay) and returns 124 when the command timed out. There are a few others too (125 to 127 when timeout itself fails or the command cannot be run), but for my part all I cared about was: did it work one hundred percent? Also, timeout has no default duration, so you have to give it one, and the 5 seconds I started with didn't seem to work well for my network... I have no clear explanation as to why. So here is round 2:

for n in `get_sgc_exec_hosts`; do
  for f in /etc/passwd /etc/shadow /etc/group /etc/gshadow ; do
    /usr/bin/timeout 3s scp $f $n:$f
    if [ $? -ne 0 ]; then
      echo "Copy of $f to $n failed"
    else
      echo "$f ===> $n:$f"
    fi
  done
done

Works like a champ (I just so happened to have some busted nodes to use as a test case).
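
One thing round 2 glosses over is that it reports every nonzero status the same way. Because timeout returns 124 only when the time limit was hit, the two kinds of failure can be told apart. Here is a minimal sketch, assuming GNU timeout; run_with_timeout is a made-up helper name, not something from the real script:

```shell
#!/bin/bash
# run_with_timeout LIMIT COMMAND [ARG]...
# Hypothetical helper: runs COMMAND under timeout and says why it failed.
run_with_timeout()
{
  local limit=$1
  shift

  timeout "$limit" "$@"
  local rc=$?

  case $rc in
    0)   echo "ok" ;;
    124) echo "timed out after $limit" ;;
    *)   echo "failed with exit status $rc" ;;
  esac

  # Hand the original status back so the caller can still test it
  return $rc
}
```

In the loop, the scp line would then become something like run_with_timeout 3s scp $f $n:$f.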

Where else can this go?

I haven't done it in production yet, but another use case snuck up on me. It could be, and often is, that an exec node in the cluster is just plain too busy when a connection is requested. This is why, in hindsight, it is cool that I did not put a continue or break in the loop: if the cause is high load, maybe I want to take another shot at it. This scenario kind of opens a can of worms for two reasons:

  1. How should the timers and failure counts look?
  2. Why not check load via snmp (or some kind of pre flight check) first?

Let's take the easy one first, item 1: that is totally subjective. My environment is relatively small (only about 128 nodes), so our tolerance is probably a little higher. So we could use a sleep and try again:

scp_file()
{
  hst=$1
  file=$2

  /usr/bin/timeout 3s scp "$file" "$hst:$file"
  rc=$?
  if [ $rc -ne 0 ]; then
    echo "copy of $file to $hst failed, sleeping to try again"
    # The pause length is a guess; tune it for your cluster
    sleep 5
    /usr/bin/timeout 3s scp "$file" "$hst:$file"
    rc=$?
  fi

  # rc holds the status of the last attempt that ran
  if [ $rc -ne 0 ]; then
    echo "copy of $file to $hst failed"
  else
    echo "$file ===> $hst:$file"
  fi
}

for n in `get_sgc_exec_hosts`; do
  for f in /etc/passwd /etc/shadow /etc/group /etc/gshadow ; do
    scp_file $n $f
  done
done

There are many different ways to approach that; the untested example above is just one of them!
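
Another of those ways, sketched here with the same caveats, is to make the number of attempts and the pause between them parameters instead of hard coding a single retry. The name retry_with_timeout and its argument order are hypothetical; only timeout and sleep are real commands:

```shell
#!/bin/bash
# retry_with_timeout ATTEMPTS PAUSE LIMIT COMMAND [ARG]...
# Hypothetical helper: runs COMMAND under "timeout LIMIT" up to ATTEMPTS
# times, sleeping PAUSE seconds between tries.
retry_with_timeout()
{
  local attempts=$1 pause=$2 limit=$3
  shift 3

  local try rc=1
  for (( try = 1; try <= attempts; try++ )); do
    timeout "$limit" "$@"
    rc=$?
    if [ "$rc" -eq 0 ]; then
      return 0
    fi

    echo "attempt $try of $attempts failed (exit status $rc)" >&2

    # Pause before the next try, but not after the last one
    if [ "$try" -lt "$attempts" ]; then
      sleep "$pause"
    fi
  done

  # Give the caller the status of the final attempt (124 means timed out)
  return "$rc"
}
```

Calling retry_with_timeout 3 10 3s scp $f $n:$f would try up to three times, ten seconds apart. Those numbers are placeholders, not recommendations.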

As for a pre-flight check, if snmp is installed and an easy-to-use client-side parser for it is already available (say, a nagios callout), then sure, go for it. Still, a single timeout, or a timeout wrapped in a retry loop, should be sufficient for most cases. Again, this is all kind of fringe and your results may vary ... wildly.
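
If snmp is not handy, a cheap stand-in for a pre-flight check is to ask the node for its load average over ssh, under a timeout of its own. This is only a sketch, and it swaps ssh in for snmp: it assumes Linux nodes (it reads /proc/loadavg), key-based ssh logins, and awk on the machine running the script. The name preflight_ok and the load threshold are hypothetical:

```shell
#!/bin/bash
# preflight_ok HOST MAX_LOAD
# Hypothetical helper: succeeds only if HOST answers within the time limit
# and its 1-minute load average is at or below MAX_LOAD.
preflight_ok()
{
  local host=$1 max_load=$2 line load

  # BatchMode stops ssh from prompting for a password; ConnectTimeout
  # bounds the connection attempt and timeout bounds the whole call.
  line=$(timeout 3s ssh -o BatchMode=yes -o ConnectTimeout=2 \
         "$host" cat /proc/loadavg) || return 1

  # The first field of /proc/loadavg is the 1-minute load average
  load=${line%% *}
  [ -n "$load" ] || return 1

  # The shell has no floating point, so let awk compare the numbers
  awk -v l="$load" -v m="$max_load" 'BEGIN { exit !(l + 0 <= m + 0) }'
}
```

A node that fails the check could simply be skipped for this run, for example with preflight_ok $n 8 || continue at the top of the outer loop (the 8 is an arbitrary threshold).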