Aug 2009

Configuring High Availability Linux

Many Linux distributions ship with the heartbeat suite of software for setting up High Availability Linux. The Linux HA project has details and downloads for those who do not have it available on their system. This text addresses setting up a very simple HA Linux configuration using the plain configuration files rather than a GUI or the XML definition files.

Setup

The example setup has two servers that each run an apache webserver. Many other services can be assigned as well, along with shared data over NFS for example. For instance, if the failover were for an apache server whose htdocs sat on shared storage, it could look like so:

         mass storage/NFS/iSCSI/Fibre
             /                 \
server1:/mnt/htdocs     server2:/mnt/htdocs

Similarly a common mysql db backend could be available, or even more exotic tiered mysql dbs - basically whatever the needs are. What Linux HA can do is host a shared IP from any server in the cluster list. For demonstration purposes, however, each apache server's document root will have an index file containing the actual hostname of the system - what should be observed is that the index file contents change after a failover while the page remains accessible via the shared IP.

Following are the hostnames and ipv4 addresses that will be used:

  • 192.168.1.15 prime (sles11 webserver)
  • 192.168.1.16 calc (sles11 webserver)
  • 192.168.1.20 sigma (ha address)

Installation & Test Setup

SuSE, RedHat and Debian all support the heartbeat packages from the Linux HA project. Since the example uses SuSE 11, the syntax is:

yast -i heartbeat
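
On RedHat or Debian based systems the rough equivalent would be one of the following; exact package names and required repositories vary by release:

yum install heartbeat
apt-get install heartbeat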

A basic apache server for the test is required as well:

yast -i apache2

To illustrate the test, each webserver gets a simple page containing its own hostname, placed in /srv/www/htdocs/index.html:

<html><head></head><body>prime</body></html>
<html><head></head><body>calc</body></html>

Next, start up the webservers and set them to start at boot (run on both systems):

service apache2 start
chkconfig apache2 on

Now it is time to test the systems separately with lynx --dump:

# lynx --dump prime
   prime

# lynx --dump calc
   calc

Last but not least, the hosts must be able to resolve each other by name. Hosts file entries work fine for this.
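
For reference, the matching /etc/hosts entries on both systems would look like:

192.168.1.15    prime
192.168.1.16    calc
192.168.1.20    sigma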

HA Configuration

Configuring the HA services is relatively simple. This configuration is very basic; three files are needed:

  1. /etc/ha.d/ha.cf: logging, timing and communication options plus the cluster nodes
  2. /etc/ha.d/authkeys: shared keys file
  3. /etc/ha.d/haresources: resource definitions

ha.cf

For the example setup the ha.cf file looks like the following:

debugfile /var/log/ha-debug
logfile /var/log/ha-log
logfacility     local0
keepalive 2
deadtime 10
udpport 694
bcast     eth0
node    calc
node    prime

The above options are pretty straightforward: where the debug log and regular log files live, the syslog facility, the heartbeat interval in seconds (keepalive), how many seconds of silence before a node is declared dead (deadtime), which udp port to use, which interface to broadcast on, and finally the nodes in the cluster.

authkeys

The documentation explains the various options, but this example uses a single md5 key, so the authkeys file has the following in it:

auth 1
1 md5 86e07a217fcd61fb981872ec57b68845

The sum was generated by simply echoing a string and piping it to md5sum (see the example below). Also, the authkeys file must be readable only by root:

chmod 0600 authkeys
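
A sum like the one above can be produced as follows - the string itself is just a placeholder, any shared secret will do, and the same sum must be in authkeys on both nodes:

echo "some shared secret" | md5sum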

haresources

The haresources file dictates the shared address and the init services to start up (or shut down, as the case may be):

prime 192.168.1.20 apache2

The starting or primary server is put as the first argument. Now that the configuration is done on the primary server, the exact same settings can be used on the secondary one.
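
One simple way to do that is to copy the three files straight over, for example:

scp /etc/ha.d/ha.cf /etc/ha.d/authkeys /etc/ha.d/haresources calc:/etc/ha.d/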

Firing it Up

Starting up is pretty simple:

# chkconfig heartbeat on
# service heartbeat start
Starting High-Availability services2009/07/25_21:04:30 INFO:  \
        Resource is stopped
heartbeat[4071]: 2009/07/25_21:04:30 info: Version 2 support: false
heartbeat[4071]: 2009/07/25_21:04:30 info: **************************
heartbeat[4071]: 2009/07/25_21:04:30 info: \
        Configuration validated. Starting heartbeat 2.99.3

Now a litmus test of the shared address:

#  lynx --dump 192.168.1.20
   prime

Testing

Testing can be a little tricky; the simplest way is to stop the heartbeat service on the active node and let the other one take over. On prime:
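
# service heartbeat stop

Then observe the log entries on the calc node: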

IPaddr[5106]:   2009/07/25_21:32:55 INFO: eval \
        ifconfig eth0:0 192.168.1.20 netmask 255.255.255.0 broadcast 192.168.1.255
IPaddr[5089]:   2009/07/25_21:32:55 INFO:  Success
ResourceManager[5006]:  2009/07/25_21:32:55 \
        info: Running /etc/init.d/apache2  start
mach_down[4980]:        2009/07/25_21:32:58 info: \
        mach_down takeover complete for node prime.
heartbeat[4241]: 2009/07/25_21:33:05 WARN: node prime: is dead
heartbeat[4241]: 2009/07/25_21:33:05 info: Dead node prime gave up resources.
heartbeat[4241]: 2009/07/25_21:33:05 info: Resources being acquired from prime.
heartbeat[4241]: 2009/07/25_21:33:05 info: Link prime:eth0 dead.
harc[5258]:     2009/07/25_21:33:06 info: Running /etc/ha.d/rc.d/status status
heartbeat[5259]: 2009/07/25_21:33:06 info: \
        No local resources [/usr/share/heartbeat/ResourceManager \
        listkeys calc] to acquire.
mach_down[5287]:        2009/07/25_21:33:06 info: \
        Taking over resource group 192.168.1.20
ResourceManager[5313]:  2009/07/25_21:33:06 \
        info: Acquiring resource group: prime 192.168.1.20 apache2
IPaddr[5340]:   2009/07/25_21:33:06 INFO:  Running OK
mach_down[5287]:        2009/07/25_21:33:07 \
        info: mach_down takeover complete for node prime.

And a quick check with lynx:

#  lynx --dump 192.168.1.20
   calc

Note that once prime is back online, calc gives control back:

ResourceManager[5515]:  2009/07/25_21:33:43 info: \
        Releasing resource group: prime 192.168.1.20 apache2
ResourceManager[5515]:  2009/07/25_21:33:43 info: \
        Running /etc/init.d/apache2  stop
ResourceManager[5515]:  2009/07/25_21:33:44 info: \
        Running /etc/ha.d/resource.d/IPaddr 192.168.1.20 stop
IPaddr[5592]:   2009/07/25_21:33:44 INFO: ifconfig eth0:0 down
IPaddr[5575]:   2009/07/25_21:33:44 INFO:  Success
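
A final check with lynx should then show the primary's page once again:

#  lynx --dump 192.168.1.20
   prime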

Summary

The example shown here is very primitive; ideally there would be a dedicated NIC on each machine for this function, or a serial connection between them. Also the services are not exactly exotic, but this should be enough to get a Linux HA setup off the ground.
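
As a rough sketch, either option amounts to a small ha.cf change on both nodes - eth1 and ttyS0 below are assumptions about the hardware at hand:

# heartbeat over a dedicated second NIC instead of eth0
bcast   eth1
# or over a null-modem serial cable between the two machines
serial  /dev/ttyS0
baud    19200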