Ever had an IPv4 network address that is supposed to migrate via a
high availability mechanism simply not move? Or, stranger still, when there
are several addresses, have some migrate while others do not? An experienced
network administrator has probably seen mysterious non-migrating addresses
before; what follows is a rather interesting case where this was observed.
For simplicity, two addresses will be used. The idea is that if a service or server in a two node high availability cluster is detected as down via a heartbeat check, the surviving node takes over the addresses (unless it is the one already holding them). A few details about the test setup are worth noting up front.
Following is a very primitive drawing displaying the configuration:
    GSS (forwards to 192.168.1.41 and 192.168.1.42)
    node A: 192.168.1.30
    node B: 192.168.1.31
    vip1:   192.168.1.41
    vip2:   192.168.1.42
If node A goes offline, node B assumes the two addresses and the services associated with them. It is also important to note that these systems were not in production use, which allowed a great deal of time for troubleshooting.
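The takeover described above can be sketched in shell. This is only an illustration of the idea, not the cluster's actual failover script: the interface name eth0, the /24 netmask, and the use of iproute2 plus arping (to send gratuitous ARP so neighbours learn the new location) are all assumptions. The helper only assembles the command lines, so the logic is visible without root or a live network.

```shell
#!/bin/sh
# Hypothetical sketch of what the surviving node does for each shared
# address during takeover. build_takeover_cmds prints the commands it
# would run rather than executing them.
build_takeover_cmds() {
    vip=$1
    iface=$2
    # Assign the shared address to this node's interface.
    echo "ip addr add $vip/24 dev $iface"
    # Send gratuitous ARP so devices along the path update their
    # tables -- the step the sticky GSS table effectively ignored here.
    echo "arping -U -c 3 -I $iface $vip"
}

for vip in 192.168.1.41 192.168.1.42; do
    build_takeover_cmds "$vip" eth0
done
```

In a real script the printed commands would be executed (as root) instead of echoed; printing them keeps the sketch harmless to run.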
Simply put, when node A was powered down for testing, the addresses would be reassigned to node B but could not be accessed. Node B's physical address (192.168.1.31) worked fine. Letting system administrator instinct take over, the usual first checks were made, to no avail.
So came the time for a little more in-depth sleuthing.
Since it looked like a virtual split brain, that is, only one of the two addresses appeared to be working, the next logical step seemed to be a traceroute. A trace to the working address looked fine, but a trace to the non-working address hung just past the global site selector. Pings showed a similar pattern: the working address answered while the non-working address got no response at all.

Seeing the traceroute hang led to a suspicion: what if the network didn't think the address had moved? To validate this, node A was brought back online but without the shared addresses; they remained assigned to node B. Another ping was kicked off against the non-working address while, at the same time, a tcpdump on node A watched for the ICMP traffic generated by that ping. The results were pretty clear: the ICMP requests showed up on node A even though the address was no longer assigned to it. Somewhere along the network path a device had stale information about where the IP address resided.
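The ping-versus-tcpdump test above is easy to reproduce. The exact commands were not given in the story, so this is a sketch under assumptions (interface name eth0, VIP 192.168.1.42 as the non-working address); a helper assembles the capture command so the filter logic can be checked without root or a live network.

```shell
#!/bin/sh
# Hypothetical reconstruction of the diagnostic: ping the non-working
# VIP from a client while capturing ICMP on node A. If the requests
# appear on node A, something upstream still maps the VIP to node A.
build_capture_cmd() {
    iface=$1
    vip=$2
    # -n: skip name resolution; filter to ICMP to/from the VIP only.
    echo "tcpdump -n -i $iface icmp and host $vip"
}

# On a client machine:      ping 192.168.1.42
# On node A (as root), run the command this prints:
build_capture_cmd eth0 192.168.1.42
```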
Actually, there was no real error. The Cisco Global Site Selector (GSS) keeps a sticky MAC address table. It updated one address properly but not the other. Remember that in reality many more addresses were being shared (roughly 20 or so), and only half of them appeared to function after node B took them over. When node A was returned to operation with the shared addresses, the problem reversed itself: the previously non-working address worked, and vice versa. What is interesting is that this held true for all of the shared addresses.
The real solution to this particular issue would have been to reduce the timeout for those particular addresses or to use some other failover mechanism (which is in fact what the solution ended up becoming). Unfortunately, at the time the systems needed to come back online for application testing, which left the question: how to fix it right now?
This is where the network mapper, nmap, came into play. The immediate problem was that with node A back online, not all addresses had the current MAC address in the global site selector's table, and the table needed to be updated without rebooting the global site selector. Using nmap's MAC spoofing, a scan against the global site selector with the current interface's MAC address did the trick. Following is an example of what it looked like:
sudo nmap -e eth0 --spoof-mac xx:xx:xx:xx:xx:xx -sV -P0 gss_IP_address
Here xx:xx:xx:xx:xx:xx was the MAC address of the physical
network card that the shared addresses were on, and gss_IP_address
was one of the two global site selector addresses.
While tools like nmap are great for seeing what is on a network or
system, they can also be used, as demonstrated, not just to aid in
troubleshooting but to actually help fix issues as well.