
11/12/2008

BigIP: Weird ARP Problem

Have you ever failed over a BigIP, or run configsync on your BigIP cluster, and found that some of your VIPs were no longer reachable or even pingable once the operation completed, only to have everything suddenly start working again 4 hours later? Well, it is not your fault (unless you also manage the switches on your network. In that case...*cough* *cough*).

I'm not usually one to go into detail on lower-level protocol info, because the slightest mistake results in a litany of comments about how the entire article must be wrong because I used the term "sub net" instead of "subnet" or something similar. So I will try to keep this one short and sweet.

The symptoms have already been described: fail over, or sync the configuration with the standby unit (which also has the interesting effect of causing a fast failover/failback), and when all is said and done, a minority of your VIPs are no longer reachable. They cannot be pinged. Scratch your head for a while, and after 4 hours those VIPs are suddenly back, working again as if nothing had happened. If you view the ARP cache on your primary switch (assuming all failover events have completed), you will see the MAC address of the standby unit listed for the IPs that are no longer pingable. Since the default ARP cache timeout on the switch is 4 hours, those stale entries eventually age out, which is why the affected VIPs become pingable again after 4 hours even if you do nothing.
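As a rough way to spot the stale entries described above, here is a minimal Python sketch that parses the text of a Cisco-style `show ip arp` listing and reports which VIPs are cached against a MAC other than the active unit's. The sample output, IP addresses, MAC addresses, and function names are all made up for illustration; adjust the pattern if your switch prints the table differently.

```python
import re

# Hypothetical sample of "show ip arp" output; every address and MAC is invented.
SAMPLE_ARP_OUTPUT = """\
Protocol  Address          Age (min)  Hardware Addr   Type   Interface
Internet  10.0.0.10               12  0001.d7aa.0001  ARPA   Vlan10
Internet  10.0.0.11              187  0001.d7bb.0002  ARPA   Vlan10
Internet  10.0.0.12                3  0001.d7aa.0001  ARPA   Vlan10
"""

# One table row: protocol, IP address, age (a number or "-"), MAC address, type.
ARP_LINE = re.compile(
    r"^Internet\s+(?P<ip>\d+\.\d+\.\d+\.\d+)\s+(?P<age>\S+)\s+"
    r"(?P<mac>[0-9a-fA-F.]+)\s+ARPA"
)


def parse_arp_table(text):
    """Return {ip: mac} for every Internet/ARPA row in the listing."""
    table = {}
    for line in text.splitlines():
        match = ARP_LINE.match(line.strip())
        if match:
            table[match.group("ip")] = match.group("mac").lower()
    return table


def stale_vips(arp_table, vips, active_mac):
    """List the VIPs whose cached MAC is not the active unit's MAC."""
    active_mac = active_mac.lower()
    return sorted(
        ip for ip in vips
        if ip in arp_table and arp_table[ip] != active_mac
    )


if __name__ == "__main__":
    table = parse_arp_table(SAMPLE_ARP_OUTPUT)
    vips = ["10.0.0.10", "10.0.0.11", "10.0.0.12"]
    # Assumes 0001.d7aa.0001 is the MAC of the currently active unit.
    print(stale_vips(table, vips, "0001.d7aa.0001"))
```

With the sample data, only 10.0.0.11 is reported, because it is the one VIP still cached against the other (standby) MAC.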

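This is because you have spanning tree enabled on the switch ports that your BigIPs connect to. The likely mechanism is that a port without portfast spends time in spanning tree's listening and learning states after a topology change and does not forward traffic during that window, so the ARP update sent by the newly active unit can be dropped. Assuming you are running a Cisco network, enabling portfast (`spanning-tree portfast` on an IOS interface, or maybe even the port host setting) on the switch ports that your BigIPs use should prevent the problem from ever happening again.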

If you need to wait five years for someone in your network engineering group to believe that it really is a switch problem (as I did), you will want an interim recovery procedure. Go into the "Virtual Address List" section under "Virtual Servers" in the BigIP Web Administration tool of the BigIP that is currently the primary unit. Then:

1. Select the failing IP address from the list.
2. Deselect the "Enabled" checkbox in the ARP section.
3. Click the "Update" button.
4. Re-select the "Enabled" checkbox in the ARP section.
5. Click "Update" again.

This makes the primary unit re-announce its MAC address for that IP via ARP, overwriting the stale entry in the switch's ARP cache. Never forget, though, that this is a bandage and that enabling portfast on those switch ports is the real solution.
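To make "re-ARP" a little more concrete, here is a minimal Python sketch that builds the bytes of a generic gratuitous ARP frame, the kind of unsolicited announcement a host uses to say "this IP now lives at this MAC". This is an illustration of the general frame layout only; I am not claiming it is byte-for-byte what the BigIP sends, and the function names, MAC, and IP are hypothetical. The code only constructs the frame and does not put anything on the wire.

```python
import socket
import struct


def mac_to_bytes(mac):
    """Convert 'aa:bb:cc:dd:ee:ff' into 6 raw bytes."""
    raw = bytes.fromhex(mac.replace(":", ""))
    if len(raw) != 6:
        raise ValueError("MAC address must be exactly 6 bytes")
    return raw


def build_gratuitous_arp(mac, ip):
    """Build an Ethernet frame carrying a gratuitous ARP request for ip."""
    sender_mac = mac_to_bytes(mac)
    sender_ip = socket.inet_aton(ip)
    broadcast = b"\xff" * 6

    # Ethernet header: broadcast destination, our source MAC, EtherType 0x0806 (ARP).
    ethernet = broadcast + sender_mac + struct.pack("!H", 0x0806)

    # ARP header: hardware type 1 (Ethernet), protocol 0x0800 (IPv4),
    # hardware length 6, protocol length 4, opcode 1 (request).
    arp = struct.pack("!HHBBH", 1, 0x0800, 6, 4, 1)
    arp += sender_mac + sender_ip      # sender hardware and protocol address
    arp += b"\x00" * 6 + sender_ip     # target MAC zeroed, target IP equals sender IP

    return ethernet + arp


if __name__ == "__main__":
    # Hypothetical active-unit MAC and VIP, for illustration only.
    frame = build_gratuitous_arp("00:01:d7:aa:00:01", "10.0.0.11")
    print(len(frame), frame.hex())
```

The detail that makes it "gratuitous" is that the sender and target IP are the same address, so every device that hears the broadcast can refresh its cached MAC for that IP.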

This article focuses on the BigIP as the victim, but it could really be any load balancer, and the symptoms don't change: some network event occurs and then some of your IPs are no longer reachable. 4 hours later, everything's great. Spanning tree enabled on the PC/server port is the cause.
