Intermittent problem with some US websites

Written by Courtney Brown on August 16th, 2012.

Several of our customers with US-hosted websites have been having trouble logging in to the Website Manager area. We've been investigating this issue for the past few days, and we've now found the problem and fixed it.

The servers restart every night, and when they do, they disable and then re-enable themselves one at a time. They also sometimes do this during load spikes, so that the site itself isn't affected just because one server is broken or slow.
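The one-at-a-time restart described above can be sketched roughly as follows. This is only an illustration, not our actual restart scripts: the `Server` class, the `rolling_restart` function and the `reenable` callback are hypothetical names made up for the example.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Server:
    name: str
    enabled: bool = True  # True means the server is in rotation
    restarts: int = 0


def rolling_restart(pool: List[Server],
                    reenable: Callable[[Server], bool]) -> List[str]:
    """Restart servers one at a time and return a log of what happened.

    `reenable` is a hypothetical hook that tries to put a server back
    into rotation and returns True on success.
    """
    log = []
    for server in pool:
        server.enabled = False            # take it out of rotation
        log.append(f"disabled {server.name}")
        server.restarts += 1              # stand-in for the real restart
        server.enabled = reenable(server)
        if not server.enabled:
            # Stop here so that at most one server is ever out of rotation.
            log.append(f"STUCK {server.name}")
            break
        log.append(f"enabled {server.name}")
    return log


pool = [Server("web1"), Server("web2"), Server("web3")]
for line in rolling_restart(pool, lambda s: True):
    print(line)
```

Stopping at the first server that fails to come back is a design choice in this sketch: it keeps the rest of the pool serving the site while someone looks into the stuck one.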

For some reason the servers were disabling themselves but not re-enabling as they normally would. At first we put this down to a race condition and tried inserting pauses and delays at key points. In parallel, we investigated the possibility that it was a bug in the Apache webserver we use and tried to find a fix there. That led to us checking the stand-by backup frontend webserver to see whether we could apply an update and fail over to it. This is where we found the real problem: an IP conflict.

In front of the servers that run the site is a load balancer, which directs requests to the appropriate webserver. There are two load balancers, set up so that if one goes down (usually so that it can update itself, which happens regularly) the other one automatically takes over the IP address and carries on serving webpages without interruption.

But in this case there was a breakdown in the connection between the servers and the load balancers, so they both thought they were active and both held the load balancer IP (it looks like the failover software died on one server so the other took over, a so-called "split brain" issue). For it to work properly, only one can be enabled at a time. When both are enabled, it confuses the servers: some users see an error message while others get onto the website fine.
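To make the "split brain" idea concrete, here is a toy model of two load balancers that take over a shared IP whenever they stop hearing from their peer. It is a simplified sketch under that assumption, not the failover software we actually run; `LoadBalancer`, `check_peer` and `holders_of_shared_ip` are names invented for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LoadBalancer:
    name: str
    active: bool  # True means this node currently holds the shared IP

    def check_peer(self, peer_heard: bool) -> None:
        # A node that cannot hear its peer assumes the peer is down
        # and claims the shared IP for itself.
        if not peer_heard:
            self.active = True


def holders_of_shared_ip(nodes: List[LoadBalancer]) -> List[str]:
    """Return the names of every node that believes it owns the shared IP."""
    return [node.name for node in nodes if node.active]


primary = LoadBalancer("lb1", active=True)
standby = LoadBalancer("lb2", active=False)

# Healthy case: both nodes hear each other, so only lb1 holds the IP.
primary.check_peer(peer_heard=True)
standby.check_peer(peer_heard=True)
print(holders_of_shared_ip([primary, standby]))

# Broken link: both nodes are alive but neither hears the other,
# so the standby takes over while the primary stays active.
primary.check_peer(peer_heard=False)
standby.check_peer(peer_heard=False)
print(holders_of_shared_ip([primary, standby]))
```

In the second case both names are returned, which is the conflict: two machines answering for the same address at once.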

Whether you get one server or the other for a particular request is semi-random, which makes the issue intermittent (and means that often you can check and it's fine while another user checking at the same time sees it broken). IP conflicts lead to extremely confusing behaviour because you intermittently reach one server or the other, with no real way to tell which.

To resolve this we've forced the standby server back into standby mode, and we've added monitoring on the US servers to specifically check for IP conflicts, so that if this ever happens again we can resolve it right away without customers being interrupted.
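As a rough illustration of what a check for IP conflicts can look like, the sketch below flags any IP address that has been seen answering from more than one hardware (MAC) address. This is an assumed approach for the sake of example, not a description of our monitoring; `find_ip_conflicts` is a made-up name, and the addresses are placeholder values. How the (IP, MAC) observations are gathered is left out.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def find_ip_conflicts(
    observations: Iterable[Tuple[str, str]]
) -> Dict[str, List[str]]:
    """Map each conflicting IP to the sorted MAC addresses seen claiming it.

    `observations` is a series of (ip, mac) pairs. An IP only counts as
    a conflict if two or more distinct MACs have answered for it.
    """
    seen = defaultdict(set)
    for ip, mac in observations:
        seen[ip].add(mac.lower())  # normalise case so duplicates collapse
    return {ip: sorted(macs) for ip, macs in seen.items() if len(macs) > 1}


# Placeholder data: 192.0.2.10 is answered by two different machines.
sample = [
    ("192.0.2.10", "02:00:00:00:00:01"),
    ("192.0.2.10", "02:00:00:00:00:02"),
    ("192.0.2.11", "02:00:00:00:00:03"),
    ("192.0.2.11", "02:00:00:00:00:03"),
]
print(find_ip_conflicts(sample))
```

An empty result means no conflicts were observed; anything else is worth an alert.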
Topics: Current Status
 
