
Internal Server Errors

Written on November 16th, 2012.

Several customers have been seeing an "Internal Server Error" on their websites. Our systems administrator has been working on the issue and has fixed most of the affected websites. If you are seeing this error, please hold tight; your site will be working again soon.

UPDATE: All issues have been resolved and all websites are up and running now.

We have a report on the issues, from our products and infrastructure manager:

On Friday we completed a migration of our in-house release deployment tool's revision control system from Subversion to GitHub.
 
This release, while well tested in our test environment, and even on a smaller scale in production, had some unanticipated consequences when rolled out across our 100+ production servers, all of which compounded to cause cascading errors across our network:
 
(a) Since all servers needed to pull the complete version of our software again from the GitHub servers, there was a massive spike in disk IO and network usage, correlated across all servers, which slowed everything down to the point of crashing
 
(b) A number of servers failed to install the release, because differing server versions/operating systems had subtle incompatibilities with the software used by the new release manager
 
(c) Due to (a), some release installs timed out
 
(d) Due to all these effects, our monitoring system (and the humans on the other end of it) was maxed out, and it took some time to discern all the factors that were causing errors (no single factor was responsible for all errors, and they were masking each other)
 
(e) These effects, combined with the high load throughout the system from (a), made it extremely time-consuming to roll back the changes (normally a near-instant operation) and led to race conditions where it was impossible to roll back, since the release hadn't completed and likely was never going to.
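
To illustrate why (a) hurt so much, here is a minimal sketch of a staggered rollout schedule. It is not our release tool: the function names, host names, batch size, and timings are all hypothetical and chosen only for illustration. It shows how releasing servers in small batches, each with a little random delay, caps the number of simultaneous pulls compared with starting every server at the same moment.

```python
import random


def plan_rollout(servers, batch_size, batch_gap_s, jitter_s, seed=None):
    """Return a list of (server, start_offset_seconds) pairs.

    Servers are released in batches of `batch_size`. Each batch starts
    `batch_gap_s` seconds after the previous one, and every server gets an
    extra random delay of up to `jitter_s` seconds so that pulls within a
    batch do not all begin in the same instant.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    rng = random.Random(seed)
    schedule = []
    for index, server in enumerate(servers):
        batch_number = index // batch_size
        offset = batch_number * batch_gap_s + rng.uniform(0, jitter_s)
        schedule.append((server, offset))
    return schedule


def peak_concurrency(schedule, pull_duration_s):
    """Largest number of pulls in progress at once, assuming every pull
    takes exactly `pull_duration_s` seconds (a simplification)."""
    events = []
    for _, start in schedule:
        events.append((start, 1))                     # pull begins
        events.append((start + pull_duration_s, -1))  # pull ends
    # At equal timestamps, process ends before starts.
    events.sort(key=lambda event: (event[0], event[1]))
    running = peak = 0
    for _, delta in events:
        running += delta
        peak = max(peak, running)
    return peak


if __name__ == "__main__":
    servers = [f"web{n:03d}" for n in range(100)]  # hypothetical host names
    all_at_once = [(server, 0.0) for server in servers]
    staggered = plan_rollout(servers, batch_size=10, batch_gap_s=120,
                             jitter_s=30, seed=1)
    print("all at once:", peak_concurrency(all_at_once, pull_duration_s=60))
    print("staggered:  ", peak_concurrency(staggered, pull_duration_s=60))
```

With these made-up numbers, the batch gap is longer than the jitter plus the pull time, so batches never overlap and at most one batch of servers is pulling at any moment. The trade-off is a slower rollout overall, and real pull times under load are not constant, so the numbers would need tuning against actual measurements.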
 
Essentially, everything that could go wrong did go wrong. Thankfully, since the GitHub deployment completed, it has performed well. All these issues were one-off problems related to the changeover and won't happen again.
 
 
