April 28, 2014 Outage Root Cause Analysis
Summary & Root Cause
Beginning at 23:00Z on April 28th, 2014, AllisonHouse experienced an unexpected service outage. The outage was notable because it occurred during a period of heavy use and while our Provider was experiencing a separate, unrelated outage. The outage extended through 23:58Z, with full recovery achieved by 00:15Z on April 29th, 2014.
A critical server became overloaded during an active weather period. Memory was exhausted, processes hung, and the server became inaccessible by normal means. Our SLA with our Provider guarantees that a human will begin working to restore service within 5 minutes of assignment. Due to the outage experienced by our Provider, they could not reach their customers' servers for upwards of 50 minutes. By the time our Provider was able to reach our server, we had regained access through normal means and were in the process of rebooting the machine.
Why was this outage SO LONG?!
Our standard policy is to be on top of outages the minute we receive notification from our monitoring systems. Notifications were received within 45 seconds of the first radar file becoming inaccessible to our pool of web servers. On-call staff responded immediately. After first responders were unable to issue a ticket to the Provider through the Customer Portal, we dialed a hotline that put us in touch with the Managed Hosting Systems Administrator (MHSA) in charge of our account. (This person is familiar with the ins and outs of the AllisonHouse infrastructure and the unique demands of our business.)
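The 45-second detection window above comes from a freshness test on the radar feed. AllisonHouse's actual monitoring system is not public, so the threshold constant and function below are illustrative assumptions, a minimal sketch of the kind of check that fired:

```python
import time

# Hypothetical freshness check. The 45-second threshold matches the detection
# window described in this RCA; the rest is an illustrative assumption, not
# AllisonHouse's real monitoring code.
STALE_AFTER_SECONDS = 45.0

def radar_file_is_stale(last_fetch_epoch: float, now_epoch: float,
                        threshold: float = STALE_AFTER_SECONDS) -> bool:
    """Return True when the newest radar file is older than the threshold."""
    return (now_epoch - last_fetch_epoch) > threshold

# A file fetched 10 seconds ago is fine; one 60 seconds old triggers a page.
assert not radar_file_is_stale(1000.0, 1010.0)
assert radar_file_is_stale(1000.0, 1060.0)
```

In practice a check like this would run from several vantage points so that a single monitoring node's network trouble does not page the on-call staff.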
The MHSA's response was that he, too, was unable to reach the Customer Portal. This Portal serves as a central repository for all kinds of information for the Provider, including IPs, usernames, and passwords for our servers, as well as our "Recovery Interface," a separate card in each machine that allows Remote Console access in the event the machine is inaccessible through normal means. The MHSA was forced to end the phone call to determine an ETA for Customer Portal accessibility so that he could reach the server and recover it. Phone calls were escalated up the chain of command while AllisonHouse employees continued trying to recover the critical server through standard access methods.
AllisonHouse's outage procedure calls for routing around a failed server within 15 minutes of the first notice of an outage from our Global Monitoring Systems. Our inability to route around the server stemmed from the Provider's Customer Portal outage, since the Portal is where our DNS can be updated. This meant the ONLY solution was to recover the server to restore service. That could have happened in one of two ways:
- The Customer Portal returned so that the Provider's MHSA could access the server via its Recovery Interface.
- We reached the server through standard access methods and it responded to our commands.
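"Routing around" a failed server, as described above, amounts to republishing the DNS round-robin without the dead host. The helper below is a hedged sketch of that selection step only; the hostnames use documentation IP addresses and are not AllisonHouse's real topology, and the actual publish step would go through the DNS provider's API:

```python
# Hypothetical failover helper: given a web-server pool and the set of failed
# hosts, return the A records that should remain published so traffic routes
# around the failure. Names and IPs (RFC 5737 documentation range) are
# illustrative, not AllisonHouse's real infrastructure.

def route_around(pool: dict, failed: set) -> list:
    """Return IPs of healthy servers, in stable order, for the DNS round-robin."""
    return [ip for host, ip in sorted(pool.items()) if host not in failed]

pool = {
    "web1": "203.0.113.10",
    "web2": "203.0.113.11",
    "web3": "203.0.113.12",
}
# With web2 down, only web1 and web3 stay in the published record set.
assert route_around(pool, {"web2"}) == ["203.0.113.10", "203.0.113.12"]
```

The whole procedure is only as available as the system that publishes the records, which is exactly what failed here: the records could not be changed while the Provider's Customer Portal was down.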
We reached the server at 23:43Z and issued a restart command; the server responded 5 minutes after the command was sent. The MHSA received a phone call at 23:45Z stating that the Provider's Customer Portal had returned to service.
What are you doing to make sure this never happens again?
1. DNS will be moved away from the Provider to de-centralize it. This will allow changes to be made in the unlikely case that both the Provider's Customer Portal and an AllisonHouse server go down.
This is to be completed within 30 days of this RCA being published. This post will be updated to notify AllisonHouse customers of completion.
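One way to verify that item 1 is actually complete is to check that the zone's nameserver set spans more than one provider. This is a minimal sketch under that assumption; the provider names are placeholders, and a real check would fetch the NS records with a resolver library rather than take them as input:

```python
# Hedged sketch of what "de-centralized DNS" buys: a zone survives a single
# provider outage only if its NS records span more than one provider domain.
# Nameserver names here are illustrative placeholders.

def ns_providers(ns_records: list) -> set:
    """Map nameserver hostnames to their provider domains (last two labels)."""
    return {".".join(ns.rstrip(".").split(".")[-2:]) for ns in ns_records}

# All eggs in one basket: a single provider outage takes the zone with it.
assert len(ns_providers(["ns1.providerA.com.", "ns2.providerA.com."])) == 1
# De-centralized: two independent providers serve the zone.
assert len(ns_providers(["ns1.providerA.com.", "ns1.otherdns.net."])) == 2
```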
2. Provider was notified that their Managed Hosting Policy was no longer acceptable to AllisonHouse’s Business Operations. Provider was told to relinquish Recovery Interface credentials (Interface IP, Username, Password, etc.) to AllisonHouse in order to de-centralize the information. Provider’s Managed Hosting Policy is being amended to internally de-centralize the information so that MHSAs are able to access the Recovery Interfaces during a Customer Portal outage.
Provider sent Recovery Interface credentials on Thursday, May 1st, 2014. Credentials were tested and added to AllisonHouse’s standard recovery procedure. Provider’s Managed Hosting teams will internally de-centralize Recovery Interface credentials on their own accord.
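With the Recovery Interface credentials now in hand, the recovery step the MHSA could not perform becomes available to AllisonHouse staff directly. Out-of-band management cards of this kind commonly speak IPMI; the sketch below builds an `ipmitool` power-reset command under that assumption, with a placeholder interface IP and username, and is not a statement of which vendor interface the servers actually use:

```python
# Sketch of the recovery step added to AllisonHouse's standard procedure:
# a power reset through the server's out-of-band Recovery Interface. Assumes
# an IPMI-style card reachable via ipmitool; the IP (RFC 5737 documentation
# range) and username are placeholders, not real credentials.

def recovery_reset_command(interface_ip: str, username: str) -> list:
    """Build the ipmitool argument list for a remote chassis power reset.

    The password is deliberately omitted: ipmitool's -E flag reads it from
    the IPMI_PASSWORD environment variable, so it never lands in source,
    shell history, or process listings.
    """
    return [
        "ipmitool", "-I", "lanplus",   # IPMI-over-LAN session
        "-H", interface_ip,            # Recovery Interface address
        "-U", username, "-E",          # user; password from environment
        "chassis", "power", "reset",   # hard reset of the hung machine
    ]

cmd = recovery_reset_command("192.0.2.50", "ops-oncall")
assert cmd[0] == "ipmitool" and cmd[-3:] == ["chassis", "power", "reset"]
```

During an incident this would be run (or wrapped by a runbook script) only after normal-access recovery attempts fail, since a hard reset discards anything the hung server had in flight.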
3. The Provider's monitoring system relies too heavily on their Customer Portal. The MHSA was unaware of the Portal outage when he was reached via the phone hotline. The Provider was sent several stern emails outlining how they had failed us, their customer, and were continuing to do a disservice to their other customers by allowing such a system to impact them so heavily.
We suggested the Provider maintain backups for such a critical system.
Affected Services: Level II/III Radar, Warnings, Feeds (Incl. Lightning), GRearth, Website & Customer Portal
Total Downtime: 58 minutes (Down Notification to Up Notification by Global Monitoring Systems)