House Calls


New AWOS sites online in Virginia!

Hello everyone,

In my last post on June 6, you saw the list of new METARs that came online on the FAA database “reset” day. Since then, several new AWOS sites have come online in Virginia. Those that have appeared over the last two days include:

KBKT – Fort Pickett / Blackstone, VA
KLVL – Lawrenceville, VA
KW81 – Crewe, VA

One more is due soon:

KW31 – Kenbridge, VA

Watch for that one soon, plus any others around the country as they come online.

Time to do a reset!

Hello everyone,

Gilbert here. I hope you have thawed out from this ridiculously cold winter!

Today, we begin a monthly series on weather or weather-related topics. Sometimes, we’ll take you behind the scenes of how the weather “data” machine works, like today.

When you look at GRx placefiles, or Pykl3, for example, you can see the METAR reports plotted on your screen: these are the weather reports sent from the various airports throughout the country. METAR stands for Meteorological Terminal Aviation Routine Weather Report. These typically give the temperature, dewpoint, and wind direction and speed; many give sky cover, visibility, pressure, and any precipitation that is falling. Some staffed sites provide comments on the sky condition that give a more in-depth view of what the observer is seeing.
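To make those fields a little more concrete, here is a minimal Python sketch that pulls the wind and temperature/dewpoint groups out of a single report. The sample report is made up for illustration (it is not a real observation), the function name `parse_metar` is hypothetical, and the parsing covers only the most common forms of these two groups, so treat it as a rough sketch rather than a full METAR decoder.

```python
import re

def _celsius(group):
    # METAR marks below-zero values with a leading "M" (e.g. "M05" is -5 C).
    return -int(group[1:]) if group.startswith("M") else int(group)

def parse_metar(report):
    """Extract station, wind, and temperature/dewpoint from a METAR string."""
    result = {"station": report.split()[0]}

    # Wind group: 3-digit direction (or VRB), 2-3 digit speed, optional gust, "KT".
    wind = re.search(r"\b(\d{3}|VRB)(\d{2,3})(?:G(\d{2,3}))?KT\b", report)
    if wind:
        result["wind_dir"] = wind.group(1)
        result["wind_kt"] = int(wind.group(2))

    # Temperature/dewpoint group: two 2-digit values separated by a slash.
    temps = re.search(r"\s(M?\d{2})/(M?\d{2})\s", report + " ")
    if temps:
        result["temp_c"] = _celsius(temps.group(1))
        result["dewpoint_c"] = _celsius(temps.group(2))

    return result

# Invented sample report, for illustration only.
sample = "KBKT 061455Z AUTO 18006KT 10SM CLR 24/17 A3002"
parsed = parse_metar(sample)
print(parsed)
```

For the invented sample above, the sketch reports a wind from 180 degrees at 6 knots, a temperature of 24 C, and a dewpoint of 17 C.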

In the United States, these weather reports are routed by the FAA to something called NADIN. NADIN is a secure transmission protocol for METAR reports, and the National Weather Service takes METARs from that feed and rebroadcasts them to the world via satellite. Data providers may also get the feed directly, as we do at AllisonHouse.

Every 56 days, the FAA updates its database so that any new sites that have come online in that time period can start reporting METARs. Until that update, reports from new sites are filtered out! It sounds crazy, but that’s how the FAA does it. If a station starts reporting METARs and is already in the FAA database, there’s no problem; its reports are sent on the feed immediately. Otherwise, it has to wait for the next 56-day update before it can be added. More interesting: once the new sites are added on the reset day, it can take weeks, months, or even years for the National Weather Service to add them to its data feed! That’s why we at AllisonHouse get them directly from the FAA NADIN feed. Once they are available, we plot them for you immediately!
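If you want to guess when the next batch of new sites might show up, the arithmetic is simple. Here is a small Python sketch that steps forward in 56-day increments from a known reset date. It assumes the cycle is exactly 56 days with no exceptions and uses the June 5, 2014 reset mentioned in this post as its starting point; the function name `next_resets` is just something made up for this example.

```python
from datetime import date, timedelta

# Assumption: the FAA database cycle is exactly 56 days, every time.
CYCLE = timedelta(days=56)

def next_resets(last_reset, count):
    """Return the next `count` reset dates after `last_reset`."""
    return [last_reset + CYCLE * (i + 1) for i in range(count)]

# Starting point taken from this post: the June 5, 2014 reset.
upcoming = next_resets(date(2014, 6, 5), 3)
for day in upcoming:
    print(day.isoformat())
# 2014-07-31
# 2014-09-25
# 2014-11-20
```

A site that starts reporting the day after a reset, and is not already in the database, would wait close to the full 56 days under this assumption.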

This “reset” of the database happened on June 5, 2014. And, a bunch of new METAR sites were added! Some were also in the database and added just before the reset. Here are 29 new sites to look for now, with this list compiled nicely by Boris Konon (thanks, Boris!):

0V4 – Brookneal VA
1A9 – Prattville AL
1L0 – Reserve LA
3N8 – Mahnomen MN
4I7 – Greencastle IN
8W2 – New Market VA
CXE – Chase City VA
ELK – Elk City OK
FYM – Fayetteville TN
GNF – Grenada MS
GUR – Guernsey WY
GVE – Gordonsville VA
LCQ – Lake City FL
MEV – Minden NV
MOR – Morristown TN
MYZ – Marysville KS
PAAD – Deadhorse/Point Thomson AK
PAUT – Akutan AK
PHT – Paris TN
RCV – Del Norte CO
RNC – McMinnville TN
RZT – Chillicothe OH
SNH – Savannah TN
SXS – Fort Rucker/Shell AL
SYI – Shelbyville TN
W75 – Saluda VA
W78 – South Boston VA
W96 – Quinton VA
Y49 – Walker MN

These are all in our database, and you should see them now on all GRx placefiles. Enjoy, and watch for new ones from Tennessee and Kentucky, amongst other places, in the future!

April 28, 2014 Outage Analysis

April 28, 2014 Outage Root Cause Analysis

Summary & Root Cause
Beginning at 23:00Z on April 28th, 2014, AllisonHouse experienced an unexpected service outage. This outage was unique because it occurred during a period of heavy use and while our provider was experiencing a separate, unrelated outage. This outage extended through 23:58Z, with full recovery achieved by 00:15Z on April 29th, 2014.

A critical server became overloaded during an active weather period. Memory was exhausted, processes hung, and the server became inaccessible by normal means. Our SLA with our Provider guarantees a human will begin working to restore service within 5 minutes of assignment. Due to the outage experienced by our Provider, they could not reach their customers’ servers for upwards of 50 minutes. By the time our Provider was able to reach our server, we had gained access through normal means and were in the process of rebooting the machine.

Why was this outage SO LONG?!
Our standard policy is to be on top of outages the minute we receive notification from our monitoring systems. Notifications were received within 45 seconds of the first radar file being inaccessible to our pool of web servers. On-call staff responded immediately. After first responders were unable to issue a ticket to the Provider through the Customer Portal, a hotline was dialed which put us in touch with the Managed Hosting Systems Administrator (MHSA) in charge of our account. (This person is familiar with the ins-and-outs of the AllisonHouse infrastructure and the unique demands of our business.)

MHSA’s response was that he too was unable to reach the Customer Portal. This Portal serves as a central repository for all kinds of information for the Provider. This includes IPs, usernames and passwords to our servers, as well as our “Recovery Interface,” which is a separate card in each machine that allows Remote Console access in the event the machine is inaccessible through normal means. MHSA was forced to end the phone call to determine ETA on Customer Portal accessibility so that he could reach the server and recover it. Phone calls were extended up the chain of command while AllisonHouse employees continued to try to recover the critical server through standard access methods.

AllisonHouse’s outage procedure calls for routing around a failed server within 15 minutes of the first notice of an outage from our Global Monitoring Systems. Our inability to route around the server stemmed from the outage of Provider’s Customer Portal, which is where our DNS is updated. This meant the ONLY solution was to recover the server to restore service. This could have happened one of two ways:

  1. Customer Portal returned so that Provider’s MHSA could access the server via its Recovery Interface.
  2. We were able to reach the server through standard access methods and the server responded to our commands.

We were able to reach the server at 23:43Z and get it to respond to a restart command 5 minutes after the command was sent. A phone call was received from MHSA at 23:45Z stating that Provider’s Customer Portal had returned to service.

What are you doing to make sure this never happens again?
1. DNS will be moved away from Provider to de-centralize it. This will allow changes to be made in the unlikely case that both Provider’s Customer Portal and an AllisonHouse server go down.
This is to be completed within 30 days of the RCA being published. This post will be updated to notify AllisonHouse customers of completion.

2. Provider was notified that their Managed Hosting Policy was no longer acceptable to AllisonHouse’s Business Operations. Provider was told to relinquish Recovery Interface credentials (Interface IP, Username, Password, etc.) to AllisonHouse in order to de-centralize the information. Provider’s Managed Hosting Policy is being amended to internally de-centralize the information so that MHSAs are able to access the Recovery Interfaces during a Customer Portal outage.
Provider sent Recovery Interface credentials on Thursday, May 1st, 2014. Credentials were tested and added to AllisonHouse’s standard recovery procedure. Provider’s Managed Hosting teams will internally de-centralize Recovery Interface credentials of their own accord.

3. Provider’s monitoring system relies too heavily on their Customer Portal. MHSA was unaware of outage when he was reached via the phone hotline. Provider was sent several stern emails outlining how they had failed us, their customer, and were continuing to do a disservice to their other customers by allowing such a system to impact them so heavily.
We suggested that Provider maintain backups for such a critical system.

Conclusion
Services Impacted:
Level II/III Radar, Warnings, Feeds (Incl. Lightning), GRearth, Website & Customer Portal
Outage Length:
58 Minutes – Down Notification to Up Notification by Global Monitoring Systems.
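
The figures in this RCA can be cross-checked with a few lines of Python. The sketch below uses only the timestamps quoted above (23:00Z down, 23:58Z up, 00:15Z full recovery) and the 15-minute rerouting target from our outage procedure; the variable and function names are made up for this example. Note that 00:15Z falls on the following UTC day.

```python
from datetime import datetime, timezone

UTC = timezone.utc

# Timestamps quoted in the RCA above.
down = datetime(2014, 4, 28, 23, 0, tzinfo=UTC)           # outage begins
up = datetime(2014, 4, 28, 23, 58, tzinfo=UTC)            # "up" notification
full_recovery = datetime(2014, 4, 29, 0, 15, tzinfo=UTC)  # next UTC day

# Target from the outage procedure: route around a failed server in 15 minutes.
REROUTE_TARGET_MINUTES = 15

def minutes_between(start, end):
    """Whole minutes elapsed between two timestamps."""
    return int((end - start).total_seconds() // 60)

outage_minutes = minutes_between(down, up)
recovery_minutes = minutes_between(down, full_recovery)
overrun_minutes = outage_minutes - REROUTE_TARGET_MINUTES

print(f"Outage length: {outage_minutes} minutes")
print(f"Time to full recovery: {recovery_minutes} minutes")
print(f"Over the reroute target by: {overrun_minutes} minutes")
```

This reproduces the 58-minute outage length reported above and shows the outage ran 43 minutes past the 15-minute rerouting target.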