Skype As A Lesson

Skype is currently having some sort of “software problem” which is apparently limiting them to about 30% of their normal service. This is one more example of how “the old ways,” or at least some of the older philosophies toward system reliability, are sometimes better. I would not rely on a service like Skype for phone service into my house or business. While my Internet connection into the house has been reasonably reliable, it is still only about 90% reliable, so I also wouldn’t depend on it for phone service. Even my mobile phone, while far more reliable than either of those other services, still sees occasional drops and outages. But I would still depend more on a mobile phone than on any of the current Internet-based services, simply because those services do not seem to be run by people who know how to run reliable phone service. For truly reliable voice communication, for now the standard is POTS.

The Standard for Phone Service

How reliable is a regular POTS (Plain Old Telephone Service) line supposed to be? When I worked at Bell-Northern Research, the one goal I had heard a few times was no more than 40 minutes of downtime in 40 years. That would equate to an uptime of about 99.9998%. What makes that type of goal realistic? A number of things.
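The arithmetic behind that figure is easy to check. Here is a minimal sketch in Python; the function name and the choice of a 365.25-day average year are my own assumptions for illustration, not anything from BNR.

```python
# Assumption: an average year of 365.25 days, to account for leap years.
MINUTES_PER_YEAR = 365.25 * 24 * 60


def uptime_percent(downtime_minutes: float, period_years: float) -> float:
    """Return uptime as a percentage, given total downtime over a period."""
    total_minutes = period_years * MINUTES_PER_YEAR
    return 100.0 * (1.0 - downtime_minutes / total_minutes)


# The goal quoted above: 40 minutes of downtime in 40 years.
pots_goal = uptime_percent(downtime_minutes=40, period_years=40)
print(f"{pots_goal:.4f}%")  # prints 99.9998%
```

Put another way, the goal works out to a budget of about one minute of downtime per year.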

First is the design and architecture of the hardware and software. Telephone switches are special-purpose devices, with built-in redundancy in many, many places. They are built to withstand not just run-of-the-mill problems like heat and cold, but also moisture and severe physical and mechanical events. (The guys at Product Integrity at BNR had the coolest job. They had an earthquake simulator that they would bolt a switch to, start up call load testing, then turn it on. The switch processed thousands of calls per minute while shaking and waving about under a magnitude 5 quake.) The switches include massive banks of batteries that can run the switch for a few days, and that is coupled with the usual infrastructure for power to keep them running for a long, long time without the mains.

The software is designed and coded to be equally robust. Again, redundancies, a focus on error testing and error correction as well as some focus on resource usage make for software that is quite reliable.

The next major step is testing. This is both software testing and hardware testing. At BNR, there were as many testers as there were developers, and developers had to do an extensive amount of their own testing before the code could be submitted for formal testing. Code was reviewed extensively. Every line of code, and every possible code path, had to have a test case. It was exhaustive, but it was also tedious, at least when viewed by an outsider. It certainly didn’t make for a high level of productivity (when I was there in the early 1990s, the one metric I heard was that a developer would spend maybe 1 day actually coding, and the rest was spent in reviews, testing and documentation), and it was a process that needed a rethink. But it did result in code that simply worked, since people’s lives depended on it.

Hardware testing, as I alluded to earlier, was also quite rigorous. Switches were tested, powered and under load, in special environmental chambers that could simulate extremes of temperature and humidity. There was the earthquake simulator. There was even a truck simulator, where the switch would be crated and packaged for shipment, loaded on the “test truck” and then “driven” down a variety of roads for hours or days. The switch was then uncrated and every part inspected to see how it survived shipment, and identify areas to improve durability.

The Level of Rigor For Internet Services

Let’s face it, we all know how reliable most Internet-based services are, or more to the point, are not. The fact that Skype has now been effectively out for 24 hours and counting does not speak well for their uptime numbers. This outage, for a single year, would reduce their uptime to about 99.73% (as of this writing; it gets worse the longer it goes on). If you are Amazon or eBay, that level of outage may be acceptable (or at least tolerable) because it isn’t as if lives may be at stake. But to accept that as adequate for voice communications, particularly if that is the only one you have in your house, is compromising far too much. Fortunately, Skype doesn’t promote itself as a home phone service. But their outage isn’t atypical for Internet-based services. Imagine if you had an Internet phone, and you needed to call 911 to get help. But you pick up the handset, and there is no dial tone, because this is the hour that your cable-based Internet happens to be out. Or the servers that provide the IP telephony support are down because of a software defect or a Denial of Service attack. These systems are vulnerable to a variety of issues that telephone switches generally don’t have to face.
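To put the outage in perspective against the POTS goal, here is a rough sketch in Python. The function names are hypothetical, it assumes a 365-day year, and it assumes the outage is the only downtime in that year. The one-minute-per-year budget comes from the 40 minutes in 40 years goal mentioned earlier.

```python
# Assumption: a 365-day year with no other downtime besides this outage.
HOURS_PER_YEAR = 365 * 24

# From the goal quoted earlier: 40 minutes of downtime over 40 years.
POTS_BUDGET_MINUTES_PER_YEAR = 40 / 40


def uptime_after_outage(outage_hours: float) -> float:
    """Yearly uptime percentage if a single outage is the only downtime."""
    return 100.0 * (1.0 - outage_hours / HOURS_PER_YEAR)


def years_of_pots_budget(outage_hours: float) -> float:
    """How many years of the POTS downtime budget one outage would use up."""
    return outage_hours * 60 / POTS_BUDGET_MINUTES_PER_YEAR


# Show how the numbers move as the outage drags on.
for hours in (24, 48, 72):
    print(
        f"{hours} h outage: {uptime_after_outage(hours):.3f}% uptime, "
        f"{years_of_pots_budget(hours):.0f} years of POTS budget"
    )
```

A single 24-hour outage lands at roughly 99.7% for the year, and by this measure it consumes 1,440 years' worth of the downtime a phone switch was allowed.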

Most Internet-based services have a far lower standard when it comes to the rigor of code inspection and testing, and for most of those services, that is acceptable. The business need to rapidly add features and deploy them for use is far greater than the availability numbers. People will get upset if a web-based eCommerce site is down and they can’t order their book, movie or music, but they aren’t depending on them for essential services. Not all sites need to aim for five-9s-plus uptime levels.

But in my mind, anything less than the current reliability level of POTS is unacceptable for phone service. To achieve that level, companies offering these types of services need to stop thinking about how rapidly they can push out a feature, and think more about how much time they need to spend on coding and testing.

Will It Always Be This Way?

I am not naive or curmudgeonly enough to think that this is how it will always be. Telephone service in its early days wasn’t nearly as reliable as it is now. Equipment failures, operator errors and environmental factors could cause phones to be out for hours or days in the first few decades after phone service was introduced. The level of reliability we have now didn’t start to appear until the 1940s and 1950s. It took time and experience to build to where we are today.

The same will apply to Internet-based services. At some point, the high-speed Internet I get from my cable provider will also achieve these higher reliability numbers because it will simply have to. Between consumer demand and regulation, the cable and phone companies will have no choice. At some point the people working on the software will come to realize that any bug, no matter how innocuous, is unacceptable. There are other industries that face similar issues. Financial market systems have fairly stringent uptime requirements, and some certainly meet or exceed them. Life-critical systems like avionics have even higher standards when it comes to reliability. In time, services like Skype will come to realize that reliability of the service is more important than adding some new feature.

We are still in the early days of these services. But for now, I will still have a POTS line, even if I decide to add Internet-based phone service. I fully expect some day that POTS will go away, but that time is still somewhere in the future.

