All of us have probably heard the saying "Five Nines", which means 99.999% system availability and is the mythical reliability target often quoted as a goal when running a computer system or service. There is a larger debate about what the number covers: whether it is only the "network" or whether it should include applications, servers, etc.
I'm not going down that debate path today, other than to state the obvious: the more nines, the better. Instead, I would like to use the standard availability table to describe one of the hidden realities that currently exists with mobile services. So first, let's start with the two ends of that table:
99.999% availability allows about 5 minutes of downtime per year, while 90% availability allows 36.5 days of downtime per year.
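For readers who want the rows in between, the arithmetic behind the availability table can be sketched in a few lines of Python. This is only an illustration: the function names are made up for this post, and it assumes a 365-day year.

```python
# Minimal sketch of the "nines" arithmetic, assuming a 365-day year.
# The function names are hypothetical, chosen for this example only.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes


def downtime_minutes_per_year(availability_pct: float) -> float:
    """Return the yearly downtime, in minutes, implied by an availability percentage."""
    return (100.0 - availability_pct) / 100.0 * MINUTES_PER_YEAR


def format_downtime(minutes: float) -> str:
    """Express a downtime in the largest convenient unit (days, hours, or minutes)."""
    if minutes >= 24 * 60:
        return f"{minutes / (24 * 60):.2f} days"
    if minutes >= 60:
        return f"{minutes / 60:.2f} hours"
    return f"{minutes:.2f} minutes"


if __name__ == "__main__":
    for pct in (90.0, 99.0, 99.9, 99.99, 99.999):
        print(f"{pct}% -> {format_downtime(downtime_minutes_per_year(pct))} per year")
```

Under that assumption, 90% works out to 36.5 days per year and 99.999% to roughly 5.26 minutes per year, which matches the two figures quoted above.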
So now the question is:
Would it be acceptable for a production mobile service to only be available 90% of the time?
My company has seen examples of this low level of service even from the best-known companies. One of the most problematic areas seems to be SMS services, especially short code programs.
With a short code, you text a special keyword or phrase like "Pizza 98065" to a number similar to a cell phone number. This number routes your keywords to an application, which then returns some form of answer to your handset. There are many examples of short code text messaging programs for looking up stock quotes, checking the weather, looking up an account balance, checking your airline flight, etc.
I think most people would have a "reasonable expectation" that when they send a text message to a short code, some form of answer will be returned to them in a time frame that is useful. If you are trying to look up your checking account balance to find out whether you can use your debit card for a purchase, you would expect the answer to come back in a few minutes or less, not hours after the request was sent.
My company is able to monitor the performance and availability of any type of short code program. We were very surprised to see examples of short code services from major, well-known companies where the success rate of the short code request is at the 90% level or lower. This means that over the course of a year, that service isn't working for more than one full month.
To be fair, the common problems we are seeing don't always come from the actual application behind the short code; often they come from the SMS aggregators processing the messages. The SMS aggregator is the intermediate party (company) sitting between the network operator (AT&T/Sprint/Verizon/T-Mobile) and the actual application that processes the keywords recognized by the short code.
When you vote via text message on a TV show, your text message flows from the network operator to an SMS aggregator, which then routes your text message to the owner of the application processing your vote. The message you get back saying "Thanks for your vote...." follows the reverse path: from the application owner to the aggregator, then back onto the operator network, and finally to your mobile device. There are many other examples of short code promotions that follow this same model.
One of the most common problems we have seen is that the message reply "never comes back", which means you sent your text to a short code but you never receive any type of reply in a reasonable amount of time. Your message has gone into "limbo".
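To make "success rate" concrete, here is a rough sketch of how a set of test messages could be scored, counting a request as successful only if a reply arrives within a cutoff. This is not a description of our monitoring product: the `Probe` record, the function names, and the 5-minute cutoff are all assumptions chosen for illustration.

```python
# Hypothetical sketch: score short code test messages by whether a reply
# came back within a cutoff. All names and the cutoff are assumptions.
from dataclasses import dataclass
from typing import Optional, Sequence

REPLY_TIMEOUT_SECONDS = 300.0  # assumed cutoff: "a few minutes or less"


@dataclass
class Probe:
    sent_at: float             # time the test message was sent (seconds)
    reply_at: Optional[float]  # time the reply arrived, or None if it never came back


def is_success(probe: Probe, timeout: float = REPLY_TIMEOUT_SECONDS) -> bool:
    """A probe succeeds only if a reply arrived within the timeout."""
    if probe.reply_at is None:
        return False  # the message went into "limbo"
    return probe.reply_at - probe.sent_at <= timeout


def success_rate(probes: Sequence[Probe], timeout: float = REPLY_TIMEOUT_SECONDS) -> float:
    """Return the percentage of probes that received a timely reply."""
    if not probes:
        raise ValueError("need at least one probe to compute a success rate")
    successes = sum(1 for p in probes if is_success(p, timeout))
    return 100.0 * successes / len(probes)


if __name__ == "__main__":
    sample = [
        Probe(sent_at=0.0, reply_at=12.0),       # timely reply
        Probe(sent_at=600.0, reply_at=None),     # never came back
        Probe(sent_at=1200.0, reply_at=5200.0),  # reply arrived over an hour later
        Probe(sent_at=1800.0, reply_at=1830.0),  # timely reply
    ]
    print(f"success rate: {success_rate(sample):.1f}%")
```

In this made-up sample, two of the four probes get a timely reply, so the success rate is 50%. Note that a reply arriving hours late is counted the same as a reply that never arrives, which reflects the "reasonable expectation" argument above.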
Here is an example graph from a short code service that allows you to submit general information queries via a short code. The graph lines "dipping down" toward the bottom indicate that the success rate of the service is dropping. In this case, the last week's worth of data shows that this service is successful only 58% of the time on average. And it is from a service that just about all of us would recognize.
We are currently advising our customers not to assume that their SMS services are running at 99% or higher. In reality, very few of the ones we have seen are running at this level. Many are running down in the 90% range, and a few, like the one I have shown here, have major problems that need to be fixed.
If you are concerned about the availability of your mobile services, it is important to develop some type of strategy that will give you visibility into what is happening in the real world. I'm sure services like the one above went through extensive QA testing, but once a service is released out into the hands of real users, you might get a different result than what you established during pre-production testing.