One of the advertised benefits of cloud computing is high availability and redundancy. Back in April, however, some of Amazon’s cloud storage services suffered an outage that lasted for about three days, bringing down websites of several high-profile customers.
The initial problem was quickly fixed, but oddly enough, the extended outage was caused by the cloud management software attempting to prevent data loss. Amazon essentially performed a denial-of-service attack on its own storage servers, and recovering from it took three days.
This event brings up an inherent problem with cloud computing: complexity. As a programmer, I know that error-handling code tends to go untested (or gets only minimal testing), either because it can be difficult to create the errors necessary to exercise the code, or because doing so takes too much time and money in a competitive business environment. It’s obvious that Amazon did not test for the type of situation that occurred on April 21st. The linked article argues that cloud computing systems have much more complexity than the individual systems in a non-cloud environment would. To prevent these types of outages from happening in the future, cloud providers will have to learn to deal better with that complexity.
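One classic way to keep error-handling code from turning retries into a self-inflicted denial of service is exponential backoff with jitter. The sketch below is purely illustrative (it is not Amazon's actual recovery logic, and the function and parameter names are my own invention); it shows, in Python, how spacing out and randomizing retries keeps many failing clients from hammering a recovering service in lockstep.

```python
import random
import time

def call_with_backoff(operation, max_attempts=5, base_delay=0.1, max_delay=10.0):
    """Retry a failing operation with exponential backoff and jitter.

    Immediate, unbounded retries from many clients at once can overwhelm
    a recovering service -- the "retry storm" pattern behind self-inflicted
    denial-of-service outages.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except IOError:
            if attempt == max_attempts - 1:
                raise  # Out of attempts; surface the error to the caller.
            # Exponential backoff: base_delay, 2x, 4x, ... capped at max_delay.
            delay = min(base_delay * (2 ** attempt), max_delay)
            # Full jitter spreads retries out so clients don't synchronize.
            time.sleep(random.uniform(0, delay))
```

The jitter step matters as much as the backoff: without it, clients that failed at the same moment all retry at the same moment, recreating the original load spike on every attempt.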
Thanks to Josh for this topic.