Yesterday, at around 3:45 PM EST, users of Microsoft’s Azure cloud computing platform began to experience problems worldwide. The problem apparently stemmed from an expired SSL certificate. The certificate was used by the Azure storage service, and its expiry had knock-on effects on other Azure services as well. The following message was posted on the Windows Azure Service Dashboard:
On Friday, February 22 at 12:44 PM PST, Storage experienced a worldwide outage impacting HTTPS traffic due to an expired SSL certificate. This did not impact HTTP traffic.
As I write this, at about 2:25 PM EST on Saturday, February 23, the Dashboard is still showing “Storage service degradation” across all regions. The most recent status update says:
We have executed repair steps to update SSL certificate on the impacted clusters and have recovered to over 99% availability across all sub-regions. We will continue monitoring the health of the Storage service and SSL traffic for the next 24 hrs. Customers may experience intermittent failures during this period.
Although there are many systems that have enviable records of reliability, occasional service outages are still something to be expected and planned for. In some cases, such as a natural disaster, it is possible to have considerable sympathy for the systems’ operators; forecasting rare events is difficult almost by definition (we assume the future will be like the past, because in the past, the future has been like the past).
It’s difficult for me to work up a lot of sympathy in this case, however. An SSL certificate has a well-defined expiration date, which is fixed when the certificate is issued and can be read by anyone who inspects it. In addition, the certificate in question appears to have been issued by “Microsoft Secure Server Authority”; in other words, Microsoft was unable to get a timely renewal of the certificate from itself. If I were a customer of the Azure service, I would not be too happy right now.
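
To illustrate how predictable this kind of failure is, here is a minimal sketch of an expiry check in Python, using only the standard library. It is not a description of anything Microsoft runs; the function names, the 30-day threshold, and the sample date in the usage note are my own assumptions, chosen for illustration.

```python
import socket
import ssl
import time


def days_until_expiry(not_after, now=None):
    """Return the number of days from `now` until the certificate expires.

    `not_after` is the expiry string in the form returned by
    SSLSocket.getpeercert(), e.g. "Feb 22 20:44:00 2013 GMT".
    `now` is seconds since the epoch; it defaults to the current time.
    """
    expires = ssl.cert_time_to_seconds(not_after)
    if now is None:
        now = time.time()
    return (expires - now) / 86400.0


def needs_renewal(not_after, threshold_days=30, now=None):
    """True if the certificate expires within `threshold_days` (assumed policy)."""
    return days_until_expiry(not_after, now) <= threshold_days


def fetch_not_after(host, port=443, timeout=10.0):
    """Connect to host:port over TLS and return the certificate's notAfter string.

    Uses the default verifying context, so a certificate that has already
    expired makes the handshake fail with an ssl.SSLError instead.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]
```

Run daily against each HTTPS endpoint, something like `needs_renewal(fetch_not_after(host))` would raise a flag weeks before the deadline. The date string in the docstring is only a placeholder, not the actual expiry time of the Azure certificate.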