Possibly one of the worst types of outages, one that all of us have experienced or will experience at some point in our careers, is the outage caused by an expired SSL certificate:
Windows Azure Storage experienced a worldwide outage impacting HTTPS traffic due to an expired SSL certificate. HTTP traffic was unaffected but the event impacted a number of Windows Azure services that are dependent on Storage. We executed the repair steps to update the SSL certificate on the impacted clusters and availability was restored to >99% worldwide by 1:00 AM PST on February 23. At 8:00 PM PST on February 23, we completed the restoration effort and confirmed full availability worldwide.
I'll be watching to see whether a root cause analysis ends up getting posted, but I have a few guesses as to what happened. According to the TechCrunch article "Microsoft To Refund Windows Azure Customers Hit By 12 Hour Outage That Disrupted Xbox Live" and the full post on the Windows Azure blog (linked in the title of this post), it took 12 hours to restore service to 99% of customers on the clusters impacted by the expired cert. That makes me think a good portion of that time was spent by the Ops folks trying to figure out what an unhelpful error message in the logs actually meant, followed by many hours of importing updated certs (or restarting app servers that were still caching the old certificates).
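This class of outage is also one of the easiest to catch ahead of time. As a sketch (not anything Azure actually ran), a monitoring job can connect to each endpoint, read the certificate's `notAfter` field, and alert well before expiry; the host names and thresholds below are illustrative assumptions:

```python
import socket
import ssl
from datetime import datetime, timezone

def days_remaining(not_after, now=None):
    """Days until a cert expires, given the 'notAfter' string from
    Python's ssl.getpeercert(), e.g. 'Feb 22 12:00:00 2013 GMT'."""
    expires = datetime.strptime(not_after, "%b %d %H:%M:%S %Y %Z")
    expires = expires.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return (expires - now).days

def check_endpoint(host, port=443, warn_days=30):
    """Fetch the TLS cert served by host:port and warn if it is
    close to expiry. Returns the number of days remaining."""
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=10) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()
    remaining = days_remaining(cert["notAfter"])
    if remaining < warn_days:
        print(f"WARNING: {host} cert expires in {remaining} days")
    return remaining

# Hypothetical list of endpoints to watch; run this on a daily cron.
if __name__ == "__main__":
    for host in ["example.blob.core.windows.net"]:
        check_endpoint(host)
```

Run daily, a check like this turns a worldwide HTTPS outage into a routine renewal ticket filed a month in advance.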