The Microsoft Cloud Briefly Went Down: A Good Reminder of Why the Cloud is a Good Thing.

I was building and testing a Microsoft Flow a month ago; Thursday afternoon, May 2nd, to be exact. It was fairly complex and required a bit of tweaking, and most of the runs were taking anywhere from 3 to 5 minutes end to end. I was running the latest (and, I was hoping, final) version, but it was taking longer: 7 minutes, 8 minutes, 15 minutes… Then, when I refreshed, it wouldn’t come back up.

My wife (who is a Dynamics 365 consultant) sent me an instant message to see if any of my Dynamics 365/Power Platform environments were down. A quick investigation showed that both Canadian and United States (North America) Dynamics 365 instances were not responding.

A further check-in with the MVP community showed that it was a rare worldwide outage, affecting Dynamics 365, PowerApps, Flow, SharePoint, Azure and other Microsoft cloud services. For some reason, Exchange kept working.

Twitter lit up as well, and Microsoft was quick to respond and investigate the issue. One particular tweet by @stevegoodman caught my attention:

“When #Office365 goes down, millions of system admins the world over scream “see, our infrastructure is more reliable!”. You know what I get to do now? NOTHING IT WILL BE FIXED. You know what I had to do when on-premises infrastructure failed? STAY UP ALL NIGHT FIXING IT. “

Stuff happens. Electricity goes out. Sometimes the Internet goes down. Winter storms, floods, flu outbreaks: the list of things that can disrupt a business goes on. A cloud outage is just one of many.

Is it inconvenient? Yes, absolutely. Does downtime cost business money in terms of lost productivity? Without a doubt. Then why would a cloud outage be a “good thing”?

War Stories

Let us go back in time. 1997. I was working as a systems administrator at a company that specialized in (paper) office products and productivity training. We were running Windows NT 3.51 on a DEC Alpha. It’s the end of the day. Boom. BSOD. That’s “Blue Screen of Death,” meaning Windows has crashed and displays a cryptic message on the screen.

Dreaded BSOD

In most cases, a restart will fix the issue, and Windows will come back up. Yes, Windows does come back up, but the Faircom database is now corrupt, meaning at least 12 hours of rebuilding the index files to get Great Plains Dynamics C/S+ back up and running. Imagine that, plus a phone ringing off the hook with the call center and the warehouse asking every 15 minutes when the system will be back up. The system was fixed and back up and running by 7:30am. The warehouse manager comes by and says, “You’re in early…” and notices the bloodshot eyes and the same clothes…

I know you haven’t slept in 48 hours but any idea when I can log in to Facebook again?

Fast forward to 2002. The System Administrator wants to update the Exchange Server (running Windows NT 4.0, no upgrade to Windows 2000 in the budget) with an operating system service pack. Full backups are done, and staff are notified that email will be under maintenance for the evening but should be up in a couple of hours. One mistaken button click (the server has multiple processors) and the Windows update attempts to apply a single-processor service pack (as opposed to the multiple-processor service pack). Windows gets corrupted. There is some issue with the Exchange backup. Thankfully, the Business Apps/Database Administrator (that’s me), who was also running some maintenance during this downtime, remembers a hacker maneuver that involves installing Windows on a separate partition, allowing a parallel bootup which enables the recovery of some very vital files. The system gets recovered and is running at around 3am. The service pack is applied (correctly this time). By 5am, we roll out.

Today

Let’s now go to May 2, 2019. Dynamics 365 Online and the Portal are down because of an apparent DNS issue on Microsoft’s cloud. The training manager needs to add a group of new students to the system. I send an email to the team, asking everyone to be patient. I sit back and focus on some other work. 20 minutes later, everything is back up and running. I start making dinner and crack a cold one.

The best recipe for a cloud outage

Summary

Anyone working in IT has stories from the trenches where things went off the rails. Many of us had to spend long hours in cold server rooms, on the phone with support technicians potentially a world away, working our way through issues that were severely impacting a business. There are stories of heroics where the system gets recovered, and work continues. There are also stories where someone has to tell the CEO that thousands to millions of dollars’ worth of data are lost.

Whether you have a 10-person staff with some simple line-of-business apps or a 500-person call center processing thousands of orders a day, if the cloud goes down, whether it be Microsoft, Amazon, Google or Salesforce, these cloud companies will immediately deploy an army of engineers to isolate and resolve the problem, find the root cause, and have the resources to make sure it doesn’t happen again. Many times this happens before you even notice the issue.

This same army is also applying updates, security patches and bug fixes, which means you don’t need to worry about inadvertently applying the wrong service pack. This also means that end users have access to the latest and greatest version of the software.

What would you prefer? An army that will get your system back up and running in 20 minutes, or already overworked staff or expensive 3rd party techs working through the night (at overtime rates) in hopes that a system can be recovered by the start of business the next day with no data loss?

Army ready to fix the cloud, serving all businesses, large and small

Sometimes it is useful to have that reminder. That’s not to say there will not be disasters with the cloud (there are). However, any organization where IT is not a core business, but rather an operating function, should not have to worry about hardware, networking or trusting millions of dollars of data to rust-covered plastic tape.

Photo by redcharlie on Unsplash

Photo by Petr Sonnenschein on Unsplash

Photo by Chuanchai Pundej on Unsplash

Photo by amir shamsipur on Unsplash

Nick Doelman is a Microsoft Business Solutions MVP who has had his fair share of late night system recovery marathons in cold server rooms. He now questions the sanity of anyone considering buying an actual physical server.
