I was working on building and testing a Microsoft Flow a month ago, Thursday afternoon, May 2nd, to be exact. It was fairly complex and required a bit of tweaking; most of the runs were taking anywhere from 3 to 5 minutes end to end. I was running the latest (and, I hoped, final) version, but it was taking longer: 7 minutes, 8 minutes, 15 minutes… then when I refreshed, it wouldn't come back up.
My wife (who is a Dynamics 365 consultant) sent me an instant message to see if any of my Dynamics 365/Power Platform environments were down. A quick investigation showed that both Canadian and United States (North America) Dynamics 365 instances were not responding.
A further check-in with the MVP community showed that it was a rare worldwide outage, affecting Dynamics 365, PowerApps, Flow, SharePoint, Azure and other Microsoft cloud services. For some reason, Exchange kept working.
Twitter lit up as well, and Microsoft was quick to respond and investigate the issue. One particular tweet by @stevegoodman caught my attention:
“When #Office365 goes down, millions of system admins the world over scream “see, our infrastructure is more reliable!”. You know what I get to do now? NOTHING IT WILL BE FIXED. You know what I had to do when on-premises infrastructure failed? STAY UP ALL NIGHT FIXING IT. “
Stuff happens. Electricity goes out. Sometimes the Internet goes down. Winter storms, floods, flu outbreaks, the list goes on of things that can disrupt a business. A Cloud outage is just one of many things.
Is it inconvenient? Yes, absolutely. Does downtime cost business money in terms of lost productivity? Without a doubt. Then why would a cloud outage be a “good thing”?
Let us go back in time to 1997. I was working as a systems administrator at a company that specialized in (paper) office products and productivity training. We were running Windows NT 3.51 on a DEC Alpha. It's the end of the day. Boom. BSOD. That's "Blue Screen of Death," meaning Windows has crashed and is displaying a cryptic message on the screen.
In most cases, a restart will fix the issue, and Windows will come back up. Yes, Windows does come back up, but the Faircom database is now corrupt, meaning at least 12 hours of rebuilding the index files to get Great Plains Dynamics C/S+ back up and running. Imagine that, plus a phone ringing off the hook as the call center and the warehouse ask every 15 minutes when the system will be back up. The system was fixed and running again by 7:30am. The warehouse manager comes by and says, "You're in early…" and notices the bloodshot eyes and the same clothes…
Fast forward to 2002. The System Administrator wants to update the Exchange Server (running Windows NT 4.0, no upgrade to Windows 2000 in the budget) with an operating system service pack. Full backups are done, and staff are notified that email will be under maintenance for the evening but should be up in a couple of hours. One mistaken button click (the server has multiple processors) and Windows attempts to apply the single-processor service pack instead of the multiple-processor one. Windows gets corrupted. Then there's some issue with the Exchange backup. Thankfully, the Business Apps/Database Administrator (that's me), who was also running some maintenance during the downtime, remembers a hacker maneuver: installing Windows on a separate partition allows a parallel bootup, which enables the recovery of some very vital files. The system gets recovered and running at around 3am. The service pack is applied (correctly this time). By 5am, we roll out.
Let's now go to May 2, 2019. Dynamics 365 Online and the Portal are down because of an apparent DNS issue on Microsoft's cloud. The training manager needs to add a group of new students to the system. I send an email to the team, asking everyone to be patient. I sit back and focus on some other work. 20 minutes later, everything is back up and running. I start making dinner and crack a cold one.
Anyone working in IT has stories from the trenches where things went off the rails. Many of us had to spend long hours in cold server rooms on the phone with support technicians potentially a world away working our way through the issues that were severely impacting a business. There are stories of heroics where the system gets recovered, and work continues. There are also stories where someone has to tell the CEO that thousands to millions of dollars worth of data are lost.
Whether you have a 10-person staff with a few simple line-of-business apps or a 500-person call center processing thousands of orders a day, if the cloud goes down, whether it be Microsoft, Amazon, Google or Salesforce, these cloud companies will immediately deploy an army of engineers to isolate and resolve the problem, find the root cause, and apply the resources to make sure it doesn't happen again. Many times before you even notice the issue.
This same army is also applying updates, security patches and bug fixes, which means you don't need to worry about inadvertently applying the wrong service pack. It also means that end users have access to the latest and greatest version of the software.
What would you prefer? An army that will get your system back up and running in 20 minutes, or already-overworked staff or expensive third-party techs working through the night (at overtime rates) in hopes that the system can be recovered, with no data loss, by the start of business the next day?
Sometimes it's useful to have that reminder. That's not to say there are no disasters in the cloud (there are). However, any organization where IT is not a core business, but rather an operating function, should not have to worry about hardware, networking, or trusting millions of dollars of data to rust-covered plastic tape.
Nick Doelman is a Microsoft Business Solutions MVP who has had his fair share of late night system recovery marathons in cold server rooms. He now questions the sanity of anyone considering buying an actual physical server.