As an IT Systems Manager in the trenches for the last 15+ years, I have (unfortunately) run into each of the most common IT outage issues. In fact, not only have I experienced each of these situations firsthand, but in some cases I’ve had to handle multiple occurrences over the years. Statistically speaking, it isn’t a matter of if but when one of these scenarios occurs. So, when one (or more) of these common IT outages occurs, the most important question to ask is: “Can your organization recover?” And if so, “How quickly will you be operational and back in business?”
Hardware Failure
Even if your systems are under warranty and in peak operating condition, hardware failures can occur at any time. Computers and network components are complex pieces of hardware containing hundreds if not thousands of intricate parts that work together. Each of those parts has to work correctly or an overall “system failure” can occur. In one batch of newly leased machines, I found bad capacitors on the motherboards, straight out of the box. While the manufacturer’s warranty covered these failures, I had to wait for new machines to arrive before I could replace those that had failed. Having an up-to-date warranty and newer equipment can give IT Managers some peace of mind. But ultimately, having a physical or cloud solution in place for quick recovery is the only way to ensure minimal downtime from an IT hardware failure.
Software Failure
Years ago I had a server crash hard and refuse to boot back up after a routine Windows Update installation. It turned out that a conflict between some .dll file versions was the culprit. In this case, copying over the correct file versions from a command prompt and repairing some startup/boot files saved the day, but this won’t fix every software failure. Putting system monitoring in place can help administrators become aware of potential issues before they halt the system. Having current and verified restore images for your physical or virtual servers and data is vital to recovering from this type of IT failure.
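As one small illustration of the kind of monitoring mentioned above, here is a minimal sketch of a disk-usage check in Python. The function names and the 80%/90% thresholds are my own assumptions for the example, not settings from any particular monitoring product; a real setup would track many more signals (event logs, service health, failed updates) and send alerts somewhere.

```python
import shutil


def disk_status(total_bytes: int, used_bytes: int,
                warn_at: float = 80.0, critical_at: float = 90.0) -> str:
    """Classify disk usage as 'ok', 'warning', or 'critical'.

    The threshold percentages are illustrative assumptions; tune them
    to your own environment.
    """
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    percent_used = 100.0 * used_bytes / total_bytes
    if percent_used >= critical_at:
        return "critical"
    if percent_used >= warn_at:
        return "warning"
    return "ok"


def check_path(path: str = ".") -> str:
    """Check the volume that holds `path` using the standard library."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free
    return disk_status(usage.total, usage.used)


if __name__ == "__main__":
    # Prints one of: ok, warning, critical
    print(check_path("."))
```

Run on a schedule, a check like this can flag a volume that is filling up before an update or a database runs out of room.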
Data Corruption
About a year ago, the building my primary company is housed in was under construction to accommodate some new tenants. During this time the entire building experienced several months of intermittent brown-outs. As a result, we experienced damage to some of our IT infrastructure, including the loss of a dozen or so office UPS units. After these repeated power fluctuations, one of the main data servers had been interrupted so many times that some of its data became corrupted, including the restore images. Luckily, we had offsite copies of the server images and were able to restore the server back to its original state, data included. Proactive and ongoing infrastructure monitoring may help to prevent data corruption in certain situations as well.
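One way to catch a corrupted restore image before you need it is to record a checksum when the image is created and compare it again later, both onsite and offsite. The sketch below assumes you stored a SHA-256 digest at backup time; the function names are hypothetical, and many backup products offer their own built-in verification that you should prefer where available.

```python
import hashlib


def sha256_of(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks so that
    large image files do not have to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as image:
        for chunk in iter(lambda: image.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def image_is_intact(path: str, expected_hex: str) -> bool:
    """Compare the file's current digest with the one recorded when the
    image was created (assumed to be stored separately from the image)."""
    return sha256_of(path) == expected_hex.strip().lower()
```

A matching checksum only shows that the file has not changed since the digest was recorded. It does not prove the image will boot, so periodic test restores are still worth doing.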
Cloud Services Outage
So what if you lose your company’s connection to the internet, and possibly some of your vital operational cloud services? About 6 months ago, a bad rain and hail storm knocked out a transformer that supplied our Internet Service Provider (ISP). As a result, we lost internet connectivity for several hours until service could be restored, which meant we lost access to a shared web application and other web portals used to complete daily tasks. Our one saving grace was that we had a backup ISP in place, which allowed us access outside the building. With a redundant infrastructure setup through a mirrored firewall and an alternate ISP, our cloud connectivity was restored in half the time it would otherwise have taken.
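In practice the firewall or router handles the switch to the alternate ISP, but the idea behind it can be sketched in a few lines of Python: probe each link in priority order and use the first one that responds. Everything named here is an assumption for illustration, including the `tcp_probe` helper, the probe target, and the notion that each ISP link is reachable by binding to a different local source address (which depends on how your routing is set up).

```python
import socket
from typing import Callable, Optional, Sequence, Tuple


def tcp_probe(source_ip: str,
              target: Tuple[str, int] = ("1.1.1.1", 443),
              timeout: float = 3.0) -> bool:
    """Try a TCP connection to `target` from the given local address.

    The target is a placeholder; pick a host that matters to your business.
    """
    try:
        with socket.create_connection(target, timeout=timeout,
                                      source_address=(source_ip, 0)):
            return True
    except OSError:
        return False


def first_healthy_link(links: Sequence[str],
                       probe: Callable[[str], bool] = tcp_probe) -> Optional[str]:
    """Return the first link (in priority order) whose probe succeeds,
    or None if every link is down."""
    for source_ip in links:
        if probe(source_ip):
            return source_ip
    return None


if __name__ == "__main__":
    # Hypothetical local addresses for the primary and backup ISP links.
    active = first_healthy_link(["192.0.2.10", "198.51.100.10"])
    print(active or "no working internet link")
```

Keeping the probe as a parameter makes the selection logic easy to test without touching the network, and lets you swap in a check against the specific cloud service you depend on.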
Loss of Power
And a final common IT outage, which continues to be an ongoing personal nemesis in my work life, is a loss of power to your business. Unfortunately, it is the cause of the IT outages I experience most frequently. About 6 months ago, I had an especially unusual experience with loss of power at one of the sister companies I support. The building had experienced a lightning strike to one of the roof-mounted AC units. In turn, smoke filled the vents and we experienced a building-wide power outage (and a call to the fire department). This experience served as a catalyst for disaster recovery process improvements as we struggled through an outage lasting 3 days. We considered getting a generator large enough to power our warehouse operations, but it was still cost prohibitive. The ideal solution to this issue, if feasible for your organization, would be to have a secondary business location nearby or some sort of backup power source.
Like this last scenario, there will be some situations that are simply out of your control and that challenge your company’s ability to stay operational. However, with a solid disaster recovery plan that includes redundant hardware, up-to-date backups and system images, and secondary systems, your company can be back in business in no time when an everyday or major disaster strikes.