In the wake of the Nashville bombing tragedy on Christmas morning, many folks across Tennessee and Kentucky were left without internet and phone connectivity. 911 lines were down. Flights had to be cancelled. Even as I'm writing this, more than 36 hours later, a significant number of people in that region are still experiencing an outage. This could have easily been avoided. But how?
The telecom industry held us hostage for several decades. They built poor infrastructure and held monopolies on lines, poles, and regions. We still feel the negative impact of that today. In fact, let's consider where we would stand in an extreme scenario: what if this had been a natural disaster and people were desperately trying to find their loved ones, but had no way to contact them? Here's my big question to you:
Why does the destruction of a single hub cause such widespread failure for so long?
There is absolutely no reason why this outage had to happen. And this is the basis of my writing. AT&T didn't have to move to the Cloud. They didn't have to implement the latest or hottest tech. No. It's much simpler than that. All they had to do was ask, "What if...?" It should not be a far-fetched idea that an entire data center could be destroyed, and AT&T should know long in advance how their system would behave if that happens. Netflix pioneered new methods for this type of testing (Chaos Monkey, its tool for randomly terminating production instances, is the best-known example). We have technology that allows us to deal with catastrophic failures more easily than ever. So why was AT&T not ready for this? Simply put, they either never asked themselves "What if...?", or they did ask the question and didn't see it through to the end due to some constraints. Either way, this emphasizes the importance of testing in production and knowing how your system will behave in the event of a catastrophic failure.
So my advice to those of you in the software industry is this: If your system hosts critical components that lives depend on, go (safely) pull some plugs on your production system and monitor what happens. Use it as a chance to learn about your system and strengthen it. That way, you'll be prepared once the real catastrophe hits.
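Before pulling real plugs, the "What if..." question can be rehearsed on paper. Here is a minimal Python sketch that takes a toy network, knocks out one hub at a time, and reports which sites lose their path to the core. Everything in it is hypothetical: the topology, the node names (`core`, `hub_a`, `site_1`, and so on), and the helper functions are made up for illustration and have nothing to do with AT&T's actual network. It is a model, not a substitute for a controlled experiment on a live system.

```python
from collections import deque


def reachable(links, start, failed=frozenset()):
    """Return the set of nodes reachable from start, skipping failed nodes."""
    if start in failed:
        return set()
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in links.get(node, ()):
            if neighbor not in failed and neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen


def single_points_of_failure(links, core, sites):
    """Map each hub whose loss strands at least one site to the stranded sites."""
    hubs = set(links) - {core} - set(sites)
    report = {}
    for hub in sorted(hubs):
        # "Pull the plug" on one hub and see what the core can still reach.
        alive = reachable(links, core, failed={hub})
        stranded = sorted(s for s in sites if s not in alive)
        if stranded:
            report[hub] = stranded
    return report


# Hypothetical topologies: one hub versus two redundant hubs.
fragile = {
    "core": ["hub_a"],
    "hub_a": ["core", "site_1", "site_2"],
    "site_1": ["hub_a"],
    "site_2": ["hub_a"],
}
redundant = {
    "core": ["hub_a", "hub_b"],
    "hub_a": ["core", "site_1", "site_2"],
    "hub_b": ["core", "site_1", "site_2"],
    "site_1": ["hub_a", "hub_b"],
    "site_2": ["hub_a", "hub_b"],
}
sites = ["site_1", "site_2"]

fragile_report = single_points_of_failure(fragile, "core", sites)
redundant_report = single_points_of_failure(redundant, "core", sites)
print(fragile_report)
print(redundant_report)
```

In the fragile layout, losing `hub_a` strands both sites. In the redundant layout, no single hub failure strands anyone. A real experiment would inject the failure into a running system and watch the monitoring, but even a model this small makes the single point of failure hard to ignore.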
For more information on preparing for catastrophic events, you can get started over at the Principles of Chaos website. After that, I recommend reading some articles published by Netflix or picking up a book to deepen your understanding of the concept and to find concrete steps for implementing chaos testing and boosting your system's resiliency.