Resiliency

The second tragedy of the AT&T Bombing in Nashville

Justin VanWinkle

Dec 26, 2020 — 2 min read

In the wake of the Nashville bombing tragedy on Christmas morning, many folks across Tennessee and Kentucky were left without internet and phone connectivity. 911 lines were down. Flights had to be cancelled. Even as I'm writing this, more than 36 hours later, a significant number of people in that region are still experiencing an outage. This could have easily been avoided. But how?

The telecom industry held us hostage for several decades. They built poor infrastructure and held monopolies on lines, poles, and regions. We still feel the negative impact of that today. In fact, let's consider our posture in an extreme scenario: what if this had been a natural disaster and people were desperately trying to find their loved ones, but simply could not reach them? Here's my big question to you:

Why does the destruction of a single hub cause such widespread failure for so long?

There is absolutely no reason why this outage had to happen. And this is the basis of my writing. AT&T didn't have to move to the Cloud. They didn't have to implement the latest or hottest tech. No. It's much simpler than that. All they had to do was say, "What if...". It should not be a far-fetched idea that an entire data center could be destroyed. And AT&T should know long in advance how their system would behave if that ever happened. Netflix pioneered new methods for this type of testing. We have technology that allows us to deal with catastrophic failures more easily than ever. So why was AT&T not ready for this? Simply put, they either never asked themselves "What if...", or they did ask the question and didn't see it through to the end due to some constraints. Either way, this emphasizes the importance of testing in production and knowing how your system will behave in the event of a catastrophic failure.

So my advice to those of you in the software industry is this: If your system hosts critical components that lives depend on, go (safely) pull some plugs on your production system and monitor what happens. Use it as a chance to learn about your system and strengthen it. That way, you'll be prepared once the real catastrophe hits.
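Before touching production, the "What if..." question can be rehearsed on a model of your system. The sketch below is only an illustration under assumptions of my own: the `Network` class, the `what_if` helper, and the hub and region names are all hypothetical, and a real telecom topology is far more complex. It takes one hub offline at a time, reports which regions lose service, and then restores the hub.

```python
from dataclasses import dataclass, field


@dataclass
class Network:
    # Hypothetical model: each region maps to the hubs able to serve it.
    routes: dict[str, list[str]]
    down: set[str] = field(default_factory=set)

    def regions_without_service(self) -> list[str]:
        # A region is dark when every hub that could serve it is offline.
        return sorted(
            region
            for region, hubs in self.routes.items()
            if all(hub in self.down for hub in hubs)
        )


def what_if(network: Network, hub: str) -> list[str]:
    """Take one hub offline, report which regions go dark, then restore it."""
    already_down = hub in network.down
    network.down.add(hub)
    try:
        return network.regions_without_service()
    finally:
        # Always undo the experiment, even if the check above raises.
        if not already_down:
            network.down.discard(hub)


# Made-up topology for illustration only.
network = Network(
    routes={
        "region-1": ["hub-a"],
        "region-2": ["hub-a", "hub-b"],
        "region-3": ["hub-b"],
    }
)

for hub in ("hub-a", "hub-b"):
    print(hub, "->", what_if(network, hub))
```

In this toy topology, region-2 survives either failure because it has a second path, while region-1 and region-3 each depend on a single hub. Finding those single points of failure ahead of time is the whole point of the exercise.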

For more information on preparing for catastrophic events, you can get started over at the Principles of Chaos website. After that, I recommend reading some articles published by Netflix or picking up a book to deepen your understanding of the concept and to seek concrete steps for implementations of chaos testing to boost your system's resiliency.
