This morning, AWS experienced a DNS outage affecting DynamoDB in the US-EAST-1 region. Services from Snapchat to Fortnite to United Airlines went dark. By most accounts, it was resolved within hours, and life went on.
On Hacker News, the conversation quickly turned philosophical. One comment captured the prevailing wisdom perfectly:
“Planning for an AWS outage is a complete waste of time and energy for most companies.”
I understand the sentiment. I’ve faced it as direct feedback, given that AWS outages are genuinely rare. The company maintains impressive uptime across its services, and when issues do occur, they’re usually resolved faster than most organizations could execute a disaster recovery plan. The math checks out; why invest significant resources preparing for an event that probably won’t impact you materially?
When everything goes wrong at once
Years ago, I was responsible for most of the infrastructure and data systems at Vacasa. We managed vacation rentals across the US and Central America, and used AWS as our cloud provider of choice. I maintained a standing agenda item in our annual business continuity planning reviews: multi-cloud deployment strategy and hot site backup provisions.
But I never gained much traction. The conversation always circled back to cost-benefit analysis, probability assessments, and the reality that AWS simply didn’t go down often enough to justify the investment.
I would always begrudgingly admit that the logic made sense. We had other priorities, tighter budgets, and more pressing infrastructure concerns.
Until we didn’t.
I don’t remember the exact dates anymore, but I vividly remember the sequence. A tropical storm was bearing down on the southeastern United States. Regional emergency management offices were coordinating evacuations. They reached out to us with an urgent ask: call your guests staying in affected areas and help them get out safely.
We wanted to help. Desperately. These were people staying in our properties, trusting our service, in harm’s way. The human factor overwhelmed any technical or logistical concerns. We had their contact information. We had support staff ready to make calls. We had everything we needed.
Except access to our systems.
At exactly the same time, AWS suffered an outage. Not a brief hiccup – a full outage. The database? Inaccessible. Our data warehouse? Locked. Direct database connections failed. Application access failed. Every pathway to our guest reservation data hit the same wall.
We sat on support calls with our Technical Account Manager. We escalated through multiple teams. The consistent message: sit tight and wait. Even the people trying to help us couldn’t do anything but wait for AWS to resolve the issue on their end.
The storm kept moving.
Finding a way
Desperation breeds creativity. One of our ETL systems, running on a still-accessible EC2 instance, hadn’t finished its sync. It was still holding a stale but recent cache of reservation data. The data was only a few hours old but, critically, contained exactly the information we needed.
Our data engineering team extracted the cached data locally and forwarded guest contact information to our support staff. We got the calls out. Guests evacuated. The storm knocked out power, water, communication infrastructure, and emergency services in several areas for days.
But thankfully, no one was hurt.
The lesson that keeps on teaching
Let me be clear: there was no legal requirement for Vacasa to make those calls. It was a request from emergency services, not a mandate. We could have said “sorry, systems are down” and moved on.
But we cared about our customers’ well-being. The possibility that someone could be hurt because we hadn’t tried everything possible was not something we were willing to accept.
So the team found a way.
This is where the Hacker News comment about AWS outage planning being a “waste of time” reveals its limitations. The comment isn’t wrong about probabilities; AWS outages are statistically rare. For most companies, the direct business impact of a few hours of downtime is measured in lost sales, frustrated users, and maybe brief social media anger. Annoying, but not catastrophic.
The danger lies in compound events.
Business continuity planning and enterprise risk analysis require imagining not just individual failure modes, but their convergence. What happens when multiple low-probability events occur simultaneously? What happens when your technical infrastructure fails at exactly the moment your ethical obligation to customers matters most?
A lost sale because of an outage is frustrating. A lost life because you couldn’t access data during a crisis is horrific.
Rethinking the calculation
I’m not suggesting every company needs a multi-cloud deployment strategy. The Hacker News commenter has a point: for many organizations, the cost genuinely exceeds the benefit. Maintaining parallel infrastructure across AWS and Google Cloud and Azure demands resources most companies don’t have and probably shouldn’t spend.
But business continuity isn’t binary.
What saved us during that tropical storm wasn’t a carefully architected multi-cloud strategy. It was an accidental cache in an ETL system and a data engineering team willing to get creative under pressure. I would never recommend this as a reliable fallback plan. Still, it highlights something important: understanding your system’s failure modes and having some concept of degraded operation matters even if you never build the “perfect” solution.
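To make “degraded operation” concrete, here’s a minimal sketch in Python of a deliberate version of our accidental fallback. The table, columns, DSN, and file paths are all hypothetical placeholders, and this is not what we ran at Vacasa: a scheduled job copies the handful of fields you’d need in a crisis into a local SQLite snapshot, and the read path falls back to that stale copy when the primary database is unreachable.

```python
"""A sketch of a degraded-mode fallback: periodically snapshot critical
contact data somewhere reachable even when the primary region is down,
and fall back to the stale copy when the primary can't be reached.

All names below (DSN, paths, table, columns) are illustrative placeholders.
"""
import sqlite3
import psycopg2  # assumes the primary is Postgres; swap in your own driver

PRIMARY_DSN = "postgresql://app@primary-db/prod"          # placeholder
SNAPSHOT_PATH = "/var/backups/reservations_snapshot.db"    # off the critical path

def refresh_snapshot() -> None:
    """Copy the few fields you'd need in a crisis into a local SQLite file.

    Run this on a schedule (cron, your ETL orchestrator, etc.).
    """
    src = psycopg2.connect(PRIMARY_DSN)
    try:
        cur = src.cursor()
        cur.execute(
            "SELECT reservation_id, guest_name, guest_phone, property_address "
            "FROM reservations WHERE check_out >= CURRENT_DATE"
        )
        rows = cur.fetchall()
    finally:
        src.close()

    with sqlite3.connect(SNAPSHOT_PATH) as dst:
        dst.execute("DROP TABLE IF EXISTS reservations_snapshot")
        dst.execute(
            "CREATE TABLE reservations_snapshot "
            "(reservation_id TEXT, guest_name TEXT, guest_phone TEXT, property_address TEXT)"
        )
        dst.executemany(
            "INSERT INTO reservations_snapshot VALUES (?, ?, ?, ?)", rows
        )

def active_guest_contacts():
    """Read path with a fallback: try the primary, then the stale snapshot."""
    try:
        conn = psycopg2.connect(PRIMARY_DSN, connect_timeout=5)
        try:
            cur = conn.cursor()
            cur.execute(
                "SELECT guest_name, guest_phone, property_address "
                "FROM reservations WHERE check_out >= CURRENT_DATE"
            )
            return cur.fetchall()
        finally:
            conn.close()
    except psycopg2.OperationalError:
        # Primary (and possibly the whole region) is unreachable: serve the
        # stale snapshot, and make sure callers know it may be hours old.
        with sqlite3.connect(SNAPSHOT_PATH) as local:
            cur = local.execute(
                "SELECT guest_name, guest_phone, property_address FROM reservations_snapshot"
            )
            return cur.fetchall()
```

The specific storage doesn’t matter much; what matters is that the fallback copy doesn’t share the same single point of failure as everything else, and that someone has thought about what “stale but available” means before the crisis arrives.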
Ask yourself: what would happen if AWS went dark for six hours right now? Not forever – just long enough to matter. Can you communicate with customers? Can you access critical data? Do you have any manual processes you could fall back on?
More importantly: what if that outage coincided with something else going wrong? A natural disaster. A vendor bankruptcy. A supply chain disruption. A regulatory emergency. The compounding of independent low-probability events creates scenarios that individually seem absurd but collectively demand consideration.
The human element
Part of why this experience stuck with me is that it wasn’t about revenue or SLAs or uptime metrics. It was about people being in danger and us having the information to help them but not the access.
That changes the calculation entirely.
Your business may never face life-and-death scenarios. That’s fine. But you probably face situations where the human impact of failure extends beyond inconvenience. Maybe it’s payroll data when employees need to pay rent. Maybe it’s healthcare information when someone needs urgent care. Maybe it’s financial records during an audit that could tank your company.
Understanding what really matters to the people depending on your service should inform how you think about continuity planning.
Plan accordingly. With a clear-eyed understanding of what matters, and of what you’d do if it all went wrong at once.
Because sometimes, it rains. And sometimes, it pours…