Another AWS Outage. The Lesson Isn't 'Go Multi-Cloud' – It's That Resilience Is an Architecture Decision You Keep Deferring.

After every major outage the same advice circulates: go multi-cloud. For most organisations, the resilience that matters is an architecture decision, not a procurement one.

Amazon Web Services had another difficult morning this week. On Tuesday, reports of failures climbed across the usual tracking sites, services slowed or stopped responding, and the familiar question went round boardrooms and engineering channels again: are we too dependent on one provider?

It is a fair question. It is also, for most organisations, the wrong one to lead with. The reflex after every high-profile incident is to reach for multi-cloud – run everything across two or three providers so that no single one can take you down. The instinct feels prudent. In practice it is one of the most expensive answers to a problem that usually has a cheaper, more effective solution sitting closer to home: how your own systems are built to behave when a dependency fails.

The detail that gets lost in the noise is this. The outages that make headlines are rarely the whole cloud going dark. They are almost always scoped to one region, and frequently to a single data centre within it. Whether that regional wobble becomes a company-wide outage for you was decided long before the incident – in architecture choices made months or years earlier.

The October 2025 AWS outage took down thousands of companies across more than 60 countries and generated over three million user reports in the United States alone. Almost all of it traced back to problems in a single region.

That region was us-east-1, in Northern Virginia – Amazon's oldest, busiest, and by reputation least reliable. The root cause was a DNS fault affecting a core database service, which then cascaded through everything that depended on it. The lesson most people took away was that the cloud is fragile. The more useful lesson was about concentration: too many services, from too many organisations, sitting in one place with no plan for what happens when that place has a bad day.

The reaction is always the same. So is the mistake.

Multi-cloud has its place. For some workloads – regulated services with hard sovereignty requirements, or genuinely critical systems where the cost of downtime dwarfs the cost of duplication – running across providers is the right call. But adopting it as a blanket response to outage anxiety carries a real bill. You take on duplicated infrastructure, two or three sets of tooling and skills, data synchronisation across providers, and egress charges every time information crosses a boundary. You also inherit a harder operational problem: wherever a workload ends up needing every provider to be up at once, the weakest of them sets your reliability ceiling, and your team has to stay fluent in all of them.

The question worth asking first is simpler and cheaper to answer. When your primary provider has a bad morning, does your application degrade gracefully, or does it fall over completely? That is not a procurement decision. It is an architecture decision – and one many organisations have quietly deferred because single-region, single-zone designs were faster to ship and cheaper to run.

What "the cloud is down" usually means

Cloud providers divide their capacity into regions, and each region into Availability Zones – groups of one or more physically separate data centres with independent power and networking. The whole point of that design is to let you survive the loss of one zone, or one region, without going dark. Most incidents are contained to that level.

A useful example came earlier in 2026. When AWS had an event in us-east-1 in May, analysis afterwards showed it was confined to a single Availability Zone. Customers who had spread their workloads across multiple zones – the documented best practice – barely noticed. Those running everything in the affected zone took the full hit. The provider behaved exactly as designed. The difference in outcome came entirely from the customer side.
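As a rough illustration of that difference, the sketch below measures how much capacity a service would lose if its busiest zone failed. The inventory format and zone labels are hypothetical placeholders, not the output of any AWS API; in practice the list would come from whatever inventory you already keep.

```python
from collections import Counter

def zone_exposure(instance_zones):
    """Return (zone, fraction) for the zone whose loss removes the most capacity."""
    if not instance_zones:
        raise ValueError("no instances to assess")
    counts = Counter(instance_zones)
    zone, count = counts.most_common(1)[0]
    return zone, count / len(instance_zones)

# Hypothetical inventories: one zone label per running instance.
single_zone = ["us-east-1a"] * 6
spread = ["us-east-1a", "us-east-1b", "us-east-1c"] * 2

print(zone_exposure(single_zone))  # everything sits in one zone
print(zone_exposure(spread))       # losing any one zone removes a third
```

A result close to 1.0 means one zone failure takes the whole service with it; a result close to one divided by the number of zones means the workload is evenly spread.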

This is the part the SLA does not cover. A provider's uptime commitment applies to its own components. It says nothing about whether your application keeps serving customers when one of those components misbehaves. Resilience is not something you buy with a cloud contract. It is something you design, build, and test on top of it. When a major provider sneezes, the businesses that catch a cold are usually the ones that built a single point of failure into their own design and assumed someone else would handle it.
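The gap between a component's SLA and your application's availability can be made concrete with some back-of-envelope arithmetic. The sketch below assumes failures are independent, which real incidents often violate, and the figures are illustrative rather than taken from any provider's published commitments.

```python
from math import prod

def serial(availabilities):
    """Every component must be up at once, so availabilities multiply."""
    return prod(availabilities)

def redundant(availability, copies):
    """Any one of `copies` independent replicas is enough to keep serving."""
    return 1 - (1 - availability) ** copies

# Illustrative figures only.
five_in_a_chain = serial([0.999] * 5)   # five dependencies, each at 99.9%
two_zone_copy = redundant(0.99, 2)      # one 99% component, run in two zones

print(f"{five_in_a_chain:.4%}")
print(f"{two_zone_copy:.4%}")
```

The first number is lower than any single component's figure, because each dependency in the request path multiplies the risk. The second is higher than the component's own figure, which is the arithmetic behind spreading across zones.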

What actually needs to change

None of this requires abandoning your provider or doubling your infrastructure bill. The work is mostly in how you architect, and how honestly you test.

Design for blast radius, not for the SLA

Spreading across multiple Availability Zones should be the floor, not the ceiling. Beyond that, the pattern that pays off at scale is cellular architecture: partitioning your system so that each cell serves a slice of traffic independently, and a failure in one cannot spread to the rest. The aim is to make the question "what is the largest thing that can fail at once?" have a small, known answer – rather than discovering it live.
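One minimal way to picture cell routing is a stable hash that pins each customer to a single cell, so that a failed cell affects only its own slice. The cell names, the customer ID format, and the choice of a simple modulo hash are all assumptions made for illustration, not a recommended production design.

```python
import hashlib

CELLS = ["cell-a", "cell-b", "cell-c", "cell-d"]  # hypothetical cell names

def cell_for(customer_id, cells=CELLS):
    """Pin a customer to one cell using a stable hash of their ID."""
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    return cells[int.from_bytes(digest[:8], "big") % len(cells)]

def affected_fraction(customer_ids, failed_cell):
    """Share of customers whose requests land on the failed cell."""
    hit = sum(1 for c in customer_ids if cell_for(c) == failed_cell)
    return hit / len(customer_ids)

# Hypothetical customer population.
customers = [f"customer-{n}" for n in range(1000)]
print(affected_fraction(customers, "cell-a"))  # roughly a quarter with four cells
```

With four cells the largest thing that can fail at once is about a quarter of customers. Note that plain modulo hashing reshuffles most customers when the number of cells changes; a real design would usually use a lookup table or consistent hashing instead.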

Resilience you have not tested is a hope, not a capability

Plenty of organisations have a failover design on paper and have never run it under realistic conditions. The first time you exercise a failover should not be during a genuine incident, with customers watching. Scheduled failure testing – game days, controlled fault injection, regular disaster-recovery drills – is what turns an architecture diagram into something you can rely on. The teams that recover quickly are the ones that have practised.
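Controlled fault injection can start very small. The sketch below wraps a dependency call so that a chosen share of calls fail, then counts what the caller experiences. `lookup_price` is a hypothetical stand-in for a real downstream call, and the wrapper is an illustration of the idea rather than a substitute for a proper chaos-testing tool.

```python
import random

class FaultInjector:
    """Wraps a callable and makes a configurable share of calls fail."""

    def __init__(self, func, failure_rate, seed=0):
        self.func = func
        self.failure_rate = failure_rate
        self.rng = random.Random(seed)  # seeded so a drill can be repeated

    def __call__(self, *args, **kwargs):
        if self.rng.random() < self.failure_rate:
            raise ConnectionError("injected fault")
        return self.func(*args, **kwargs)

def lookup_price(sku):
    # Hypothetical stand-in for a real downstream dependency.
    return {"sku": sku, "price": 10}

def run_drill(call, attempts=100):
    """Count how many calls succeed and how many surface an injected error."""
    ok = failed = 0
    for n in range(attempts):
        try:
            call(f"sku-{n}")
            ok += 1
        except ConnectionError:
            failed += 1
    return ok, failed

print(run_drill(FaultInjector(lookup_price, failure_rate=1.0)))  # dependency fully down
print(run_drill(FaultInjector(lookup_price, failure_rate=0.0)))  # dependency healthy
```

Running the same drill against the code path your customers actually use, rather than the bare dependency, shows whether the fallback you designed really absorbs the failures.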

Graceful degradation beats heroic recovery

A well-designed system bends before it breaks. If a downstream dependency disappears, the better outcome is reduced service – read-only mode, cached responses, writes queued for later – rather than a blank error page. Designing for partial function means a dependency failure costs you a feature for an hour, not your entire platform. That is a deliberate engineering choice, and it has to be made before the incident, not improvised during it.
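A minimal sketch of that behaviour, assuming a dependency that raises `ConnectionError` when it is unavailable: reads fall back to a recent cached value and writes are queued for replay. The class name, the five-minute staleness limit, and the in-memory `backend` are hypothetical, and a real queue would need to be durable.

```python
import time
from collections import deque

class DegradingClient:
    """Serves cached reads and queues writes while a dependency is down."""

    def __init__(self, fetch, store, max_stale_seconds=300):
        self.fetch = fetch
        self.store = store
        self.max_stale_seconds = max_stale_seconds
        self.cache = {}               # key -> (value, time fetched)
        self.pending_writes = deque()

    def read(self, key):
        try:
            value = self.fetch(key)
            self.cache[key] = (value, time.monotonic())
            return value, "live"
        except ConnectionError:
            if key in self.cache:
                value, fetched_at = self.cache[key]
                if time.monotonic() - fetched_at <= self.max_stale_seconds:
                    return value, "cached"
            raise  # nothing usable cached, so the failure is surfaced

    def write(self, key, value):
        try:
            self.store(key, value)
            return "stored"
        except ConnectionError:
            self.pending_writes.append((key, value))
            return "queued"

    def replay(self):
        """Retry queued writes in order; an error leaves the rest queued."""
        while self.pending_writes:
            key, value = self.pending_writes[0]
            self.store(key, value)
            self.pending_writes.popleft()

# Hypothetical in-memory dependency that can be switched off.
backend = {"up": True, "data": {"basket": ["book"]}}

def fetch(key):
    if not backend["up"]:
        raise ConnectionError("dependency unavailable")
    return backend["data"][key]

def store(key, value):
    if not backend["up"]:
        raise ConnectionError("dependency unavailable")
    backend["data"][key] = value

client = DegradingClient(fetch, store)
print(client.read("basket"))            # live read, which also fills the cache
backend["up"] = False
print(client.read("basket"))            # served from cache
print(client.write("note", "gift"))     # queued rather than lost
backend["up"] = True
client.replay()
print(backend["data"]["note"])
```

The caller is told whether a value is live or cached, so the interface can say so honestly instead of presenting stale data as current.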

Your hidden single points of failure are usually shared services

The dependencies that cause the worst surprises are rarely the obvious ones. They are the shared, often invisible services everything quietly relies on: DNS, identity and authentication, certificate management, and so-called global services that in fact live in one region. The us-east-1 pattern recurs precisely because so many control planes and "global" features are anchored there. Map your real dependency graph, including the parts you do not operate yourself, and you will usually find a handful of load-bearing assumptions nobody has questioned.
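Once the dependency map exists in any machine-readable form, finding the shared load-bearing pieces is a short graph walk. The sketch below uses an entirely hypothetical map of services and reports the dependencies that every customer-facing entry point relies on, directly or indirectly.

```python
def transitive_deps(graph, service, seen=None):
    """All dependencies reachable from `service`, direct or indirect."""
    seen = set() if seen is None else seen
    for dep in graph.get(service, []):
        if dep not in seen:
            seen.add(dep)
            transitive_deps(graph, dep, seen)
    return seen

def shared_dependencies(graph, entry_points):
    """Dependencies that every entry point relies on, directly or not."""
    reach = [transitive_deps(graph, s) for s in entry_points]
    return set.intersection(*reach) if reach else set()

# Hypothetical dependency map: service -> the things it calls.
graph = {
    "checkout": ["payments", "auth"],
    "search": ["catalogue", "auth"],
    "admin": ["auth", "reporting"],
    "payments": ["dns"],
    "catalogue": ["dns"],
    "auth": ["dns", "certificates"],
    "reporting": [],
}

print(sorted(shared_dependencies(graph, ["checkout", "search", "admin"])))
# ['auth', 'certificates', 'dns']
```

In this made-up example none of the three entry points lists DNS or certificates directly, yet all of them fail if either does. That is the kind of assumption the mapping exercise is meant to surface.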

Treat multi-cloud as a deliberate trade-off, not a default

If multi-cloud earns its keep for a specific workload, do it with eyes open about the cost. For everyone else, the better-value position is often single-cloud across multiple regions, combined with portability where it counts. Containerisation and orchestration with Kubernetes, infrastructure defined as code, and open data formats lower the cost of switching or spreading later – without forcing you to operate everything twice today. You keep the option without paying full price for it before you need it.

The broader picture

Resilience is, at root, a business decision expressed in technical terms. The real questions are how much downtime each service can tolerate, how much data loss is acceptable if the worst happens, and what the business is willing to spend to close that gap. Answer those – your recovery time and recovery point objectives, service by service – and the architecture follows. Skip them, and you end up either over-engineering systems nobody needs to be that reliable, or under-protecting the ones that genuinely matter.
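To make that service-by-service comparison concrete, the sketch below checks measured recovery figures against stated objectives. The service names and numbers are invented, and treating the backup interval as the worst-case data loss is a deliberate simplification.

```python
from dataclasses import dataclass

@dataclass
class Objective:
    service: str
    rto_minutes: float   # how long the business can tolerate the service being down
    rpo_minutes: float   # how much recent data, measured as time, it can afford to lose

@dataclass
class Measured:
    recovery_minutes: float          # taken from the most recent failover drill
    backup_interval_minutes: float   # simplification: worst-case loss equals the interval

def gaps(objective, measured):
    """List which objectives the current setup fails to meet."""
    problems = []
    if measured.recovery_minutes > objective.rto_minutes:
        problems.append("RTO missed")
    if measured.backup_interval_minutes > objective.rpo_minutes:
        problems.append("RPO missed")
    return problems

# Hypothetical services and figures.
payments = Objective("payments", rto_minutes=15, rpo_minutes=5)
reporting = Objective("reporting", rto_minutes=480, rpo_minutes=1440)

print(gaps(payments, Measured(recovery_minutes=40, backup_interval_minutes=5)))
print(gaps(reporting, Measured(recovery_minutes=120, backup_interval_minutes=60)))
```

In this example the payments service is under-protected on recovery time, while the reporting service already has far more headroom than its objectives require, which is where spend can be redirected.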

The concentration point is still worth raising at board level. According to industry market-share data from Synergy Research Group, the three largest providers now account for around two-thirds of the global cloud infrastructure market, and that concentration is precisely why a single regional fault can ripple across so many unrelated businesses at once. That is a legitimate strategic risk to name and plan for. But for most organisations the proportionate response is not to abandon the cloud or to triplicate everything across it. It is to match each workload's resilience to what the business actually needs, and to build and rehearse for the failure modes that are certain to recur.

The providers will keep having bad mornings. us-east-1 will have another one. The only variable you fully control is whether yours is a non-event your customers never notice, or the reason you spend a day issuing apologies. That outcome is set in your architecture, not on your vendor's status page.

Q&A: Resilience as an Architecture Decision

Doesn't going multi-cloud just solve this?
It can, for specific workloads, but it is rarely the cheapest or simplest way to improve resilience. Multi-cloud adds duplicated tooling, cross-provider data synchronisation, egress costs, and the operational burden of staying fluent in more than one platform. For many organisations, a well-architected single-cloud, multi-region design delivers most of the resilience benefit at a fraction of the cost and complexity. Reach for multi-cloud when a workload's requirements genuinely demand it, not as a reflex after a bad news cycle.

We're on a single cloud provider. How exposed are we?
It depends almost entirely on how your workloads are distributed. If everything runs in one region or, worse, in one Availability Zone, you are exposed to exactly the kind of incident that makes headlines. If you span multiple zones and have a tested plan for losing a region, your exposure is far lower – on the same provider. The provider is rarely the deciding factor. Your distribution and your testing are.

What's the single most valuable thing we can do first?
Test a failure you have only ever planned for on paper. Pick a realistic scenario – lose an Availability Zone, lose a key dependency – and run it in a controlled way. Most teams discover that their documented recovery procedure has gaps, stale assumptions, or steps that no longer work. You learn more from one honest failover drill than from months of architecture diagrams.

How do we justify the cost of resilience work to the business?
Frame it in the business's own terms: what does an hour of downtime cost this service, and how likely is the failure it protects against? Resilience is not all-or-nothing, and not every system warrants the same investment. Mapping each service to a recovery time and recovery point objective makes the spend defensible and stops you gold-plating systems that do not need it. The goal is proportionate protection matched to business impact, not maximum redundancy everywhere.

Our architecture is years old. Is this a rebuild?
Usually not. Improving resilience is generally incremental: adding zone redundancy where it is missing, introducing graceful degradation in the highest-impact paths, mapping and removing hidden single points of failure, and establishing a regular testing rhythm. Each of those delivers value on its own. You harden the parts that matter most first, rather than stopping everything for a wholesale rebuild.

Working Through This With Vertex Agility

The shift this article describes – from assuming the provider handles resilience to designing and proving it yourself – is a conversation we are having with technology leaders across several industries right now. The specifics differ. Some are carrying single-region designs they have outgrown, some have a failover plan they have never tested, and some are weighing a costly multi-cloud move they may not actually need.

Our Cloud Consultancy practice works with organisations on exactly this: cloud strategy and migration aligned to business outcomes, hybrid and multi-cloud where it is justified using Kubernetes and OpenShift, cloud-native and serverless design, and infrastructure as code so that resilient configuration is repeatable rather than hand-built. Where we tend to add the most value is in the architectural governance around all of it – making sure the resilience you design is the resilience you can actually demonstrate when a provider has a bad morning.

If you want an independent read on where you currently stand, we offer a free Downtime Defence Audit. It assesses your disaster-recovery readiness, infrastructure redundancy and failover capability, backup and resilience-testing practices, incident response, and your monitoring, observability, and dependency tracking, then returns a report on the gaps that matter most. You can complete it on our website, and for anything more substantial, get in touch with us directly.
