The latest wake-up call

On 20 October 2025, Amazon Web Services (AWS) suffered a major outage, rooted in its US-EAST-1 region, which cascaded into a broad disruption of apps, services and even government systems around the world.
Among the casualties were prominent consumer and enterprise services including Snapchat, Signal, Fortnite, Coinbase, Lloyds Bank, and even the UK government’s tax portal, run by HM Revenue & Customs.
While AWS reports “significant signs of recovery”, the incident brings sharply into focus a systemic fragility in the global Internet infrastructure: too many services relying on too few providers.


Single Points of Failure – what they are and why we should care

A Single Point of Failure (SPOF) is a component of a system whose failure would cause the entire system (or a critical portion of it) to fail. In the context of cloud infrastructure and the Internet, the concept has become increasingly relevant.
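
To put rough numbers on the definition, here is a minimal sketch of the standard availability arithmetic. It assumes component failures are independent, which real outages often are not, so the redundant figure is an optimistic bound. The 99.9% values and the function names are illustrative assumptions, not measurements from any provider.

```python
def series_availability(availabilities):
    """All components are required: the system is up only if every one is up."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result


def redundant_availability(availability, copies):
    """Any one of `copies` independent replicas is enough to stay up."""
    return 1.0 - (1.0 - availability) ** copies


# Hypothetical figures, chosen only to illustrate the arithmetic.
APP = 0.999     # the application itself
REGION = 0.999  # a single cloud region

# The region is a single point of failure: the app needs it to be up.
one_region = series_availability([APP, REGION])

# Two independent regions: the app stays up if either region is up.
two_regions = series_availability([APP, redundant_availability(REGION, 2)])

print(f"one region:  {one_region:.6f}")
print(f"two regions: {two_regions:.6f}")
```

With these made-up figures, the single-region design loses about as much availability to the region as to the application itself, while a second independent region removes almost all of the region's contribution.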

Why it matters

  • When a major cloud provider or a single data centre region goes down, the ripple effects can be enormous. In today’s AWS case, many downstream services (applications, banks, government services) were affected.
  • Because many services outsource infrastructure (compute, storage, databases) to large cloud providers, dependencies have grown concentrated. In effect, one provider’s outage can knock out many clients simultaneously.
  • As one industry blog puts it: “This outage demonstrates that Internet infrastructure … has evolved some single points of failure that can cascade far beyond their original scope.”
  • The economic, operational and reputational cost of a major outage is large, both for the affected provider and for its clients.

Where the SPOFs lie

Some of the weaker points in today’s Internet/cloud ecosystem include:

  • Cloud provider regional failures: When a data centre region or a critical service (e.g., a database or DNS) goes offline, the services hosted in that region or relying on that service fail with it.
  • Dependency on a single provider feature: For instance, many services use AWS offerings such as DynamoDB and S3. If one of those degrades, the dependent applications suffer.
  • Monopolised infrastructure: When many businesses cluster around a few large cloud providers (AWS, Microsoft Azure, Google Cloud), outages in these large providers impact wide swathes of the ecosystem.
  • Lack of diversification in architecture: Applications may assume “the cloud provider will be up,” but often don’t have fallback strategies across providers or data centres.

The AWS outage – a textbook case

Today’s incident serves as a clear example of how a SPOF in cloud infrastructure affects the broader Internet:

  • The AWS disruption appears to stem from failures in one or more key infrastructure components in US-EAST-1.
  • Because many services are hosted (or have components hosted) in that region, or rely on AWS services globally, the knock-on effects were extensive, reaching consumer apps, financial services and government portals.
  • The fact that one provider region can have such broad impact illustrates the systemic risk: the Internet is increasingly built on a handful of providers whose failures ripple outwards.

Why this is a broader concern for the Internet

  • Democratisation of service doesn’t necessarily mean dispersion of risk: As more services move to the cloud (good for agility), more critical infrastructure ends up concentrated in the same few places.
  • Chains of dependency become opaque: A small service might outsource parts of its stack to AWS, which itself may depend on other internal systems – a failure at one layer cascades. Research into cloud service dependencies underscores how difficult it is to trace and manage the “intensity of dependency”.
  • Resilience becomes more expensive and complex: To avoid SPOFs, organisations must build multi-cloud, multi-region fallback plans – but many services don’t invest at that level.
  • Regulatory and systemic risk implications: When large sectors (finance, government, telecoms) depend on one provider, outages become national-level risk events — not just “IT hiccups”.
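
The opacity of dependency chains can be reduced by writing them down and walking them. The following is a minimal sketch, assuming you can list each component's direct dependencies and that every listed dependency is required. All service and component names are hypothetical.

```python
from collections import Counter

# Hypothetical dependency map: each key needs every item in its list to be up.
DEPENDS_ON = {
    "checkout-api": ["orders-db", "auth"],
    "mobile-app": ["checkout-api", "push-service"],
    "auth": ["dns"],
    "orders-db": ["region-a"],
    "push-service": ["region-a", "dns"],
}


def transitive_dependencies(node, graph):
    """Return every component that `node` directly or indirectly requires."""
    seen = set()
    stack = list(graph.get(node, []))
    while stack:
        dep = stack.pop()
        if dep not in seen:  # the `seen` set also guards against cycles
            seen.add(dep)
            stack.extend(graph.get(dep, []))
    return seen


def blast_radius(graph, services):
    """Count, for each component, how many of `services` would fail with it."""
    counts = Counter()
    for service in services:
        counts.update(transitive_dependencies(service, graph))
    return counts


customer_facing = ["checkout-api", "mobile-app"]
radius = blast_radius(DEPENDS_ON, customer_facing)

# Components that every customer-facing service ultimately relies on.
shared = sorted(c for c, n in radius.items() if n == len(customer_facing))
print(shared)
```

In this toy map, `region-a` and `dns` never appear as direct dependencies of `checkout-api`, yet both end up in `shared`: that is the kind of hidden single point of failure a dependency walk is meant to surface.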

What organisations and the Internet need to do

Here are some of the key mitigation strategies to reduce SPOF risk:

  1. Design for redundancy and multi-provider architectures
    Use more than one cloud provider or multiple regions. Replicate services and data, and avoid reliance on a single region or service component.
  2. Trace and map dependencies
    Understand which services depend on which underlying infrastructure (databases, DNS, cloud provider services) so you know where failures could cascade.
  3. Failover and fallback planning
    Define in advance what happens if primary infrastructure fails: automated or manual failover to an alternate cloud, region or provider.
  4. Monitor and test resilience
    Simulate outages (chaos engineering) to check if failover paths work, and monitor for dependencies you might have overlooked.
  5. Avoid over-concentration in one provider’s “critical services”
    Just because one cloud provider offers convenience doesn’t mean you should put all your eggs in one basket, especially for infrastructure that is critical to your organisation or customers.
  6. Transparency & SLAs
    Cloud providers should provide transparent status and impact data; clients must negotiate SLAs that reflect the risk of provider-level failures.
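
Strategies 3 and 4 above can be sketched together: a failover routine that walks a priority list of backends, and a simulated outage that checks the fallback path is actually taken. This is a minimal illustration, not a production design. The backend names are hypothetical, and the health check is a stand-in for whatever probe (HTTP, TCP, provider status API) you would really use.

```python
class AllBackendsDown(Exception):
    """Raised when no configured backend passes its health check."""


def pick_backend(backends, is_healthy):
    """Return the first backend, in priority order, that passes `is_healthy`."""
    for backend in backends:
        try:
            if is_healthy(backend):
                return backend
        except Exception:
            # A health check that raises is treated as a failed check.
            continue
    raise AllBackendsDown(f"no healthy backend among {backends}")


def simulate_outage(down):
    """Build a fake health check that reports the names in `down` as failed."""
    def is_healthy(backend):
        return backend not in down
    return is_healthy


# Hypothetical priority list: same provider first, another provider last.
BACKENDS = ["primary-region", "secondary-region", "other-provider"]

normal = pick_backend(BACKENDS, simulate_outage(set()))
region_lost = pick_backend(BACKENDS, simulate_outage({"primary-region"}))
provider_lost = pick_backend(
    BACKENDS, simulate_outage({"primary-region", "secondary-region"})
)

print(normal, region_lost, provider_lost)
```

Injecting the failure through `simulate_outage` is a much smaller cousin of chaos engineering: the point is to exercise the fallback path on purpose rather than discover during a real outage that it never worked.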

Conclusion

Today’s AWS outage is a stark reminder that despite the cloud being marketed as “always on,” the underlying infrastructure still contains single points of failure, and those failures now affect much more than just one service or one company. As more of the Internet relies on a handful of major providers, resilience must become a first-class priority.

For businesses, developers, and even governments, the issue isn’t whether a provider will fail in a way that affects them, but when. The question to ask now is: What happens to your service if your cloud provider, data centre region, or one key infrastructure component goes down completely?
