How to avoid outages in financial services

Three models for maximum uptime in financial services

In the financial services industry, downtime isn’t an inconvenience; it’s a catastrophe. A single outage can lead to staggering financial losses, punitive regulatory fines, and long-term reputational damage.

That’s why financial institutions are on a quest for “unbreakable” infrastructure — one that ensures resilience. This journey has led many to explore multiple architectural models, from relatively simple single-cloud models to complex multicloud models. At the heart of this quest lies a fundamental tension: the pursuit of resilience versus the realities of cost, complexity, and operational risk.

Today, accelerating the resilience journey is urgent. The financial services industry remains among the industries most frequently targeted by cyber attacks. Cybercriminals, including those backed by nation-states, are attempting not only to steal valuable data but also to cause significant disruptions to financial systems. And the availability of new technologies, including AI tools and (soon) quantum computers, is enabling attackers to launch larger, more sophisticated attacks that are more successful at causing disruption.

At the same time, the CrowdStrike outage in 2024 was a massive wake-up call to financial services companies — and regulators. Companies are now determined to explore new IT architectures that avoid single points of failure.

In working with leading financial services companies, I’ve found that there is no single path to resiliency and no single architectural model that is perfect for everyone. However, nearly all choose one of three approaches. Whether you intend to strengthen resilience by moving from on-premises infrastructure to the cloud or by transitioning from a single cloud provider to multiple clouds, exploring the pros and cons of each model can help ensure you are making the right choice for your organization.

Model 1: Improving availability with a single cloud provider

The cloud has long been recognized for enhancing resilience. By using cloud services, organizations can avoid the cost and complexity of building, managing, and maintaining their own infrastructure — including backup data centers for protecting data and high-availability (HA) clusters for maintaining application availability.

For the vast majority of financial services organizations, cloud-based resiliency begins with a single, trusted cloud provider. They leverage that provider’s built-in HA features, for example, by distributing workloads across multiple availability zones (AZs). The application is designed so that if one AZ goes down, traffic is rerouted to the remaining AZs, ensuring business continuity.
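To make the rerouting idea concrete, here is a minimal sketch in Python. It is not any cloud provider's API: the `Zone` class, the zone names, and the even-split policy are all assumptions chosen for illustration. It only shows how traffic shares might shift when one AZ is marked unhealthy.

```python
from dataclasses import dataclass


@dataclass
class Zone:
    """One availability zone. Names used below are illustrative only."""
    name: str
    healthy: bool = True


def traffic_shares(zones: list[Zone]) -> dict[str, float]:
    """Split traffic evenly across healthy zones; unhealthy zones get none."""
    healthy = [z for z in zones if z.healthy]
    if not healthy:
        # Every zone is down: multi-AZ redundancy cannot help in this case.
        raise RuntimeError("no healthy availability zone")
    share = 1.0 / len(healthy)
    return {z.name: (share if z.healthy else 0.0) for z in zones}


zones = [Zone("az-a"), Zone("az-b"), Zone("az-c")]
before = traffic_shares(zones)  # each zone carries one third of the traffic

zones[0].healthy = False        # simulate the loss of one zone
after = traffic_shares(zones)   # the two remaining zones split the traffic
```

Note the final branch: if every zone in the provider is unavailable, there is nowhere left to route traffic, which is the single-provider limitation this model carries.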

Pros
Reduced downtime: A massive improvement over traditional on-premises or single-AZ setups, this model helps avoid potential downtime related to server rack failures or local power outages.
Simplified management: Your team needs to be proficient with only one cloud provider’s tools.
Fewer upfront costs: You can trade large CapEx costs for more manageable OpEx costs.

Cons
Single point of failure: Though significant outages are rare for cloud providers, a systemic event could still take your services offline.
Complex management: With most cloud providers, setting up failovers between AZs still requires substantial manual effort.
Provider lock-in: Relying on a single provider makes it difficult to migrate to another if needs change, prices rise, or performance degrades. You also might miss out on innovative services offered by another provider.

Model 2: Reducing risk with a “polycloud” strategy

I first heard the term “polycloud” from Goldman Sachs technology executives around 2022, though the term may have been coined earlier. While a “multicloud” approach simply means using services from multiple cloud providers, a polycloud strategy involves strategically dividing workloads among two or more providers. The providers don’t necessarily run the same workloads at the same time. Instead, an organization assigns workloads to different clouds, often based on the appropriateness of a cloud service to a specific workload.

For example, a bank might run its retail banking applications on one cloud platform and its investment banking operations on another. I’ve also encountered a few large institutions that choose to host their website with one cloud provider and their mobile application with another.
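The workload split described here can be pictured as a simple placement map. The sketch below is illustrative only: the provider labels (`cloud-a`, `cloud-b`), the workload names, and the `impact_of_outage` helper are hypothetical. It shows how an outage at one provider touches only the workloads placed there.

```python
# Hypothetical placement of workloads on providers (labels are placeholders).
PLACEMENT = {
    "retail-banking": "cloud-a",
    "investment-banking": "cloud-b",
    "website": "cloud-a",
    "mobile-app": "cloud-b",
}


def impact_of_outage(
    placement: dict[str, str], failed_provider: str
) -> tuple[list[str], list[str]]:
    """Return (affected, unaffected) workloads if one provider goes down."""
    affected = sorted(w for w, p in placement.items() if p == failed_provider)
    unaffected = sorted(w for w, p in placement.items() if p != failed_provider)
    return affected, unaffected


affected, unaffected = impact_of_outage(PLACEMENT, "cloud-a")
# affected lists the workloads on cloud-a; unaffected lists those on cloud-b
```

In this toy placement, losing `cloud-a` takes down the website but leaves the mobile app running, which matches the website-and-mobile-app split that some large institutions choose.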

Pros
Contained damage: This approach contains the damage of an outage: A problem with one provider affects only a subset of operations. If the provider hosting your site is down, you could still point users to your mobile app running on a different cloud.
Increased flexibility: You can avoid being locked into a single provider’s roadmap or pricing. And you can choose the best provider for each specific workload. If you have sufficient operational competence, you can continuously move workloads among clouds, optimizing for performance or costs.

Cons
Integration challenges: Ensuring seamless data flows and consistent security controls across clouds is a significant undertaking. It is mostly medium-sized and large organizations that have adopted a polycloud strategy.

Model 3: Eliminating service interruptions with an active-active multicloud approach

For the largest systemically important financial institutions (SIFIs), not even the polycloud model provides sufficient resilience. This handful of organizations instead implements an “active-active” multicloud architecture. With this architecture, the same critical workload — like a core banking application — runs simultaneously across two or more cloud providers. Traffic is load-balanced between them, so if one provider fails, all traffic is automatically rerouted to the other with no interruption in service.
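As a rough illustration of the traffic behavior described above, the following Python sketch alternates requests between two providers and skips any provider marked as down. The `ActiveActiveRouter` class and the provider labels are hypothetical, and the sketch deliberately ignores the hard part of a real deployment: keeping data synchronized between the two clouds.

```python
class ActiveActiveRouter:
    """Round-robin router over providers that all serve the same workload."""

    def __init__(self, providers: list[str]) -> None:
        if not providers:
            raise ValueError("at least one provider is required")
        self._providers = list(providers)
        self._healthy = set(providers)
        self._next = 0  # index of the provider to try first on the next request

    def mark_down(self, provider: str) -> None:
        self._healthy.discard(provider)

    def mark_up(self, provider: str) -> None:
        if provider in self._providers:
            self._healthy.add(provider)

    def route(self) -> str:
        """Pick the next healthy provider, skipping any that are down."""
        for _ in range(len(self._providers)):
            candidate = self._providers[self._next]
            self._next = (self._next + 1) % len(self._providers)
            if candidate in self._healthy:
                return candidate
        raise RuntimeError("all providers are down")


router = ActiveActiveRouter(["cloud-a", "cloud-b"])
normal = [router.route() for _ in range(4)]    # alternates between both clouds

router.mark_down("cloud-a")                    # simulate a provider-wide outage
failover = [router.route() for _ in range(3)]  # every request goes to cloud-b
```

Because both providers are already serving live traffic, failover here is just a routing decision rather than a cold start, which is what distinguishes active-active from a standby arrangement.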

When I ask leaders at these institutions why they have adopted this model, the answer almost always involves regulatory requirements. These organizations must meet operational resilience requirements, such as those outlined by the Federal Reserve or in the EU’s Digital Operational Resilience Act (DORA).

Pros
Maximum resilience: This model can withstand a complete failure of a cloud provider.
Regulatory confidence: This model satisfies the most stringent requirements for operational resilience.

Cons
Cost: This model comes at a steep price. Running fully redundant infrastructure, even in the cloud, is expensive.
Complexity: The complexity of managing an active-active environment is immense.
Risk: The model’s complexity introduces new dangers: A misconfiguration in the intricate web of load balancers and data synchronization tools could cause an outage that wouldn’t happen in a simpler setup.
Lowest common denominator: When you have a mirrored active environment with two separate cloud providers, you need to forgo any distinctive services that one cloud provider might offer. Establishing two identical environments means finding the lowest common denominator between the two clouds.
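The “lowest common denominator” constraint can be thought of as a set intersection. In the sketch below, the two service catalogs are invented placeholders, not real product lists. Only services present in both catalogs can be mirrored, and anything unique to one provider has to be given up.

```python
# Hypothetical service catalogs; the names are placeholders, not real offerings.
CLOUD_A = {"vm", "object-storage", "managed-sql", "proprietary-ml"}
CLOUD_B = {"vm", "object-storage", "managed-sql", "proprietary-ledger"}


def mirrored_services(a: set[str], b: set[str]) -> set[str]:
    """Services available on both clouds, usable in a mirrored environment."""
    return a & b


def forgone_services(a: set[str], b: set[str]) -> set[str]:
    """Services unique to one cloud, which a mirrored design must give up."""
    return a ^ b


usable = mirrored_services(CLOUD_A, CLOUD_B)
forgone = forgone_services(CLOUD_A, CLOUD_B)
```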

Despite the challenges, this model is often a requirement for those few institutions that are considered “too big to fail.” It gives regulators the requisite proof that these institutions have taken every possible step to ensure financial stability. Consequently, for these institutions, the high costs of this model are a necessary cost of doing business.

Envisioning a future of intelligent resilience

There is no one-size-fits-all solution. The right architecture for any given financial institution depends on its size, ability to handle complexity, risk tolerance, and regulatory obligations. As your organization works to enhance resiliency, the key will be to make informed decisions, carefully weighing the trade-offs.

Keep in mind that technology options will evolve, and that could alter your decisions. For example, AI and machine learning will likely play an increasing role in predicting and preventing outages, while new tools will help simplify the management of complex polycloud and multicloud environments.

At Cloudflare, we are committed to helping all financial institutions build a more resilient and secure future. Our global network, which offers a highly available architecture with hundreds of redundant data centers, provides a robust foundation for all three of these architectural models. Our connectivity cloud provides a unified platform of cloud-native services with an API-first approach that enables teams to automate key workflows. With Cloudflare, financial services organizations can achieve high-performance connectivity, defend against emerging threats, streamline compliance, and accelerate innovation, all while controlling costs and reducing complexity.

This article is part of a series on the latest trends and topics impacting today’s technology decision-makers.

Dive deeper into this topic.

Explore the forces shaping the cybersecurity landscape and learn ways to build a more resilient organization in the Cloudflare Signals Report: Resilience at Scale.


Author

Trey Guinn — @treyguinn
Field CTO, Cloudflare

Key takeaways

After reading this article, you will be able to understand:

3 IT architectural models that enhance operational resiliency
The pros and cons of each architectural model
How to balance business continuity with risks, costs, and complexity
