How Can You Protect Your Cloud Applications from Outages?

How Can You Protect Your Cloud Applications from Outages?

Were you affected by the Amazon Web Services (AWS) outage this week? If you use Quora, Slack or Trello chances are you were.  S3 Storage services in the US-East region was essentially unavailable for 4-5 hours.  Since many other AWS services and customer applications depend on S3, this had a large wide-ranging impact across many services, sites and applications.

Some popular sites that were affected included:

  • Quora
  • Buisness Insider
  • Trello
  • Slack
  • Giphy
  • Medium
  • US Securities and Exchange Commission

Ironically, even the AWS status page that reports on the status of various AWS services was affected. In the brave new world of IoT devices everywhere – often relying on cloud services to operate – this resulted in people being unable to control the lights in their home or their thermostats, and there was even a story where an IoT home alarm was going off and they were unable to turn it off.

Were you affected by the AWS outage this week? If you use Quora, Slack or Trello chances are you were. S3 Storage services in the US-East region was essentially unavailable for 4-5 hours. Since many other AWS services and customer applications depend on S3, this had a large wide-ranging impact across many services, sites and applications

 

How Can You Protect Your Azure Applications?

No cloud provider is immune to outages – in fact Azure had a similar outage with Azure Storage Accounts a couple of years ago that had a wide-ranging impact. However, there are actions that software teams can take to minimize the risk of outages, and provide a high level of availability even in the face of infrastructure outages.

Azure provides a robust set of options for designing and deploying applications for high availability.  If used appropriately, this would allow your applications to survive outages like the AWS S3 outage this week.

 

Availability Sets

If your application uses an IaaS model with Azure VM’s, you can configure multiple redundant VM’s into Availability Sets, which automatically spread the VM’s across what Azure calls Fault Domains and Update Domains.  Fault Domains ensure there is no shared hardware (servers, power providers, network providers) between the VM’s, protecting your application from hardware failures.  Update Domains are used by Azure to coordinate updates that may result in VM reboots, such that only some of the VM’s inside an availability set are rebooted at the same time, ensuring your application remains available at all times. If you use Azure Managed Disks – a relatively new service – it also ensures that the disks your VMs use are isolated from each other across different “stamps”, ensuring that hardware failures only affect some but not all of the VM/disks.Imaginet Azure Quickstart - • Availability Sets - If your application uses an IaaS model with Azure VM's, you can configure multiple redundant VM's into Availability Sets, which automatically spread the VM's across what Azure calls Fault Domains and Update Domains. Fault Domains ensure there is no shared hardware (servers, power providers, network providers) between the VM's, protecting your application from hardware failures. Update Domains are used by Azure to coordinate updates that may result in VM reboots, such that only some of the VM's inside an availability set are rebooted at the same time, ensuring your application remains available at all times. If you use Azure Managed Disks - a relatively new service - it also ensures that the disks your VMs use are isolated from each other across different "stamps", ensuring that hardware failures only affect some but not all of the VM/disks.

 

Traffic Manager

To protect against an entire Azure Region outage, you can create multiple redundant copies of your Web Sites and/or VM’s across multiple Azure Regions.  Azure Traffic Manager provides an intelligent convenient method of load balancing the traffic across those resources and regions.  It can monitor and detect when services or entire regions become unavailable automatically redirecting traffic to the healthy services.  It can also direct traffic to the region closest to the actual user – useful even when everything is healthy.

Imaginet Azure Quickstart - • Traffic Manager - To protect against an entire Azure Region outage, you can create multiple redundant copies of your Web Sites and/or VM's across multiple Azure Regions. Azure Traffic Manager provides an intelligent convenient method of load balancing the traffic across those resources and regions. It can monitor and detect when services or entire regions become unavailable automatically redirecting traffic to the healthy services. It can also direct traffic to the region closest to the actual user - useful even when everything is healthy.

SQL Database Geo-Replication

When using Azure SQL Databases, you can configure Geo-Replication to protect against any region-specific outages. Even without Geo-Replication, SQL Databases are protected against hardware failures within a datacenter by having three separate copies of the data spread across isolated hardware within the same datacenter.  With Geo-Replication you can ensure your database remains available even if an entire Azure Region/Datacenter suffers an outage.  Geo-Replication maintains multiple copies of your database in multiple Azure regions (up to five different regions), and keeps them up-to-date automatically.  If an outage occurs, there is an easy mechanism to failover your database to one of the replicated secondary regions.

 

Geo-Redundant Storage

When using Azure Storage Accounts, even the least expensive tier (LRS) automatically creates three copies of your data spread across separate fault domains and update domains within the same datacenter.  To achieve even higher resiliency you can opt for Geo-Redundant Storage (or RA-GRS) which replicates your data across two Azure regions, storing three copies in each region/data center.

 

By making appropriate use of the above Azure services, and architecting your application infrastructure for high availability, even severe cloud outages like the one AWS experienced can be mitigated ensuring your applications remain available for your customers.

Of course, even the most well thought out architectures and careful deployments need to be tested.  If you’re striving for high-availability – able to survive even a region-wide outage – the only ways to know whether you’ve been successful is to wait for the next outage and pray, or plan and run tests pro-actively to test your applications.

Netflix in particular has been doing some amazing things in regards to testing their ability to handle unexpected outages.  They created a tool called Chaos Monkey – which later evolved into a suite of tools such as Chaos Gorilla, Chaos Kong, Latency Monkey, etc.  These tools run continuously in their production environment constantly triggering random outages, shutting down servers and other systems to ensure that the automated detection and failover systems are working properly.  Chaos Kong, for example, will take down an entire region – IN PRODUCTION!

 

Next Steps

If you are planning to move your applications to the cloud – or have existing applications in Azure – let us help you plan, architect, and implement your applications to ensure they remain highly available even in the face of Azure outages.

Act now with our Azure Quickstart

The Azure Quickstart is a combination of the Azure Discovery and the Azure Pilot. It provides your organization with education on Microsoft’s Azure offering and a methodology, based on industry best practices, to guide you through candidate selection. Schedule your Azure Quickstart with Imaginet TODAY!

Learn More

Click here to learn more about our other Azure Consulting Services.

Dylan Smith

About Dylan Smith

Dylan Smith is a Microsoft MVP in ALM, and an ALM (Application Lifecycle Management) consultant for Imaginet where he spends his time helping teams become more successful at delivering software. In addition to over 16 years experience designing and architecting mission critical applications, Dylan helps run the .Net User Group in his hometown of Winnipeg, Canada. In the past 8 years Dylan has focused on agile development techniques and practices. He has led the shift to agile and lean development practices across multiple teams, projects and companies.

Leave a Reply