We deliver high-quality solutions for modern world challenges. As IT professionals, together with our partners, we build a digital future.
Our vision is to be an organization that can effectively solve the real business problems of top players.
It happened! Alarms were triggered, instances went down, and clients started to complain. A widespread infrastructure failure is one of the biggest challenges a team can face, and it can lead to serious business consequences.
We faced such a situation recently with one of our AWS deployments, where 80% of the instances went down at the same moment. What happened? Many possible reasons can cause system status checks to fail:
- loss of network connectivity;
- loss of system power;
- software issues on the physical host;
- hardware issues on the physical host that impact network reachability.
In that particular case, the data center was affected by a power loss, but we only learned that 50 minutes later, from the official AWS EC2 operational issue update.
It is impossible to completely avoid infrastructure issues, but we can prepare deployments, people, processes, and tools to quickly resolve most of them.
The first thing to remember is that proper monitoring and alerting are crucial. In our case, our monitoring system (thank you, SolarWinds!) notified us within 60 seconds; EC2 alarms were triggered 4-5 minutes later, and the official AWS EC2 operational issue was created 25 minutes after the incident began.
The second thing is proper deployment in the cloud, with High Availability in mind. Usually that involves load balancing, deploying to multiple locations, and automatic failover. The EC2 auto-recovery feature can be used to recover instances from a system status check failure. Moreover, AWS Auto Scaling can be configured so that a new instance is launched when a health check fails on the current one. Larger deployments and more mission-critical services can benefit from other, more sophisticated EC2 architecture components designed for high availability.
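As a rough illustration of the auto-recovery setup mentioned above, the sketch below builds the parameters for a CloudWatch alarm that triggers the EC2 recover action when the system status check fails. The instance ID and region are placeholders, and the thresholds are example values, not a recommendation.

```python
# Sketch: parameters for a CloudWatch alarm that auto-recovers an EC2 instance
# when its *system* status check (host-level failure) starts failing.
# The instance ID below is a placeholder, not a real resource.

def recover_alarm_params(instance_id: str, region: str = "us-east-1") -> dict:
    """Build put_metric_alarm parameters for EC2 auto-recovery."""
    return {
        "AlarmName": f"auto-recover-{instance_id}",
        "Namespace": "AWS/EC2",
        "MetricName": "StatusCheckFailed_System",  # system check, not instance check
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Maximum",
        "Period": 60,
        "EvaluationPeriods": 2,  # two consecutive failed minutes before acting
        "Threshold": 1.0,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        # The recover action migrates the instance to healthy host hardware.
        "AlarmActions": [f"arn:aws:automate:{region}:ec2:recover"],
    }
```

With credentials configured, the alarm could then be created with something like `boto3.client("cloudwatch").put_metric_alarm(**recover_alarm_params("i-0123456789abcdef0"))`. Note that auto-recovery only helps with host-level failures of supported instance types; it is not a substitute for a Disaster Recovery strategy.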
Last but not least, the proper Disaster Recovery strategy should be defined, implemented and properly verified. AWS recommends four different approaches, ranging from the low cost and low complexity of making backups to more complex strategies using multiple active Regions.
Broad Disaster Recovery strategies recommended by AWS:

- Backup and restore
  - RPO/RTO: hours
  - Lower priority use cases
  - Restore data after the event
  - Deploy resources after the event
  - Cost: $
- Pilot light
  - RPO/RTO: tens of minutes
  - Less stringent RTO & RPO
  - Core services
  - Start and scale resources after the event
  - Cost: $$
- Warm standby
  - RPO/RTO: minutes
  - More stringent RTO & RPO
  - Business-critical services
  - Scale resources after the event
  - Cost: $$$
- Multi-site active/active
  - RPO/RTO: real-time
  - Zero downtime
  - Near-zero loss
  - Mission-critical services
  - Cost: $$$$
You can read more about that in the AWS Whitepaper: https://docs.aws.amazon.com/whitepapers/latest/disaster-recovery-workloads-on-aws/disaster-recovery-options-in-the-cloud.html
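The trade-off between the four tiers can be sketched as a simple selection rule: pick the cheapest strategy whose typical RTO still meets your target. The thresholds below are our rough interpretation of the whitepaper's "hours / tens of minutes / minutes / real-time" guidance, not official cutoffs.

```python
# Illustrative only: map a target RTO (in minutes) to the AWS DR strategy
# tiers described above. Thresholds are rough interpretations, not AWS rules.

def suggest_dr_strategy(rto_minutes: float) -> str:
    """Pick the cheapest DR tier whose typical RTO meets the target."""
    if rto_minutes >= 60:      # hours: restore from backups after the event
        return "Backup and restore"
    if rto_minutes >= 10:      # tens of minutes: core services kept warm
        return "Pilot light"
    if rto_minutes >= 1:       # minutes: scaled-down copy already running
        return "Warm standby"
    return "Multi-site active/active"  # near-zero downtime requirement
```

In practice the choice also depends on RPO, budget, and the criticality of the workload, so a rule like this is only a starting point for the discussion.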
Our monitoring worked well and allowed us to start resolving the issue within minutes. The AWS Premium Support plan let us create technical support cases immediately, without waiting out the 20-minute instance checks snooze period.
Due to the nature of the incident, we were not able to recover the affected instances ourselves. Moreover, our DR plan for that deployment was Backup and restore, so we started the recovery process. AWS restored power in the affected data center after 45 minutes, which allowed the majority of EC2 instances and EBS volumes to recover. However, due to the nature of the power event, some of the underlying hardware experienced failures that had to be resolved by engineers within the facility, who worked to recover the remaining EC2 instances and EBS volumes.
According to our data, the first affected instance was fully restored within an hour, and the last within three hours.
We analyzed the incident, wrote down the lessons learned, and decided to upgrade the Disaster Recovery plan for that deployment.
How exactly can our clients benefit from our lessons learned?
First of all, we discuss the possible Disaster Recovery strategies with our clients in detail, explaining all their pros and cons. Secondly, we make sure the chosen strategy is implemented correctly and tested frequently, to avoid any surprises when a real disaster occurs. Last but not least, we offer well-tested services with a proven and successful track record.
SolDevelo is a dynamic software development and information technology outsourcing company focused on delivering high-quality software and innovative solutions. An experienced team of developers, customer-oriented service, and a passion for creating the highest quality products using the latest technology are the undeniable advantages of the company.
Using Atlassian products since 2009, SolDevelo always strives to exceed customers' expectations. ISO 9001 confirms our dedication to the highest quality, and ISO 27001 shows that we take security extremely seriously. Over 70% of our team members are certified Scrum Professionals, over 35% are Oracle Certified Professionals, and 100% of our quality assurance team holds ISTQB certificates.