Amazon Web Services has explained what went wrong to cause the major outage that crippled many businesses this week. Here's what you need to know.
In its post-event analysis, AWS said that between 11:48 p.m. on Oct. 19 and 2:40 a.m. on Oct. 20, Amazon DynamoDB experienced “increased API error rates” in its US-East-1 (Northern Virginia) Region, the main region for deploying applications.
AWS says the incident was triggered by “a latent defect” — in other words, a hidden fault — within the service’s automated DNS management system. This caused endpoint resolution failures for DynamoDB, AWS noted.

Services such as DynamoDB maintain “hundreds of thousands of DNS records to operate a very large heterogeneous fleet of load balancers in each Region,” AWS said. “Automation is crucial to ensuring that these DNS records are updated frequently to add additional capacity as it becomes available, to correctly handle hardware failures, and to efficiently distribute traffic to optimize customers’ experience,” according to AWS.

A race condition — which happens when multiple requests are sent concurrently to the same endpoint — in the DynamoDB DNS management system resulted in an incorrect empty DNS record for the service’s regional endpoint, which the automation failed to repair.

Separately, AWS’s network load balancers experienced increased connection errors for some customers in the same Region between 5:30 a.m. and 2:09 p.m. on Oct. 20. “This was caused by health check failures in the NLB fleet, which resulted in increased connection errors on some NLBs,” AWS explained.

In tandem, between 2:25 a.m. and 10:36 a.m. on Oct. 20, new EC2 instance launches failed. While instance launches began to succeed from 10:37 a.m., some newly launched instances experienced connectivity issues, which were resolved by 1:50 p.m., according to AWS. “The delays in network state propagations for newly launched EC2 instances also caused impact to the network load balancer service and AWS services that use NLB,” AWS said.

AWS has now issued an apology for the incident. “We apologize for the impact this event caused our customers,” AWS wrote. “While we have a strong track record of operating our services with the highest levels of availability, we know how critical our services are to our customers, their applications and end users, and their businesses. We know this event impacted many customers in significant ways.
We will do everything we can to learn from this event and use it to improve our availability even further.”

AWS has also outlined changes it is making to prevent a recurrence. For example, it has already disabled the DynamoDB DNS Planner and DNS Enactor automation worldwide. “In advance of re-enabling this automation, we will fix the race condition scenario and add additional protections to prevent the application of incorrect DNS plans,” AWS said. For NLB, AWS is adding a velocity control mechanism to limit the capacity a single NLB can remove when health check failures cause Availability Zone failover. For EC2, AWS is building an additional test suite to augment its existing scale testing, which will exercise the DWFM recovery workflow to “identify any future regressions.”

The AWS outage had a huge impact, leaving some firms unable to operate for hours due to issues with the apps they depend on. To its credit, AWS delivered its post-event analysis very quickly. However, the damage to its reputation has already been done.
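AWS’s summary describes the race condition only in outline. As a rough illustration of the general failure mode — two automation workers acting on the same record set, one of them with a stale view — the toy Python sketch below shows how a delayed cleanup pass can empty a DNS record that a newer plan just populated. This is not AWS’s actual implementation; the endpoint name, IPs, and worker roles are all hypothetical.

```python
# Toy sketch of a "check-then-act" race, NOT AWS's real DNS automation.
# Two workers manage the record set for a hypothetical regional endpoint.

dns = {"dynamodb.example-region.amazonaws.com": ["10.0.0.1"]}  # live record
endpoint = "dynamodb.example-region.amazonaws.com"

def enactor_apply(ep, new_ips):
    # Worker A: replace the record set with a newer plan.
    dns[ep] = list(new_ips)

def cleaner_delete_stale(ep, stale_ips):
    # Worker B: delete IPs it believes are stale. Its view of the record
    # set was captured *before* Worker A ran, so it is out of date.
    dns[ep] = [ip for ip in dns.get(ep, []) if ip not in stale_ips]

# Worker B snapshots state and decides every IP it has seen is stale...
stale_view = list(dns[endpoint]) + ["10.0.0.2"]
# ...meanwhile Worker A applies a newer plan containing 10.0.0.2.
enactor_apply(endpoint, ["10.0.0.2"])
# Worker B's delete now lands on the newer plan, emptying the record.
cleaner_delete_stale(endpoint, stale_view)

print(dns[endpoint])  # [] — an empty record set: the endpoint no longer resolves
```

In a real system the fix is to make the check and the write atomic (for example, a compare-and-set on a plan version), which is in the spirit of the “additional protections” AWS says it will add before re-enabling the automation.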