Heni Games: The 8 worst outages of 2021: AWS, Google Cloud, Fastly, and more

Friday, December 31, 2021

2021 was an ignominious year for apps and websites backed by the cloud — which is basically all of them.

Cloud service outages are nothing new. However, 2020's shift to working from home exposed tons of vulnerabilities, as carriers, cable and fiber companies, and every popular app under the sun experienced some temporary, catastrophic collapse. It placed an unprecedented burden on the cloud infrastructure systems that back your favorite streaming and productivity sites. These outages were an inevitable consequence.

You'd have hoped 2021 would show marked improvement. Instead, it proved that the internet is a house of cards ready to collapse if the wrong foundational piece folds. Whether out of frugality or poor planning, many sites put all their data and traffic eggs in one cloud basket; a single node failure can take out some of the highest-traffic sites, which we'd expect to have much better contingencies in place.

We saw our favorite messaging apps, smart homes, gaming networks, productivity suites, and social media sites collapse at one point or another this year. Beyond that, the Amazon Web Services (AWS) and Facebook outages proved how much of our daily lives depend on the cloud, from smart home tech to our package deliveries.

Looking back on the worst outages of 2021, we can only hope things improve in 2022. But there's no reason to assume they will unless cloud infrastructure companies and content delivery networks (CDNs) change the way they do things — and unless companies start adding offline functionality to cloud-reliant tech.

1. AWS outage stops deliveries, cameras, and cat feeders

The December AWS outage is likely still fresh in your mind. Amazon Web Services reportedly runs about 33% of cloud infrastructure services, so when AWS fell apart on Dec. 7, it may have taken about a third of cloud services with it.

According to the AWS team, the AWS internal network for monitoring, internal DNS, and authorization services somehow triggered a "large surge of connection activity that overwhelmed the networking devices between the internal network and the main AWS network, resulting in delays for communication between these networks." Because this internal network is linked to the global AWS servers, it caused traffic delays or outright site shutdowns internationally for about 7 hours until the devs could fix the internal network.

During the holiday shopping season, the apps that give Amazon delivery drivers their routes and addresses went down, leaving them unable to complete deliveries. Consumers couldn't place new Amazon orders either, which means companies missed out on almost a day of revenue. First-party Amazon services (Alexa, Ring cameras, Prime Video, and Music) all went down, leaving owners' smart video doorbells and baby monitors temporarily worthless. And popular third-party services like Disney+, Venmo, and iRobot all broke down thanks to their choice of cloud provider.

According to CNBC, the AWS outage effects even rippled out to disrupt final exams at colleges, since some exam services relied on the cloud to work. Even some "smart" automatic cat feeders stopped feeding their cats for the day.

Following this outage, Android Central readers said they were warier than before about cloud-dependent smart home tech. And while experts think Amazon needs to incorporate offline controls into its smart home tech, they also think it's unlikely to happen. That's because the cloud lets companies sell cheap, underpowered tech that couldn't run without it.

2. The Meta-verse falls apart

If we're talking messiest 2021 outages, we have to mention Facebook. Right before its Meta name change, Facebook accidentally shut down its own cloud services due to "configuration changes on the backbone routers that coordinate network traffic between our data centers," which cascaded and brought down all of its online services. As a result, no one worldwide could access any Meta services, including the company's own employees.

Even though Meta's cloud servers only power its own businesses like Facebook, Instagram, and WhatsApp, this outage still rippled out to hurt other companies. Any sites reliant on Facebook logins became inaccessible to their users, while shopping sites and games reliant on Meta's servers or tokens shut down as well.

Plus, of course, this Facebook outage undermined its own cloud-powered peripherals. Quest 2 owners could no longer access their library of games due to the Facebook account requirement, while Ray-Ban Stories smartglasses lost their smarts. We commented at the time that Facebook needs to add offline support for its tech in the future.

Above all, the 6-hour WhatsApp outage proved the worst fiasco for the company. For the millions who use the app as their primary way to communicate with family, even part of a day without it was too long. After the outage, Telegram reportedly got 70 million new members. That doesn't necessarily mean WhatsApp lost that many users, but it likely saw a significant exodus that it might never win back.

WhatsApp, Facebook, and Instagram had a similar outage in April 2021, though that one lasted a mere 45 minutes.

3. Fastly takes down the internet

When something works, you don't pay attention to it. So a lot of people had never heard of Fastly's content delivery network (CDN) until it broke in June, dragging down some of the most popular websites with it.

A CDN helps cache content for faster loading times and reduced bandwidth load on hosting servers, which is why so many companies rely on them. They deliver data at high speeds across the globe, ensuring that data transfers to different locations around the world to keep load times low regardless of where the user lives.
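
As a rough illustration of the caching idea, here is a minimal Python sketch of a single edge node. The `EdgeCache` class, its `ttl_seconds` parameter, and the `origin` function are hypothetical names made up for this example; this is not Fastly's API, and real CDNs are far more involved.

```python
from typing import Callable, Dict, Optional, Tuple
import time


class EdgeCache:
    """Toy model of one CDN edge node (hypothetical, for illustration only)."""

    def __init__(self, fetch_from_origin: Callable[[str], str], ttl_seconds: float = 60.0):
        self._fetch = fetch_from_origin
        self._ttl = ttl_seconds
        # Maps a path to (cached body, time at which the copy expires).
        self._store: Dict[str, Tuple[str, float]] = {}
        self.origin_hits = 0

    def get(self, path: str, now: Optional[float] = None) -> str:
        now = time.monotonic() if now is None else now
        cached = self._store.get(path)
        if cached is not None and now < cached[1]:
            # Fresh copy on the edge: the hosting server is never contacted.
            return cached[0]
        # Missing or expired: go back to the hosting (origin) server.
        self.origin_hits += 1
        body = self._fetch(path)
        self._store[path] = (body, now + self._ttl)
        return body


def origin(path: str) -> str:
    """Stand-in for the site's own hosting server."""
    return f"<html>{path}</html>"


edge = EdgeCache(origin, ttl_seconds=60)
edge.get("/home", now=0)    # first request: fetched from the origin
edge.get("/home", now=30)   # served from the edge cache
edge.get("/home", now=61)   # cached copy expired: fetched again
print(edge.origin_hits)
```

Three requests reach the edge, but only two reach the origin. That is the bandwidth saving sites rely on, and also why a broken edge layer takes those sites down with it.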

But in the case of Fastly, a faulty service configuration "triggered disruptions across our POPs globally," which hurt the sites that relied on its edge computing. Specifically, sites like Amazon, Twitter, Reddit, Google, CNN, the Guardian, and The New York Times all went down at once in early June. Fastly restored "95%" of its services within 49 minutes, making this a broad but relatively short-lived outage compared to the rest.

4. Four PSN outages made for a messy PS5 year

Assuming you managed to buy a PS5 this year, you likely ran into problems accessing your library or playing multiplayer games at some point in 2021. Sony and the CDN Akamai Technologies dealt with several outages throughout the year.

The worst, most protracted PSN outage ran from late February through early March, leaving some PS5 and PS4 players intermittently unable to access their game libraries across several days.

Yet three more outages in subsequent months indicated that Sony had fundamental network issues to work through. In each case, players around the world would encounter error messages about maintenance when accessing online services, with outages lasting anywhere from 1 to 5 hours.

Among the best PS5 games, many require constant online connections or revolve around multiplayer. If PSN keeps going down like this in 2022, that's bound to make Sony's loyal fans unhappy.

5. Google can't Assist its smart Home customers

Our first major outage of 2021 came in February courtesy of Google Assistant's sudden bout of amnesia. If you attempted to ask your Nest or Google Home speaker a question, you'd be told the "device is not yet set up" despite all evidence to the contrary. That made it impossible to connect to the Google Home devices associated with your account, from smart lights to Nest security tech. Plus, the Google Assistant Android app also had issues answering questions.

This appeared to impact all Google Home users, many of whom took to Reddit and support forums for help. Google fixed the issue the same evening, a few hours after it became widely known, though it's not clear exactly when it started.

6. Wink's smart home winks out

Most of the worst 2021 outages affected a wide range of sites for a relatively short time. The award for the single worst outage of the year, however, goes to Wink Hubs, which shut down for 10 days. Due to their new dependence on cloud services to work, these hubs could no longer control Zigbee or Z-Wave products at all, making them all but worthless.

Wink offered a 25% discount on its subscription costs as an apology but, as far as we know, never actually explained what caused the issue, only stating it would "optimize the Wink Backend and our API now that it is back up." Many customers saw this outage as a sign that it was time to abandon Wink for good.

7. The Android Exposure Notifications System goes kaput

When it comes to contact tracing and preventing COVID-19 exposure, any delay in knowing your condition can lead to further spread and sickness. So when the NHS COVID-19 app glitched due to issues with the Android Exposure Notifications System in Google's backend, that wasn't a good look for Google.

People wanting to check their status found an indefinite "Loading" screen. Google announced it would look into the issue after about 12 hours of bug reports, then took an additional 5-6 hours to resolve the bug. Add in the creepy "phantom notification" glitch from 2020 (incorrect notifications that users had been exposed to COVID-19 would pop up, then vanish before you could tap on them) and people had plenty of reasons to distrust the app by that point.

8. AWS outage redux

Following the major AWS outage on Dec. 7, we saw a second AWS outage on Dec. 15 caused by issues at Amazon's Oregon and Northern California facilities. This time, it took out Twitch, DoorDash, Xbox Live, PSN, Ring, Disney+, and T-Mobile.

Then, we saw a third AWS outage on Dec. 22 that shut down Fortnite, Hulu, Quora, Slack, and Imgur. In this case, a power outage at an east-coast facility caused the issue. So that made three outages in three weeks. The latter two outages only lasted an hour or so, though that's certainly long enough to cause problems.

Will the outage problem diminish or grow in 2022?

These various events highlight how fragile our current cloud-dependent system can be. With so much of our internet use concentrated on a few apps and services — most of which use a few major cloud infrastructure providers — a single crisis can cripple our productivity or render our expensive tech useless.

So can we hope for fewer mishaps next year?

To see fewer outages, we'd need to see more investment in cloud infrastructure. The recent infrastructure bill has billions allotted for improving high-speed, rural broadband access and civilian cybersecurity, but most of the worst 2021 outages came from company errors, not hostile actors. So we may have to count on (or pressure) companies to invest more in cloud infrastructure themselves.

As it stands, Gartner predicts companies will spend $482 billion on cloud services in 2022, a 21.7% increase. That should be a step in the right direction, at least.
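
To put that figure in context, a quick back-of-the-envelope calculation gives the spending level it implies for the year before. This assumes the 21.7% increase is measured against 2021, which the forecast as quoted here doesn't state outright; the variable names are ours.

```python
# Figures quoted above from the Gartner forecast.
forecast_2022_billion = 482.0
growth_rate = 0.217  # assumption: year-over-year growth versus 2021

# Work backwards: base * (1 + growth) = forecast.
implied_base_billion = forecast_2022_billion / (1 + growth_rate)
implied_increase_billion = forecast_2022_billion - implied_base_billion

print(f"Implied prior-year spend: ${implied_base_billion:.1f} billion")
print(f"Implied increase: ${implied_increase_billion:.1f} billion")
```

Under that assumption, the forecast works out to roughly $86 billion in additional spending on top of a base of about $396 billion.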

It's important to note that many of the worst outages stemmed from companies' internal monitoring networks or from third-party CDNs, not the main servers. The very systems meant to oversee and prevent outages can bring the whole system down in the wrong circumstances, where human error can have disproportionate consequences. And while CDNs are vital for providing the fastest possible traffic, they do add one more potential step where something can go wrong.

When a single node, server, or data center can topple the system, it doesn't matter how much you invest. For major outages to decrease in 2022, we need companies to structure their data better, so backups can kick in quickly until the problematic node is fixed. We're in much better shape than we were two years ago, but we have a long way to go before outages become rare.
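
What "backups kicking in" might look like can be sketched in a few lines of Python. Everything here is hypothetical and simplified: `fetch_with_failover`, `primary_region`, and `backup_region` are invented names, and real failover also has to handle health checks, data replication, and timeouts.

```python
from typing import Callable, List, Sequence


class AllBackendsFailed(Exception):
    """Raised when every backend in the list is unreachable."""


def fetch_with_failover(backends: Sequence[Callable[[], str]]) -> str:
    """Try each backend in order and return the first successful response."""
    errors: List[Exception] = []
    for backend in backends:
        try:
            return backend()
        except ConnectionError as exc:
            # This node is down; remember why and move on to the next one.
            errors.append(exc)
    raise AllBackendsFailed(f"{len(errors)} backend(s) failed")


def primary_region() -> str:
    """Stand-in for a data center that is currently down."""
    raise ConnectionError("primary region unreachable")


def backup_region() -> str:
    """Stand-in for a healthy backup in another region or provider."""
    return "served from backup"


result = fetch_with_failover([primary_region, backup_region])
print(result)
```

The point of the sketch is the ordering: a site with only `primary_region` in its list fails outright, while one with an independent backup keeps serving until the broken node is fixed.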
