Recent AWS glitches illustrate the power, and fragility, of cloud computing
Amazon Web Services and other major cloud providers underpin much of the internet. Here's what to know about them—and why they can fail.
In the first two weeks of this month, Amazon Web Services (AWS) hit some bumps that caused two outages: a bigger, more widespread one on December 7, and a smaller, more localized one on Dec. 15. Both catalyzed disruptions across a range of websites and online applications, including Google, Slack, Disney Plus, Amazon, Venmo, Tinder, iRobot, Coinbase, and The Washington Post. These services all rely on AWS to provide cloud computing for them—in fact, AWS is the leading cloud computing provider among other big players like Microsoft Azure, Google, IBM, and Alibaba.
To understand why the impact was so big, and what steps that companies can take to prevent something like these disruptions in the future, it makes sense to take a step back and take a look at what cloud computing is, and what it’s good for.
So what is cloud computing and AWS?
Whenever you connect to anything over the internet, your computer is essentially just talking to another computer. A server is a type of computer that can process requests and deliver data to other computers in the same network or over the internet.
But running your own server isn’t cheap. You have to buy the hardware box, install it somewhere, and feed it a lot of power. In many cases, it needs internet connectivity too. Then, to ensure that data is received and sent with minimal delays, these servers need to be physically close to its users.
Additionally, you have to install software that needs to be updated regularly. And you have to build fail-safe mechanisms that will switch over operations to another server if a main server malfunctions.
“The thing that companies like Amazon noticed is that a lot of [computing infrastructure] is not really specific to the service you’re running,” says Justine Sherry, an assistant professor at Carnegie Mellon University.
For example, the code running Netflix does something different compared to the code running a service like Venmo. The Netflix code is serving videos to users, and the Venmo code is facilitating financial transactions. But underneath, most of the computing work is actually the same.
This is where cloud providers come in. They usually have hundreds to thousands of servers all over the country with good bandwidth. They offer to take care of the tedious tasks like security, day-to-day management of the data center operations, and scaling services when needed.
“Then you can focus on your [specialized] code. Just write the part that makes the video work, or the part that makes the financial transactions work. It’s easier, it’s cheaper because Amazon is doing this for lots and lots of customers.” Sherry explains. “But there are also downsides, which is that everyone in the world is relying on the same couple of Costco-sized warehouses full of computers. There are dozens of them across the US. But when one of them goes down, it’s catastrophic.”
What went wrong with AWS on Dec. 7 and 15
What caused the AWS outages appeared to be related to errors with the automated systems handling the data flow behind the scenes.
AWS explained in a post that the December 7 error was due to a problem with “an automated activity to scale capacity of one of the AWS services hosted in the main AWS network,” which resulted in “a large surge of connection activity that overwhelmed the networking devices between the internal network and the main AWS network, resulting in delays for communication between these networks.”
This autoscaling capability allows the whole system to adjust the number of servers it’s using based on the amount of users on the network. “The idea there is if I have 100 users at 7 am, and then at noon, everyone is on lunch break Amazon shopping and now I have 1,000 users, I need 10 times as many computers to interact with all those clients,” explains Sherry. “These frameworks automatically look at how much demand there is and can dedicate more servers to doing what’s needed when it’s needed.”
Later on December 15, a status update issued by AWS said that the outage was caused by “traffic engineering” incorrectly moving “more traffic than expected to parts of the AWS Backbone that affected connectivity to a subset of Internet destinations.”
Big data centers have lots of internet connections through different internet service providers. They get to choose where online traffic gets routed, whether it’s over one cable through AT&T, or another cable through Sprint.
Their automatic “traffic engineering” decides to reroute traffic based on a number of conditions. “Most providers are going to reroute traffic mostly based on load. They want to make sure things are relatively balanced,” Sherry says. “It sounds like that auto-adaptation failed on the 15th, and they wound up routing too much traffic over one connection. You can literally think of it like a pipe that has had too much water and the water is coming out the seams.” That data ends up getting dropped and disappears.
Despite some prevalent outages over the past few years, Sherry argues that AWS is “quite good at managing their infrastructure.” Inherently, it’s very difficult to design perfect algorithms that can anticipate every problem, and bugs are an annoying but regular part of software development. “The only thing that’s unique about the cloud situation is the impact.”
A growing number of independent companies are turning to third-party centralized services like AWS for cloud infrastructure, storage, and more.
“If I pay Amazon to run a data center for me, store my files, and serve my clients … they’re going to do a better job than I can do as an university administrator or as an administrator to a small company,” says Sherry. “But from a societal perspective, when all of these small individual actors decide to outsource to the cloud, we wind up with one really big centralized dependency.”
Back to basics?
During the time AWS went out, Sherry could not control her television. Normally, she uses her phone as a remote control. But the phone does not directly talk to the TV. Instead, both the phone and the TV talk to a server in the cloud, and that server is orchestrating that in-between. The cloud is essential for some functions, like downloading automatic software updates. But for scrolling through cable offerings available from an antenna or satellite, “there’s no reason that needs to happen,” she says. “We’re in the same room, we’re on the same wireless network, all I’m trying to do is change the channel.” The cloud can offer convenient tech solutions in some instances, but not every application needs it.
One account of a marooned technology that struck her most as an unnecessarily roundabout design was a timed cat feeder that had to go through the cloud. Automated cat feeders have been around before the cloud. They’re basically paired to an alarm clock. “But for some reason, someone decided that rather than building the alarm clock part into the cat feeder, they were going to put the alarm clock feeder in the cloud, and have the cat feeder go over the internet and ask the cloud, is it time to feed the cat?” Sherry says. “There’s no reason that that needed to be put into the cloud.”
Moving forward, she thinks that application developers should review every feature that’s intended for the cloud and ask if it can work without the cloud, or at least have an offline mode that’s not as completely debilitating during an internet, data center, or even power outage.
“There are other things that are probably not going to work. You’re probably not going to be able to log in to your online banking if you can’t get to the bank server,” says Sherry. “But so many of the things that failed are things that really should not have failed.”