Why websites still break at the worst possible times
Even the mighty Amazon isn't immune from system failures.
Earlier this week, Amazon held its annual celebration of consumerism known as Prime Day, a 36-hour orgy of buying that spanned over a dozen countries. By Amazon’s own reporting, it was a success. Consumers bought over 100 million products, according to a statement from the company. They gobbled up deals, purchasing more than a quarter-million Instant Pots, more than a million smart-home gadgets, and stuff like water filters, DNA tests, and school supplies.
But it wasn’t smooth sailing. There were widespread reports of problems: shoppers saw an error page featuring a dog, found their shopping carts suddenly empty, or couldn’t load the “shop all deals” page. The chart at downdetector.com shows a spike in reported issues on Monday, July 16, when Prime Day kicked off.
Amazon has not explained what caused those problems. “It wasn’t all a walk in the (dog) park, we had a ruff start – we know some customers were temporarily unable to make purchases,” the company said in a statement, referring to those canine-filled error pages.
All of this raises two questions: If even a web behemoth like Amazon.com can suffer hiccups, how do companies prepare when they know to expect a flood of traffic—and why do those systems still sometimes fail?
Fault tolerance and the Chaos Monkey
Companies need to set themselves up in advance for a deluge of traffic, like stockpiling your kitchen before a flood of hungry guests arrives, and making sure you can still make a quick grocery run if supplies fall short.
One tactic is to ensure they have enough computing capacity to dynamically adjust to the traffic they get. An easy way to do that is to take advantage of the vast scale of cloud computing from the likes of Amazon Web Services (AWS) and competitors such as Google Cloud Platform and Microsoft Azure. Then a company’s computing capacity can do what the industry calls “elastic scaling”: as it needs more resources—computing power in response to web traffic—it can get them in real time. It’s the equivalent of calling in computing reinforcements.
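At its core, elastic scaling is a control loop: measure incoming load, compare it to current capacity, and add or remove machines to stay near a utilization target. Here’s a minimal sketch of that arithmetic in Python; the per-server capacity, utilization target, and traffic numbers are invented for illustration, and real autoscalers (such as the one AWS offers) work on richer metrics like CPU utilization.

```python
import math

def servers_needed(requests_per_sec, capacity_per_server=100,
                   target_utilization=0.7):
    """Return how many servers keep utilization near the target.

    Hypothetical numbers: each server handles ~100 requests/sec, and
    we aim to run the fleet at ~70% capacity so spikes have headroom.
    """
    raw = requests_per_sec / (capacity_per_server * target_utilization)
    # Round up so the fleet is never under-provisioned; keep at least one.
    return max(1, math.ceil(raw))

# A quiet day: 500 req/s fits comfortably on 8 servers.
print(servers_needed(500))      # 8
# A Prime Day-scale spike: 50,000 req/s calls for 715 servers.
print(servers_needed(50_000))   # 715
```

The point of the headroom factor is that scaling up takes time; running at 70% rather than 100% buys the minutes needed to bring reinforcements online.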
Of course, there’s a hint of irony in the fact that Amazon.com had problems on a day it knew it would receive a surge of traffic, given that it owns AWS, a service it sells to other companies precisely to avoid such problems.
“The solution to every problem is to add more machines—you can’t do that if you’re a mom-and-pop shop,” says Justine Sherry, an assistant professor of computer science at Carnegie Mellon University who studies computer networks. “You’re probably still better off having resources that Amazon [via AWS] has put together, than what you can cobble together yourself.”
A related way companies keep traffic flowing smoothly is by using load balancers—machines in a data center that decide which other machines in the same center handle incoming requests, an important task whether traffic is light or heavy. The machine you end up talking to serves a copy of the website you want to visit, called a replica.
“The load balancer is just choosing a replica to give you,” Sherry says. “That’s really the magic that makes cloud computing work—it looks like one machine, but it’s actually thousands or hundreds of thousands, and that’s why they can handle so much load.”
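In its simplest form, that choice is just a rotation: hand each incoming request to the next replica in line. A toy round-robin sketch (the replica names are made up, and production load balancers also account for health checks and each replica’s current load):

```python
import itertools

class RoundRobinBalancer:
    """Toy load balancer: each request goes to the next replica in rotation."""

    def __init__(self, replicas):
        # cycle() loops over the replica list forever.
        self._cycle = itertools.cycle(replicas)

    def pick(self):
        """Choose the replica that will handle the next request."""
        return next(self._cycle)

lb = RoundRobinBalancer(["replica-a", "replica-b", "replica-c"])
# Six requests are spread evenly across the three replicas.
print([lb.pick() for _ in range(6)])
# ['replica-a', 'replica-b', 'replica-c', 'replica-a', 'replica-b', 'replica-c']
```

To the visitor, it all looks like one website; behind the balancer, any of thousands of interchangeable replicas may have answered.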
That’s not all companies do when running data centers. They also prepare for the possibility that some piece of the system will fail: if it does, will the rest still work? For that, like an airplane, they need redundancy. The concept is called “fault tolerance.” And to test their fault tolerance, engineers will purposely conduct stress tests.
“I often find the kinds of things that they do really surprising—because they generally go in and try to break their own machines,” Sherry says. One tool for that is Netflix’s aptly named Chaos Monkey, which is software that disables parts of a system to see how it holds up to partial failure.
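The Chaos Monkey idea can be illustrated in a few lines: knock out a replica at random, then check whether the service still answers. This is a simplified sketch of the principle, not Netflix’s actual tool; the fleet and failure model here are invented for illustration.

```python
import random

def serve(request, fleet):
    """Route a request to any healthy replica; fail only if none remain."""
    healthy = [name for name, up in fleet.items() if up]
    if not healthy:
        raise RuntimeError("total outage: no healthy replicas")
    return f"{request} handled by {random.choice(healthy)}"

# Five interchangeable replicas, all healthy to start.
fleet = {f"replica-{i}": True for i in range(5)}

# Chaos step: disable one replica at random, as Chaos Monkey would.
victim = random.choice(list(fleet))
fleet[victim] = False

# Fault tolerance means requests keep succeeding with a replica down.
for i in range(3):
    print(serve(f"request-{i}", fleet))
```

If requests start failing with one replica disabled, the test has done its job: it found the missing redundancy before real traffic did.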
Another strategy for ensuring server stability: don’t mess with it before all the traffic hits. It’s a common approach in the retail industry as the holiday shopping season approaches, says Shuman Ghosemajumder, the CTO of cybersecurity company Shape Security. “They will lock down their infrastructure well before the peak season begins,” he says. “Often in September, sometimes as early as August, they’ll say ‘no changes are going into our infrastructure—because we just don’t understand what effect they might have under load.’”
Bottlenecks and moving parts
Yet despite the best preparations, services still fail. After all, they are complex systems.
One possibility is a bottleneck: a part of the system gets bogged down under the stress, even though the other parts are able to handle all the traffic. Think of it like a restaurant with plenty of open tables and servers, but a very narrow entranceway that is hard for customers to get through. A line forms, even if the restaurant has the right capacity in the kitchen and dining room and tons of food to serve.
One potential bottleneck in a data center is, ironically, those load balancers themselves, Sherry says. Another is the network itself, when data can’t reach the data centers in the first place. In Amazon.com’s case, Sherry speculates that the culprit was something internal and specific, like a part of the database that handles accounts or product lists.
“Usually these things come down to bottlenecks,” she says. “Some piece of code or some piece of infrastructure that it turned out couldn’t handle the load, and it didn’t have enough redundancy built in.”
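The bottleneck effect comes down to simple arithmetic: a pipeline can serve requests no faster than its slowest stage, no matter how fast everything else runs. A toy model of that idea, with invented stage names and throughput numbers:

```python
# Requests per second each hypothetical stage can sustain.
stages = {
    "load_balancer": 50_000,
    "web_servers":   40_000,
    "product_db":     5_000,   # the under-provisioned piece
    "payment_api":   20_000,
}

# End-to-end throughput is capped by the slowest stage.
bottleneck = min(stages, key=stages.get)
throughput = stages[bottleneck]
print(f"Bottleneck: {bottleneck} caps the site at {throughput} req/s")
```

In this made-up example, tripling the number of web servers would change nothing; only adding redundancy at the database stage would raise the ceiling.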
Another intriguing source of strain on big systems is traffic that doesn’t come from regular consumers shopping for Instant Pots at all. Companies that compete directly with Amazon could play a role, too, Ghosemajumder says. “In many cases, what they’ll be doing is scraping the price information off of their competitors’ sites in order to be able to make changes to their own sites, sometimes automatically,” he says. He notes that during this Prime Day, he saw traffic increases to other retail sites.
Ultimately, “it just comes down to there being so many moving parts,” Sherry says. “It’s hard to control for every possible case, every possible thing that could happen… I don’t think that we’ll ever reach a point where we don’t have some downtime.”
Update on July 19: CNBC has a report, based on internal Amazon documents, explaining further what happened.