If you looked at the news or Twitter this morning — or perhaps you couldn’t, because your Internet was malfunctioning — you might have heard: Time Warner suffered a major outage in its Internet service at about 4:30 a.m. Eastern. The outage, affecting much of the U.S., lasted two hours, Reuters reported. Maps created by the outage-tracker DownDetector showed problems throughout the country. So how exactly could this happen?
Popular Science talked with Purdue University computer scientist Sonia Fahmy, who researches network performance, to get her guesses on the culprit.
She hypothesizes that Time Warner was updating the software that its routers use to talk with one another and route information. “Typically, these outages are due to the routing protocols,” she says. Routing is such a foundational function that a bug in it could cause widespread problems.
“Either they upgraded the software on some of the routers, and there was some kind of bug in it, or sometimes, it’s a human error,” she says. Configuring routers for a software update is a complex task, so people make mistakes.
Routers that are part of the Internet’s largest, core networks (the so-called Internet backbone) use the Border Gateway Protocol, or BGP, to tell each other which paths to use to send information on to the right destination. Fahmy thinks Time Warner could have been updating the software it uses to implement BGP, a protocol that is often involved in major outages.
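To see why one routing mistake can spread so far, here is a minimal sketch in Python of routers telling each other which paths to use. It is a toy, not BGP itself: real BGP weighs many more attributes and policies, while this version assumes every router simply keeps the shortest path it has heard. The five-router chain and the names `propagate`, `links`, `healthy` and `broken` are all invented for illustration.

```python
from collections import deque


def propagate(links, origins):
    """Spread route announcements until every router has picked a path.

    links:   dict mapping a router name to a list of its neighbors
    origins: routers that claim to reach the destination directly
    Returns a dict mapping each router to the path it chose (list of names).
    """
    best = {router: [router] for router in origins}
    queue = deque(origins)
    while queue:
        router = queue.popleft()
        for neighbor in links[router]:
            # Loop prevention: a router ignores any path that already contains it.
            if neighbor in best[router]:
                continue
            candidate = [neighbor] + best[router]
            # Simplifying assumption: shorter paths always win.
            if neighbor not in best or len(candidate) < len(best[neighbor]):
                best[neighbor] = candidate
                queue.append(neighbor)
    return best


# Hypothetical chain of routers: A - B - C - D - E
links = {
    "A": ["B"],
    "B": ["A", "C"],
    "C": ["B", "D"],
    "D": ["C", "E"],
    "E": ["D"],
}

# Healthy network: only E really reaches the destination.
healthy = propagate(links, ["E"])

# Misconfigured network: B wrongly announces that it reaches the destination too.
broken = propagate(links, ["E", "B"])

for name in sorted(links):
    print(name, "healthy:", healthy[name], "| misconfigured:", broken[name])
```

In the misconfigured run, routers A and C choose the shorter path through B, even though B cannot actually deliver the traffic. One bad announcement on one router is enough to misdirect its neighbors, which is the kind of software or configuration error that could plausibly ripple across a network.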
It’s the extent of the outage that makes Fahmy guess the problem was related to software rather than hardware. Service providers such as Time Warner have enough redundancy in their hardware to prevent these kinds of widespread issues. A broken router or cable normally causes smaller, more regional outages, she says.
Noticeable, hours-long outages may be becoming more frequent. Fahmy says she’s seen outages like Time Warner’s reported once every month or two. In addition to software problems, it seems companies’ routers are aging and running out of memory, which is more of a hardware problem but also a systemic one.
Researchers are working on making routing protocols less likely to fail. One promising solution is called Software Defined Networking, which lets companies use one machine, called the controller, to configure many routers at once. That way, there are fewer chances for a human expert to make a mistake when configuring a router.
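The controller idea can be sketched in a few lines of Python. This is an illustration under loud assumptions, not how any real Software Defined Networking product works: the `Router` and `Controller` classes, the `validate` check and the settings `protocol_version` and `max_routes` are all hypothetical. The point it shows is that the settings are checked once, in one place, before any router is touched.

```python
from dataclasses import dataclass, field


@dataclass
class Router:
    """A stand-in for a real router: just a name and its current settings."""
    name: str
    config: dict = field(default_factory=dict)


def validate(settings):
    """Reject obviously bad settings (hypothetical rules for illustration)."""
    if "protocol_version" not in settings:
        raise ValueError("missing protocol_version")
    max_routes = settings.get("max_routes")
    if not isinstance(max_routes, int) or max_routes <= 0:
        raise ValueError("max_routes must be a positive integer")


class Controller:
    """One machine that configures many routers at once."""

    def __init__(self, routers):
        self.routers = list(routers)

    def push(self, settings):
        # Check once, centrally, before changing anything.
        validate(settings)
        for router in self.routers:
            # Each router gets its own copy of the same settings.
            router.config = dict(settings)


routers = [Router("r1"), Router("r2"), Router("r3")]
controller = Controller(routers)
controller.push({"protocol_version": 4, "max_routes": 500000})

try:
    # A typo-style mistake: a negative route limit.
    controller.push({"protocol_version": 4, "max_routes": -1})
except ValueError as error:
    print("rejected:", error)

for router in routers:
    print(router.name, router.config)
```

Because the bad settings are rejected before the loop runs, none of the three routers ends up with them. Compare that with an engineer typing the same settings into each router by hand, where every router is a fresh chance to make a mistake.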