This week saw two nearly simultaneous infrastructure failures in major industries: finance and transportation.
On Wednesday, July 8th, the New York Stock Exchange abruptly went down for a large chunk of the trading day. Suspicions of a cyber attack erupted almost immediately after the exchange went dark, but the NYSE denied this and later clarified that the problem stemmed from a compatibility issue with its gateway software.
The same day, United Airlines experienced a network crash it attributed to a faulty router connection that degraded connectivity across its systems. After 59 cancelled flights, the network was mostly back online.
Both were back online within a matter of hours, and while some damage was done, the majority of people went about their lives without a problem. But the frequency of these episodes is increasing as networks become more complicated and as we rely on them more for day-to-day life.
A New Type of Natural Disaster?
There’s an argument to be made that network outages are becoming the world’s most frequent natural disaster: while the results are more often inconvenience than destruction, they’re complicated to fix, and affect telecommunications, service providers, transportation, finance, and sometimes even medical devices.
So what’s causing the problems?
David Erickson of Forward Networks, a startup focused on bringing more computer science practices into networking, says the problem is more than just human error: it’s an increasingly complex and uncoordinated system of hardware and languages. “You’ve now got organizations that have thousands or tens of thousands of devices that are moving packets: routers, switches, firewalls–you name it,” he tells Popular Science, “and each of these things has anywhere between 1,000 and 1,000,000 or more rules that actually define the behavior of what it does with packets as they come in and out.”
Those things can be taught to play nice together, but Erickson says it’s a steep learning curve. “The net problem is that it’s primarily humans that are having to install, roll these things forward, fix them, evolve them, everything. And it’s no surprise at all that one misconfiguration can pretty easily bring down major critical systems, which is what you saw with United.”
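To make that concrete, here is a minimal, hypothetical sketch (in Python) of the kind of rule table Erickson is describing. The device names, prefixes, and actions are invented for illustration only; the point is how a single mistyped rule among thousands can quietly black-hole traffic that every other rule handles correctly.

```python
# Hypothetical sketch: devices, prefixes, and actions are invented examples.
import ipaddress

# Each device holds an ordered list of (prefix, action) rules;
# the first matching rule wins, as in a typical forwarding/ACL table.
RULES = {
    "core-router-1": [
        ("10.20.0.0/16", "forward:edge-switch-7"),
        ("0.0.0.0/0",    "forward:upstream"),
    ],
    "edge-switch-7": [
        # A single mistyped prefix -- /8 instead of a narrow /24 --
        # swallows far more traffic than intended and drops it.
        ("10.0.0.0/8",    "drop"),
        ("10.20.30.0/24", "forward:monitor-vlan"),
    ],
}

def evaluate(device, dst_ip):
    """Return the action applied by the first rule matching dst_ip."""
    addr = ipaddress.ip_address(dst_ip)
    for prefix, action in RULES[device]:
        if addr in ipaddress.ip_network(prefix):
            return action
    return "drop"  # implicit deny if nothing matches

# Traffic bound for the monitoring VLAN never reaches its intended rule:
print(evaluate("edge-switch-7", "10.20.30.15"))  # -> "drop", not "forward:monitor-vlan"
```

Multiply that by tens of thousands of devices, each configured by hand, and the odds of one such rule slipping through start to look less like bad luck and more like a certainty.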
The problem increases with each passing year. Erickson says that over time, these devices “have become more and more complex, you have more and more of them, and there are continually new software demands being placed on them.”
Erickson explains that many companies don’t realize that major chunks of their company and their operating systems are “just a couple of software misconfigurations away from being turned off or unavailable.” And it’s not just a time-and-money quandary when things go down. He points to a subreddit about network issues where a user offered an anecdote of neonatal heart monitors configured on a network that had stopped functioning.
Mo Complexity, Mo Problems
Part of the problem comes from the fact that networks are a relatively new infrastructure. Utilities like power and water have built up layers of safeguards, but comparable safeguards don’t yet exist for networking.
And networks get worse with age, not just because of new complexities but because of the sudden appearance of old problems.
Eric Hunsader of Nanex, LLC, a company that makes stock market information software, has been in software development for the financial industry since 1986; Nanex processes market data for stocks, options, futures, and everything else trading in the U.S. and worldwide. He explains that problems present from the start can take time to make themselves known. “As your product matures, after a while the only bugs left are the ones that nobody foresaw, and they tend to be the real difficult ones to figure out. So if technology is more complex with fewer errors, the few errors are significant.”
For the NYSE, shutdowns aren’t necessarily a problem, but bad timing can create a nightmare. “The nightmare is it happening one second before close,” says Hunsader. “The best time for something to cause trouble is 15:59:59. The problem is that so much of the system depends on those closing prices. You would back up everything.” He says that trades and options would have to be rolled back, and an error in the last few minutes of the day could cause the market not to open the next morning.
The good news is that the stock market, which is frequently moved by panic, usually shrugs off network issues. Hunsader says this outage had no psychological impact on the market. If it had been an attack, well, “I think it would be all the difference, because we’d all be thinking if they can take a server out, maybe they can do it again. Or even worse, to be able to do it without being detected or… the smart thing would have been to rake the system for money.”
Airlines are a little more sensitive: a down network means grounded planes, which can ruin everyone’s holiday weekend and cost millions in a matter of hours. But these problems aren’t as simple as someone kicking a power cord out of the back of a Netgear product.
Testing, Testing…
Right now there’s not really a technology in place that lets network experts test configurations in a vacuum. Erickson says changes are planned ahead of time, with group consensus being the only reliable estimate of what’s going to happen. Once it goes live, testing new system arrangements (like with the NYSE or United) is a race against time. “If you mis-configure a device that happens to be the core device at that moment in time, it doesn’t matter how much redundancy you have.”
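Forward Networks pitches exactly this kind of offline modeling, but as a rough illustration of the idea rather than of any vendor’s actual product, here is a toy sketch: model the network as a graph, simulate taking a device offline, and check whether the critical paths survive before anyone touches real hardware. The device names and the use of the networkx library are assumptions for the example.

```python
# Toy illustration only -- not Forward Networks' actual approach.
import networkx as nx  # assumed available; any graph library would do

def build_model(links):
    """Build an offline model of the network from a list of device-to-device links."""
    g = nx.Graph()
    g.add_edges_from(links)
    return g

def survives_outage(model, failed_device, critical_pairs):
    """Simulate one device going dark and check that every critical path still exists."""
    trial = model.copy()
    trial.remove_node(failed_device)
    return all(trial.has_node(a) and trial.has_node(b) and nx.has_path(trial, a, b)
               for a, b in critical_pairs)

links = [("trading-floor", "gateway"),
         ("gateway", "core-1"), ("gateway", "core-2"),
         ("core-1", "matching-engine"), ("core-2", "matching-engine")]
model = build_model(links)
pairs = [("trading-floor", "matching-engine")]

print(survives_outage(model, "core-1", pairs))   # True  -- the redundant core covers it
print(survives_outage(model, "gateway", pairs))  # False -- redundancy elsewhere doesn't matter
```

The second check is Erickson’s point in miniature: redundancy on the diagram doesn’t help if the misconfigured box is the one everything funnels through.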
The lack of a standardized language or set of practices means that experts are in demand. Erickson says to be a network’s perfect manager, “you’d need to be able to understand all of the devices you’ve got in your network. There are tons of vendors, hundreds if not thousands of devices.”
Is there someone trying to standardize these tools? Not really. Competitors have no incentive to collaborate when they can earn more by beating one another on innovation. And the reality of the market is that companies rarely replace every unit every time there’s an upgrade, so legacy systems and legacy software will always be a problem.
Erickson says that unless customers start demanding, “we need some sort of unifying standard here so we can have confidence our network is doing what we expect it to do all the time,” it’s just not going to happen.
And even if there were the desire, it’s a huge undertaking. “For a company to solve this,” Erickson explains, “they have to then go out and talk to every one of these devices and understand them extremely well. And that’s just really hard.”