What Boeing’s door-plug debacle says about the future of aviation safety

This article was originally featured on MIT Press.

On January 6, as Alaska Airlines Flight 1282, a Boeing 737 MAX 9, was climbing out of Portland, a large section of the aircraft’s structure, a fuselage door-plug, broke free in flight. With the plug gone, the cabin violently decompressed with a clamorous boom and a gale that ripped headrests from their moorings. The mother of a teenage boy seated just in front of the rupture clung to him as his shirt was torn from his body and sucked into the void.

Nobody died in the harrowing incident, somewhat miraculously, but it was a very close call. If the seats directly next to the failed fuselage section had not been empty, or the seatbelt light had not been lit, the event would probably have been deadly.

Failures in modern jetliners are extremely uncommon events in general, but even in this context, the failure looks unusual and concerning. Definitive explanations for why it occurred will take time, but preliminary reports strongly indicate that its proximate cause was shockingly mundane: It seems Boeing, or one of its contractors, simply failed to secure the plug correctly. The errant door-plug appeared to be missing crucial bolts when it was discovered in a residential neighborhood, and subsequent inspections have reportedly revealed improperly bolted plugs on other fuselages. If this theory is confirmed, then it will be the sheer ordinariness of this failure that sets it apart. This is because when jetliners fail for mechanical reasons, those reasons tend to be much more complicated and interesting (at least from an engineering perspective). For a flight to be imperiled by a prosaic and eminently avoidable manufacturing or maintenance error is an anomaly with ominous implications.

To understand what I mean here, it helps to put the incident into context, and for that, it helps to step back and think briefly about the inherent difficulties of making jetliners as reliable as we have come to expect. Extreme reliability is hard, especially in complex technologies that operate in unforgiving environments. This is intuitive enough. But the nature of the challenges it poses, and the manner in which the aviation industry has managed those challenges, are both widely misunderstood.

The extreme levels of reliability that we expect of jetliners pose meaningfully different challenges than the “normal” reliability we expect of almost any other system. In essence, this is because designing a system that won’t fail very frequently requires that engineers understand, in advance, how it will function, and thus how it might fail to function. Engineers can’t just wait for jetliners to crash to learn how reliable they are! Moreover, the effort required to achieve extreme reliability doesn’t scale proportionally with the desired level of safety: doubling the reliability of a complex system, for instance, requires more than double the effort.

To appreciate the latter relationship, consider the work of building a system that is reliable 99.99 percent of the time (i.e., one that fails no more than once in every 10,000 hours of operation). To achieve this, engineers need to understand how the system will behave over that period of time: the external conditions it might face, how its many elements will interact with those conditions, and a great deal else. And for that they need abstractions—theories, tests, models—that are representative enough of the real world to accurately capture the kinds of eventualities that might occur only once in every 10,000 hours. Such representativeness can be challenging, however, because the real world is “messy” in ways that engineering abstractions never perfectly reproduce, and a lot of unexpectedly catastrophic things can happen in 10,000 hours. An unusual environmental condition might interact with a material in an unanticipated way, causing it to corrode or fatigue. An obscure combination of inputs might cause essential software components to crash or behave erratically. We don’t know what we don’t know, as the old truism goes, so these kinds of things are difficult to anticipate.

Now consider what happens as the reliability required of the system rises from 99.99 percent to 99.999 percent. To achieve this new benchmark, engineers need to account for eventualities that might occur not every 10,000 hours, but every 100,000 hours. And so it goes; each new decimal in this “march of nines” represents an order-of-magnitude rise in the obscurity of the factors that engineers need to capture in their abstractions and accommodate in their designs. With each increment, therefore, it becomes increasingly likely that experts’ reliability calculations will be undone by something significant hiding in a gap in their understanding of how the system functions: some property, or combination of circumstances, that nobody thought to test. (Elsewhere, I have proposed we call such failures “rational accidents,” partly because they arise from rationally held but nevertheless erroneous beliefs, and partly because it is rational, epistemologically, to expect them to occur.)
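
As a rough illustration of the arithmetic behind the “march of nines,” the sketch below tabulates how the interval between failures grows with each added nine. It assumes, as the text does, that a reliability of 99.99 percent is read as no more than one failure in every 10,000 hours of operation. The function names and the conversion to years of continuous operation are illustrative assumptions, not part of any industry standard.

```python
HOURS_PER_YEAR = 24 * 365  # assumption: continuous operation, leap years ignored


def reliability_label(nines: int) -> str:
    """Percentage written with `nines` nines in total, e.g. 4 -> '99.99'."""
    return "99." + "9" * (nines - 2)


def hours_between_failures(nines: int) -> int:
    """Hours of operation per failure, reading each nine as a factor of ten."""
    return 10 ** nines


def march_of_nines(max_nines: int):
    """Rows of (label, hours per failure, years of continuous operation)."""
    rows = []
    for nines in range(4, max_nines + 1):
        hours = hours_between_failures(nines)
        rows.append((reliability_label(nines), hours, hours / HOURS_PER_YEAR))
    return rows


if __name__ == "__main__":
    for label, hours, years in march_of_nines(9):
        print(f"{label:>10} percent: one failure per {hours:>13,} hours "
              f"(about {years:,.1f} years of continuous operation)")
```

Each row demands that engineers anticipate events ten times rarer than the row before it, which is why the effort involved grows faster than the reliability gained.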

This is the context in which we should understand the reliability of modern jetliners. Viewed through the lens of epistemological uncertainty and its hidden dangers, civil aviation’s safety record over the last few decades is little short of astonishing. The rate of airliner accidents attributable to technological failure implies that their critical systems have mean-times-to-failure not of 10,000 hours, and not even of 100,000 hours, but north of a billion hours. When reckoning with failures over this kind of timescale, even extraordinarily rare factors can become critical engineering considerations: Unexpected interactions or phenomena that might only show up with a particular phase of the moon or alignment of the stars. As a 20th-century engineering achievement, the sheer ordinariness and tedium of modern air travel is on par with the exceptionality and drama of NASA landing on the Moon. And insofar as the laurels for this achievement should be laid at the feet of any one organization, then it has to be Boeing.

The process by which Boeing and its peers achieved this lofty reliability is widely misrepresented and misunderstood. We have long been conditioned to think of engineering as an objective, rule-governed process, and aviation reliability is firmly couched in this language. So it is that the awesome mundanity of modern flight is ostensibly built on ever more detailed engineering analyses and rigorous regulatory oversight: standards, measurements, and calculations. Like sausages and scriptures, however, these formal practices look increasingly spurious when the circumstances of their production are examined closely. Not even the most exhaustive tests and models could hope to capture every subtlety of a jetliner’s real-world performance over billions of hours of operation. While rigorous analysis and oversight are undoubtedly vital, their usefulness wanes long before they can deliver the kinds of reliability jetliners demand. We can manage the performance of most systems in this way, but pushing past the limits and uncertainties of our abstractions to achieve the performance we expect of jetliners requires more. Herein lies the true engineering challenge of civil aeronautics, and the reason why the industry is so difficult for new entrants.

Examined closely, the industry achieved this feat by leveraging a series of pragmatic but ultimately unquantifiable practices. Stripped to their essence, these amount to a process of learning from experience. Engineers calculated and measured everything that could realistically be calculated and measured, then they gradually whittled away at the uncertainties that remained by interrogating failures for marginal insights that had eluded their tests and models. They slowly made jetliners more reliable over time, in other words, by using their failures as a scaffold and guide.

This learning process sounds simple, but it was actually a painful, expensive, decades-long grind, which depended for its success on several longstanding and often challenging institutional commitments. For example, it necessitated a costly dedication to researching the industry’s failures and close calls, and an institutionalized willingness to accept findings of fault (something organizations naturally tend to resist). Perhaps most significantly, it depended on a deep-rooted adherence to a consistent and stable jetliner design paradigm: a willingness to greatly delay, or forgo entirely, implementing tantalizing innovations—new materials, architectures, technologies—that, on paper, promised significant competitive advantages.

These vital practices and commitments could never be wholly legislated, audited, and enforced by third parties due to the nuanced and necessarily subjective judgments on which they hinged. Regulators could demand that “new” designs be subjected to far more scrutiny than “light modifications” of prior designs, for instance, but they could never perfectly define what constituted a “light modification.” And, while rules could require that special precautions be taken for “safety-critical” components, the “criticality” of specific components would always be a matter of interpretation.

Huge financial stakes were involved in these ungovernable practices and interpretations, so the cultures in which they were made were extremely important. The people making strategic decisions at companies like Boeing (not that there are many companies like Boeing) needed to understand the significance of the choices they were making, and to do that they needed to be able to see past the rule-governed objectivity that frames the safety discourse around modern aviation. They had to realize that in this domain, if in few others, simply ticking every box was not enough. They also needed to be willing, and able, to prioritize expensive, counterintuitive practices over shorter-term economic incentives, and justify their decisions to stakeholders without appeals to quantitative rigor. This made aviation-grade reliability a huge management challenge as well as an engineering challenge.

So how does this understanding of aviation reliability help us make sense of Boeing’s recent missteps with its 737? Seen through this lens, the door-plug drama looks highly unusual in that it appears to have been an avoidable error. This is stranger than it seems. On the rare occasions when jetliner failures are attributable to the airplane’s manufacturer, they are almost always “rational accidents,” with root causes that had hidden in the uncertainties of experts’ tests and models. If the insecure plug was due to missing bolts, then this was something else. Securing bolts properly is about the lowest-hanging fruit of high-reliability engineering. It is the kind of thing that manufacturers ought to be catching with their elaborate rules and oversight, before they even begin their “march of nines.”

We should always hesitate to draw large conclusions from small samples, but a failure this ordinary lends credence to increasingly pervasive accounts of Boeing as a company that has gradually lost its way, its culture and priorities increasingly dominated by MBAs rather than the engineers of old. This is especially so when the failure is seen in conjunction with the 737 MAX disasters of 2018 and 2019, which were also rooted in avoidable design shortcomings, and the “Starliner” space capsule’s ongoing troubles.

This is probably the failure’s real significance: The underlying shift in institutional culture that it represents. Boeing will surely remedy any specific problem with missing or unsecured bolts; it would be truly incredible if that mistake was ever made again. The fact that the mistake was made at all, however, suggests an organization that is decreasingly inclined, or able, to make the kinds of costly, counterintuitive, and difficult-to-justify choices on which it built its exemplary history of reliability. These choices always pertain to marginal, almost negligible, concerns—simply because reliability at high altitudes is all about the margins—so their consequences manifest slowly. But their effects are cumulative and inexorable. A company that is not securing its bolts correctly is unlikely to be making the kinds of strategic decisions that pay dividends in decades to come.

John Downer is Associate Professor in Science and Technology Studies at the University of Bristol, and the author of “Rational Accidents.”