Learning From Mistakes: 5 Lessons from NASA's Challenger Disaster

Big mistakes may grab the headlines, but it’s the path leading up to the big mistakes that intrigues me. Seemingly inconsequential mistakes, compounded over time, are the foundation for disasters.

Common wisdom says we should learn from our mistakes. Easy to say, tougher to implement. There’s a tendency to look for a smoking gun – find the people who screwed up and fire them. Or find the one piece of technology that failed and replace it. If only it were that easy.

Problems are rarely traced to a single cause. Organizations often fall into the trap of seeking a scapegoat because it’s convenient, not because it reveals the truth. Several interrelated factors complicate learning from mistakes:

·         Poor environmental design – we underestimate how the external environment influences decisions.

·         Misaligned incentives – people respond to actual incentives, which may differ from what the organization intended.

·         Poor communication – a lack of honest and direct communication; no one says what they really think.

·         Competing priorities – boards of directors, executives, managers, and frontline workers all have different goals, desires, and concerns.

Creating an organization capable of identifying issues before they become disasters is essential. Unfortunately, organizational dynamics make it hard to get ahead of potential problems. So how do we ensure that our organizations execute properly? How do we build a robust organization that is open to recognizing mistakes rather than protecting egos?

Fortunately, we can learn from one of the great disasters in U.S. history: The Space Shuttle Challenger. The Challenger disaster showcases the difficulties of developing an organization receptive to recognizing its mistakes.

Most readers are familiar with the story. On January 28, 1986, the Space Shuttle Challenger launched from Florida. Seventy-three seconds into the mission, the shuttle broke apart over the Atlantic Ocean. There were no survivors. The proximate cause was the failure of an O-ring in the right solid rocket booster.1

It’s not surprising that high-risk technologies, like space exploration, have accidents. The risk of failure can never be driven down to zero.

But Challenger was different. The investigation found that the catastrophe wasn’t caused by an unpredictable issue.

Instead, it was a series of repeated, avoidable mistakes.

It wasn’t a failure of intelligence. It was a combination of organizational design flaws, poor incentives, and behavioral biases.

It’s important to reiterate:

·         It wasn’t one person that screwed up

·         It wasn’t due to bureaucracy

·         It wasn’t due to bad leadership

·         It wasn’t due to high-risk technology

The unpredictable interaction of all these factors led to the mistake. Each factor alone wasn’t enough to cause the disaster. The unique combination was the cause. And that combination is what makes it so difficult for an organization to rectify these problems in advance. The CEO can’t do it alone. Nor can anyone else do it on their own. It’s a cultural effort that can never stop.

By internalizing the following lessons, we can have a deeper appreciation for the difficulties of creating a responsive and ego-free organization.

1)      The Danger of Arrogance & Ego

Laurence Gonzales, author of Everyday Survival, recapped several mistakes inside NASA. The first notable problem was the growing sense of arrogance and ego:

The combination of groupness and persistent mental models made for an organization that could not take in new information when that information did not accord with its indelible concept of itself as the “perfect place.”2

It’s a challenge to stay humble when you work for an elite organization. It’s hard not to believe in your own greatness after continued success.

That’s why we need other people to check the ego we all carry. They reveal the fallibility we would otherwise miss in ourselves.

NASA didn’t do this. As Gonzales explains:

Because NASA believed that “we’re the best” and that “failure is not an option,” all information tended to support that conclusion, no matter how contrary it might have seemed to an outsider.3

Every time NASA succeeded, its ego grew. That ego prevented honest recognition of the flaws in its process.

The best way to prevent this is to expose your ideas and process to outsiders. Others can help you, but only if you let them. If you create a protective shell around your team, mistakes will only accumulate. Complacency replaces excellence. It works until it doesn’t.

2)      Seek Disconfirming Evidence

In the case of Challenger, engineers were faced with the fact that fuel in the solid rocket boosters was burning through the rubber O-rings that sealed the seams where sections of the rockets were joined. Groupness dictated that no one outside the immediate culture was fit to judge the fruits of their labors. Confirmation bias is a phenomenon in psychology by which people tend to take any information as confirmation of what they already believe. In addition, they tend to ignore or miss any information that doesn’t confirm what they already believe.4

Confirmation bias is a close companion to ego. We readily accept our good outcomes but conveniently explain away our faults.

Bad outcomes take time to develop; they aren’t instantaneous. Disconfirming evidence is our tool for negating the ego before those outcomes arrive.

Because ego builds without our conscious attention, it can quietly lead to very big disasters.

In NASA’s case, it culminated with the Challenger:

Each of the behaviors that have wound up contributing to a predicament has its origins in a large investment of effort that produced triumphant results. Despite signs of trouble, we adjust our mental models to accommodate larger deviations from the norm. Our groupness helps to keep bad news from upsetting our view. Without a mechanism for reframing our behavior or redefining our group, the effects are ignored, as they were at NASA, until a catastrophe happens.5

It’s a challenge to overcome confirmation bias. Most of us work and study hard. We have decades of experience. It feels odd that we still don’t know everything. It seems like there should be a point where we don’t need to learn.

In complex activities like space exploration, you never get to take a break. Complex environments change too fast to allow rest. As soon as you think you’ve figured everything out, you’re done.

3)      Low Tolerance for Shortcuts

Successful organizations have a low tolerance for shortcuts. It’s not because they like to micromanage. Instead, they know that as an organization expands, it becomes impossible to monitor every process from the top down. It’s up to individuals and a disciplined process to avoid disaster.

NASA learned this the hard way:

Engineers and managers incorporated worsening anomalies into the engineering experience base, which functioned as an elastic waistband, expanding to hold larger deviations from the original design. Anomalies that did not lead to catastrophic failure were treated as a source of valid engineering data that justified further flights.6

Once shortcuts infiltrate the process, it becomes hard to track their effects. They become accepted as a cultural norm. As seen in the Challenger example, errors start to have a numbing effect on the organization. Workers and managers ignore the warning signs. Since things aren’t falling apart, people assume everything must be fine.

As with any complex system, risk doesn’t grow linearly but exponentially. As mistakes accumulate, they begin to interact with one another. Any single mistake may not be an issue on its own. But combine it with several others, and you create a problem the organization can’t predict.

Gonzales continues:

We have also found that certification criteria used in Flight Readiness Reviews often develop a gradually decreasing strictness. The argument that the same risk was flown before without failure is often accepted as an argument for the safety of accepting it again. Because of this, obvious weaknesses are accepted again and again, sometimes without a sufficiently serious attempt to remedy them, or to delay a flight because of their continued presence.7

Instead of investigating deviations, NASA deflected responsibility by falling back on its history of success, not recognizing the accumulating flaws in its process:

In fact, previous NASA experience had shown, on occasion, just such difficulties, near accidents, and accidents, all giving warning that the probability of flight failure was not so very small. The inconsistency of the argument not to determine reliability through historical experience, as the range safety officer did, is that NASA also appeals to history, beginning "Historically this high degree of mission success..."8

4)      Beware of Empty Rhetoric

Engineers decided with data. Management decided with fantasy. Gonzales highlights the growing gap between the judgment of NASA’s engineers and that of its management:

Finally, if we are to replace standard numerical probability usage with engineering judgment, why do we find such an enormous disparity between the management estimate and the judgment of the engineers? It would appear that, for whatever purpose, be it for internal or external consumption, the management of NASA exaggerates the reliability of its product, to the point of fantasy.9

Famed physicist Richard Feynman, who served on the commission that investigated the disaster, confirmed:

I saw considerable flaws in their logic. I found they were making up numbers not based on experience. NASA's engineering judgment was not the judgment of its engineers.10

Past success tends to enlarge one’s ego rather than one’s ability. You can get away with it for a while, but eventually the truth is revealed. It’s tempting to exaggerate your organization’s ability and success. It’s a natural consequence of believing in your team and wanting to keep morale high, including among those above you. However, exaggerations have a way of growing over time. What starts as a little embellishment soon turns into an excessive and unrealistic assessment of future performance and ability.

As many readers know, NASA has had not one, but two space shuttle disasters. On February 1, 2003, the Space Shuttle Columbia broke apart upon re-entry into the atmosphere. Gonzales explains:

This is how a simple mental model can be expected to function. It operates on a simple rule: if nothing bad happens, we must be doing something right. So influential were NASA’s models and scripts, and so delusional its self-confidence, bred of groupness, that even after Columbia broke up, killing all on board, the space shuttle program manager told the press that he was comfortable with his previous assessments of risk and didn’t think the foam debris had caused the accident. Remember that a key feature of the system is that, taken one small step at a time, each decision always seems correct.11

It’s critical to manage expectations from the beginning. That doesn’t mean you downplay your team or its ability. But you must understand how luck and external factors drive success. If you start to take credit for things outside your control, what will you do when they’re no longer there? Leaders need to distinguish the success attributable to their process from ordinary luck.

If you see flaws in your process, you need to immediately assess and correct, even as you continue to succeed.

Success shouldn’t drive confidence; the robustness of your process drives confidence.

Success is a lagging indicator. It doesn’t reveal anything about your process today. It only shows what was good in the past.

You need to focus on what you can control. Your process. Your team. Your daily activity.

Leaders have to maintain disciplined standards during the good times. It’s easy to get sloppy when successful. When things go wrong, any idiot can find things to criticize. But by then, it’s too late. 

5)      Team Communication

In any organization, how things work in theory and how things work in practice are light years apart. Executives have their idealized views, while the workers have a truer sense of how things actually operate.

Gonzales explains how this played out at NASA:

[NASA] Official management, on the other hand, claims to believe the probability of failure is a thousand times less. One reason for this may be an attempt to assure the government of NASA perfection and success in order to ensure the supply of funds. The other may be that they sincerely believed it to be true, demonstrating an almost incredible lack of communication between themselves and their working engineers.12

In any high-risk organization, there needs to be direct communication about problems, issues, and errors. Once you start censoring your opinions and conforming them to cultural norms, you begin to put internal politics above success. Once the trust of open communication is lost, leaders can never be sure of what is going on at the operational level.

It’s rare to see colleagues openly debate and discuss important issues. It’s uncomfortable to discuss things that aren’t working well. It’s easier to talk about change than it is to implement it. It’s easier to assume it’s all going well rather than investigate.

Great organizations minimize the political and managerial impediments to honest and immediate communication. These impediments can never be fully eliminated, but reducing them starts at the top. Leaders who actively commit time and energy to discussing potential issues prove its importance to the team. Leaders can’t just talk about it. They must do it, and make time for it. Workers aren’t stupid. They’ll figure out what ultimately gets rewarded. As a leader, it’s about what you do, not what you say.

To Summarize the Five Takeaways:

·         Eliminate Arrogance & Ego

·         Seek Disconfirming Evidence

·         Have a Low Tolerance for Shortcuts

·         Beware of Empty Rhetoric

·         Emphasize Team Communication