Blind Spots in Your System: The Grave Risks of Overlooking Observability

I was an Enterprise Data Architect at Boeing NASA Systems on the day of the Columbia Shuttle disaster. The tragedy has had a profound impact on how I have looked at data throughout my career. Recently I have been struck by how purpose-built AI observability, had it been available back then, might have altered the course of events.

The Day of the Disaster

With complete reverence and respect for those lost in the Columbia disaster, I remember that day with incredible clarity. I was standing in front of my TV watching the Shuttle Columbia disintegrate on reentry. As the Enterprise Data Architect at Boeing NASA Systems, I received a call from my boss: “We have to pull and preserve all the Orbiter data. I will meet you at the office.” Internally, Boeing referred to the Shuttles as Orbiters. Only months earlier, I had had the opportunity to share punch and cookies with some of the crew. To say I was flooded with emotions would be a tremendous understatement.

When I arrived at Boeing NASA Systems, near Johnson Space Center, the mood in the hallowed aeronautical halls was somber. I sat intently, with eyes scanning lines of data from decades of space missions. The world had just witnessed a tragedy — the Columbia Shuttle disintegrated on reentry, taking the lives of seven astronauts. The world looked on in shock and sought answers. As a major contractor for NASA, Boeing was at the forefront of this investigation.

As the Enterprise Data Architect, I was one of an army helping dissect the colossal troves of data associated with the shuttle, looking for anomalies, deviations — anything that could give a clue as to what had gone wrong. Days turned into nights, and nights turned into weeks and months as we tirelessly pieced together Columbia’s final moments. But as we delved deeper, a haunting reality began to emerge.

Every tiny detail of the shuttle was monitored, from the heat patterns of its engines to the radio signals it emitted. But there was a blind spot, an oversight that no one had foreseen. In the myriad of data sets, nothing indicated what would happen when a piece of foam insulation struck the shuttle’s thermal protection system, especially at speeds exceeding 500 miles per hour.

The actual incident was seemingly insignificant — a piece of foam insulation had broken off and struck the shuttle’s left wing. But in the vast expanse of space and the brutal conditions of reentry, even minor damage could prove catastrophic.

Video footage confirmed: the foam had struck the shuttle. But without concrete data on what such an impact would do, the team was left to speculate and reconstruct potential scenarios. The lack of this specific data had turned into a gaping void in the investigation.

As a seasoned Enterprise Data Architect, I always believed in the power of information. I absolutely believed that in the numbers, in the bytes and bits, we find the stories that the universe whispers to us. But this time, the universe had whispered a story that we didn’t have the data to understand fully.

Key Findings

After the accident, NASA formed the Columbia Accident Investigation Board (CAIB) to investigate the disaster. The board consisted of experts from various fields outside of NASA to ensure impartiality.

1. Physical Cause: The CAIB identified the direct cause of the accident as a piece of foam insulation from the shuttle’s external fuel tank that broke off during launch. This foam struck the leading edge of the shuttle’s left wing. Although foam shedding was a known issue, it had been wrongly perceived as non-threatening because of prior flights where the foam was lost but didn’t lead to catastrophe.

2. Organizational Causes: Beyond the immediate physical cause, the CAIB highlighted deeper institutional issues within NASA. They found that there were organizational barriers preventing effective communication of safety concerns. Moreover, safety concerns had been normalized over time due to prior incidents that did not result in visible failure. Essentially, since nothing bad had happened in prior incidents where the foam was shed, the practice had been erroneously deemed “safe.”

3. Decision Making: The CAIB pointed out issues in decision-making processes that relied heavily on past success as a predictor of future success, rather than rigorous testing and validation.

4. Response to Concerns: There were engineers who were concerned about the foam strike shortly after Columbia’s launch, but their requests for satellite imagery to assess potential damage were denied. The reasons were multifaceted, ranging from beliefs that nothing could be done even if damage was found, to a misunderstanding of the severity of the situation.

The CAIB made a number of recommendations to NASA to improve safety for future shuttle flights. These included:

1. Physical Changes: Improving the way the shuttle’s external tank was manufactured to prevent foam shedding and enhancing on-orbit inspection and repair techniques to address potential damage.

2. Organizational Changes: Addressing the cultural issues and communication barriers within NASA that led to the accident, and ensuring that safety concerns were more rigorously addressed.

3. Continuous Evaluation: Establishing an independent Technical Engineering Authority responsible for technical requirements and all waivers to them, and building an independent safety program that oversees all areas of shuttle safety.

Could Purpose-built AI Observability Have Helped?

NASA grounded the shuttle fleet for more than two years after the disaster and implemented the CAIB’s recommendations before resuming shuttle flights. The Columbia disaster, along with the Challenger explosion in 1986, is a stark reminder of the risks of space travel and the importance of a diligent and transparent safety culture. The lessons from Columbia shaped many of the safety practices NASA follows in its current human spaceflight programs.

The Columbia disaster led to profound changes in how space missions were approached, with a renewed emphasis on data collection and eliminating informational blind spots. But for me, it became a deeply personal mission. I realized that sometimes, the absence of data could speak louder than the most glaring of numbers. It was a lesson I would carry throughout my career, ensuring that no stone was left unturned, and no data point overlooked.

The Columbia disaster, at its core, was a result of both a physical failure (foam insulation striking the shuttle’s wing) and organizational oversights (inadequate recognition of, and response to, potential risks). Purpose-built AI Observability, which involves leveraging artificial intelligence to gain insights into complex systems and predict failures, might have helped in several key ways:

1. Real-time Anomaly Detection: Modern AI systems can analyze vast amounts of data in real time to identify anomalies. If an AI-driven observability platform had been monitoring the shuttle’s various sensors and systems, it might have detected unexpected changes or abnormalities in the shuttle’s behavior after the foam strike, potentially even subtle ones that humans might overlook.

2. Historical Analysis: An AI system with access to all previous shuttle launch and flight data might have detected patterns or risks associated with foam-shedding incidents, even if they hadn’t previously resulted in a catastrophe. The system could then raise these as potential long-term risks.

3. Predictive Maintenance: AI-driven tools can predict when components of a system are likely to fail based on current and historical data. If applied to the shuttle program, such a system might have provided early warnings about potential vulnerabilities in the shuttle’s design or wear-and-tear.

4. Decision Support: AI systems could have aided human decision-makers in evaluating the potential risks of continuing the mission after the foam strike, providing simulations, probabilities of failure, or other key metrics to help guide decisions.

5. Enhanced Imaging and Diagnosis: If equipped with sophisticated imaging capabilities, AI could analyze images of the shuttle (from external cameras or satellites) to detect potential damage, even if it’s minor, and then assess the risks associated with such damage.

6. Overcoming Organizational Blind Spots: One of the major challenges in the Columbia disaster was the normalization of deviance, where foam shedding became an “accepted” risk because it hadn’t previously caused a disaster. A well-designed AI system is not subject to the same social pressure to accept a known risk. It would evaluate each foam strike on the evidence in the data, not on the fact that earlier flights had survived similar events.

7. Alerts and Escalations: An AI system can be programmed to escalate potential risks to higher levels of authority, ensuring that crucial decisions don’t get caught in bureaucratic processes.
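To make the first and last items in this list more concrete, here is a minimal sketch of anomaly detection with a simple escalation rule. It is not how any shuttle system or any specific observability product works. It swaps in a basic rolling z-score for whatever model a real platform would use, and the function name, window size, thresholds, and sample readings are all assumptions invented for illustration.

```python
from collections import deque
from statistics import mean, stdev


def rolling_zscore_alerts(readings, window=5, warn=3.0, critical=6.0):
    """Flag readings that deviate sharply from the recent baseline.

    The window size and both thresholds are illustrative assumptions.
    Returns a list of (index, value, z_score, level) tuples.
    """
    history = deque(maxlen=window)
    alerts = []
    for i, value in enumerate(readings):
        # Only score a reading once a full baseline window exists.
        if len(history) == window:
            baseline = mean(history)
            spread = stdev(history)
            if spread > 0:
                z = abs(value - baseline) / spread
                if z >= critical:
                    # Severe deviation: route straight to a higher authority.
                    alerts.append((i, value, round(z, 1), "escalate"))
                elif z >= warn:
                    alerts.append((i, value, round(z, 1), "warn"))
        history.append(value)
    return alerts


# Hypothetical sensor trace in arbitrary units, not real shuttle telemetry.
trace = [20.0, 20.1, 19.9, 20.2, 20.0, 20.1, 35.0, 20.0]
for alert in rolling_zscore_alerts(trace):
    print(alert)
```

In this sketch only the reading at index 6 is flagged, and it goes directly to the "escalate" level. Note one limitation of such a simple baseline: once a spike enters the window it inflates the spread, which can mask anomalies that follow closely behind it.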

While AI Observability could have provided invaluable insights and might have changed the course of events leading to the Columbia disaster, it’s essential to note that the integration of such AI systems also requires organizational openness to technological solutions and a proactive attitude toward safety. The technology is only as effective as the organization’s willingness to act on its findings.

The tragedy served as a grim reminder for organizations worldwide: It’s not just about collecting data; it’s about understanding the significance of what isn’t there. Because in those blind spots, destiny can take a drastic turn.
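One way to act on "what isn't there" is a coverage audit: list the signals you believe you need, then check which of them have no data at all. The sketch below is only an illustration of that idea. The function name and the signal names are hypothetical, loosely echoing the engine heat and radio data that existed and the impact data that did not.

```python
def find_blind_spots(expected_signals, observed_samples, min_samples=1):
    """Return the expected signals with fewer than min_samples observations."""
    counts = {name: 0 for name in expected_signals}
    for name, _value in observed_samples:
        if name in counts:
            counts[name] += 1
    return sorted(name for name, n in counts.items() if n < min_samples)


# Hypothetical signal names and values, invented for illustration only.
expected = {"engine_temp", "radio_signal", "wing_impact_load"}
observed = [
    ("engine_temp", 812.0),
    ("radio_signal", -71.5),
    ("engine_temp", 815.2),
]
print(find_blind_spots(expected, observed))  # prints ['wing_impact_load']
```

The hard part in practice is the list of expected signals, which no tool can supply on its own. A check like this only surfaces gaps that someone has already thought to name.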

In Memory and In Action

The Columbia crew and their families deserve our utmost respect and admiration for their unwavering commitment to space exploration and the betterment of humanity. 

  • Rick D. Husband: As the Commander of the mission, Rick led with dedication, confidence, and unparalleled skill. His devotion to space exploration was evident in every decision he made. We remember him not just for his expertise, but for his warmth and his ability to inspire those around him. His family’s strength and grace, in the face of the deepest pain, serve as a testament to the love and support they provided him throughout his journey.
  • William C. McCool: As the pilot of Columbia, William’s adeptness and unwavering focus were essential to the mission. His enthusiasm and dedication were contagious, elevating the spirits of everyone around him. His family’s resilience and pride in his achievements are a reflection of the man he was — passionate, driven, and deeply caring.
  • Michael P. Anderson: As the payload commander, Michael’s role was vital, overseeing the myriad of experiments and research aboard the shuttle. His intellect was matched by his kindness, making him a cherished member of the team. His family’s courage and enduring love encapsulate the essence of Michael’s spirit — bright, optimistic, and ever-curious.
  • Ilan Ramon: As the first Israeli astronaut, Ilan represented hope, unity, and the bridging of frontiers. His enthusiasm for life was infectious, and he inspired millions with his historic journey. His family’s grace in the face of the unthinkable tragedy is a testament to their shared dream and the values that Ilan stood for.
  • Kalpana Chawla: Known affectionately as ‘KC’, Kalpana’s journey from a small town in India to becoming a space shuttle mission specialist stands as an inspiration to countless dreamers worldwide. Her determination, intellect, and humility made her a beacon of hope for many. Her family’s dignity and strength, holding onto her legacy, remind us all of the power of dreams and the sacrifices made to realize them.
  • David M. Brown: As a mission specialist, David brought with him a zest for life, a passion for learning, and an innate curiosity that epitomized the spirit of exploration. He ventured where few dared and achieved what many only dreamt of. His family’s enduring love and their commitment to preserving his memory exemplify the close bond they shared and the mutual respect they held for each other.
  • Laurel B. Clark: As a mission specialist, Laurel’s dedication to scientific exploration and discovery was evident in every task she undertook. Her warmth, dedication, and infectious enthusiasm made her a beloved figure within her team and beyond. Her family’s enduring spirit, cherishing her memories and celebrating her achievements, is a tribute to the love and support that were foundational to her success.

To each of these remarkable individuals and their families, we extend our deepest respect and gratitude. Their sacrifices and contributions will forever remain etched in the annals of space exploration, reminding us of the human spirit’s resilience and indomitable will.

For those of us close to the Columbia disaster, it was more than a failure; it was a personal loss. Yet, in memory of those brave souls, we are compelled to look ahead. In the stories whispered to us by data, and in the painful lessons from their absence, we seek to ensure that such tragedies are averted in the future.

While no technology can turn back time, the promise of AI Observability beckons a future where every anomaly is caught, every blind spot illuminated, and every astronaut returns home safely.
