Chain of Events

Posted by bob on February 27, 2016

Have you ever watched or read a story of an air crash investigation? These intriguing mysteries include an unfortunate level of horror. Real lives are lost or changed forever. Survivors can suffer endless guilt—and many of the stories have no survivors of the event itself. But the investigations have a lot in common with good debug methods. For that reason alone, these stories are worth an occasional visit.

One recurring theme in many air crash stories is the presence of a chain of events. The accident might never have occurred if just one action had interrupted the sequence of steps that combined to create the final tragedy.

Two key tools that we all have heard about are the Flight Data Recorder and Cockpit Voice Recorder. These rugged instruments are designed to survive all kinds of mishaps (high-speed impact, fire, and flood) and to deliver clear recordings of the conditions that existed in the moments before the terminal event, which is typically what stops the recording.

If you are developing a computerized system involving hardware and software, you need to think about building some kind of Flight Data Recorder into your system. Instrumenting the key parameters can repay its cost many times over.

Such costs might show up as requirements for additional volatile or non-volatile memory (typically extra RAM and Flash). Another approach is to dedicate a reasonably high-speed data port that constantly streams out diagnostic information, which an external device can then capture.
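As a rough illustration, here is one way a bare-bones RAM recorder might look in C: a ring buffer of timestamped event records that always holds the most recent history before a failure. The record layout, the buffer size, and the get_tick_count() timestamp source are all assumptions made for this sketch, not features of any particular system.

```c
#include <stdint.h>

/* A minimal RAM "flight data recorder": a fixed-size ring buffer of
 * timestamped event records. Sizes, field names, and the timestamp
 * source are placeholders; adapt them to your own system. */

#define FDR_CAPACITY 256            /* number of records kept (power of two) */

typedef struct {
    uint32_t timestamp;             /* e.g. a millisecond tick counter */
    uint16_t event_id;              /* which event or code path this is */
    uint16_t data;                  /* one key parameter worth saving */
} fdr_record_t;

static fdr_record_t fdr_buffer[FDR_CAPACITY];
static volatile uint32_t fdr_head; /* total records written so far */

extern uint32_t get_tick_count(void);   /* assumed platform tick source */

/* Record one event; once the buffer fills, the oldest record is
 * overwritten, so the buffer always holds the most recent history. */
void fdr_log(uint16_t event_id, uint16_t data)
{
    fdr_record_t *slot = &fdr_buffer[fdr_head % FDR_CAPACITY];
    slot->timestamp = get_tick_count();
    slot->event_id  = event_id;
    slot->data      = data;
    fdr_head++;
}

/* After a crash or watchdog reset, walk the buffer oldest-to-newest and
 * hand each record to a caller-supplied output routine (flash writer,
 * debug port, or a debugger memory dump). */
void fdr_dump(void (*emit)(const fdr_record_t *))
{
    uint32_t count = (fdr_head < FDR_CAPACITY) ? fdr_head : FDR_CAPACITY;
    uint32_t start = fdr_head - count;
    for (uint32_t i = 0; i < count; i++) {
        emit(&fdr_buffer[(start + i) % FDR_CAPACITY]);
    }
}
```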

Unfortunately, such a diagnostic port can also be a security concern, because it might reveal too much information about the internal workings of the system. That means the diagnostic information might need to be encrypted (or at least obfuscated), and this can add further requirements for extra memory or computing bandwidth.
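Just to give a flavor of the idea, here is a tiny sketch of obfuscating a diagnostic stream on its way out of a debug port. A rolling XOR like this only deters casual snooping; anything that truly matters for security calls for a real cipher and proper key management. The key bytes and the diag_port_write() routine are placeholders invented for this example.

```c
#include <stdint.h>
#include <stddef.h>

extern void diag_port_write(const uint8_t *buf, size_t len);  /* assumed output routine */

/* Example key; a real system would not hard-code this. */
static const uint8_t obfuscation_key[8] = {
    0x5A, 0xC3, 0x96, 0x0F, 0x27, 0xB4, 0x81, 0x6E
};

/* XOR each outgoing byte with a rolling key before it leaves the port.
 * This hides plain-text strings from a casual observer, nothing more. */
void diag_send_obfuscated(const uint8_t *msg, size_t len)
{
    uint8_t out[64];                 /* small working buffer */
    size_t sent = 0;

    while (sent < len) {
        size_t chunk = (len - sent) < sizeof(out) ? (len - sent) : sizeof(out);
        for (size_t i = 0; i < chunk; i++) {
            out[i] = msg[sent + i] ^
                     obfuscation_key[(sent + i) % sizeof(obfuscation_key)];
        }
        diag_port_write(out, chunk);
        sent += chunk;
    }
}
```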

This is why you should plan for the Flight Data Recorder from the beginning of your project. But what if you or your team did not have the foresight to include such functions? If your system does not put human lives or objects of great value at risk, you can often simply add diagnostic and debug code after the fact. The simplest and most obvious of these is the classic printf statement, which assumes that you have a console available through which to send messages.
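If a console is available, even a thin wrapper around printf goes a long way, because it lets you compile the messages out of release builds. The DEBUG_TRACE flag and the read_sensor() function below are illustrative names, not taken from any real project.

```c
#include <stdio.h>

/* Wrap printf in a macro so the diagnostics vanish from release builds. */
#ifdef DEBUG_TRACE
#define DBG_PRINT(...) printf(__VA_ARGS__)
#else
#define DBG_PRINT(...) ((void)0)
#endif

static int read_sensor(void)
{
    int raw = 512;                        /* placeholder sensor value */
    DBG_PRINT("read_sensor: raw=%d\n", raw);
    return raw;
}

int main(void)
{
    DBG_PRINT("system start\n");
    int value = read_sensor();
    DBG_PRINT("main: value=%d\n", value);
    return 0;
}
```

Building with the compiler option that defines DEBUG_TRACE (for example -DDEBUG_TRACE with GCC) turns the messages on; leaving it out makes every DBG_PRINT compile away to nothing.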

Another great strategy is to assign some unused General Purpose I/O (GPIO) lines to specific diagnostic functions. For example, when the code reaches a particular function, you set the GPIO line HIGH and then when the function exits, you return that same pin to a LOW state. A pulse will appear on that GPIO line and can be monitored with an oscilloscope (or even a simple LED). I have debugged complex systems where this simple tactic provided enough information to realize that a key section of software was never being called, or was being called repeatedly when I only expected a single entry and exit of that code.
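In C, the whole trick can be as small as the sketch below, assuming your hardware layer provides something like gpio_set() and gpio_clear(); those calls and the DIAG_PIN number are placeholders for whatever your platform actually offers.

```c
#define DIAG_PIN 7                      /* an otherwise unused GPIO line */

extern void gpio_set(int pin);          /* assumed HAL call: drive pin HIGH */
extern void gpio_clear(int pin);        /* assumed HAL call: drive pin LOW  */

void process_sample(int sample)
{
    gpio_set(DIAG_PIN);                 /* rising edge: we entered the function */

    /* ... the real work of the function goes here ... */
    (void)sample;

    gpio_clear(DIAG_PIN);               /* falling edge: we are leaving */
}
```

As a bonus, the width of the pulse on the oscilloscope gives you a rough measure of how long the function takes to execute.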

On a completely different note, many air crash investigations reveal that the flight crew did not understand some indication or function of the airplane itself. The aircraft designer was sure that everyone understood the meaning of a particular alarm or indicator. Then a crash reveals that nobody in the cockpit responded appropriately. They either did not understand the situation, or their training was not good enough to override some instinctive reaction.

So sometimes debugging reveals that the human interface of a system was far more confusing than the designer ever imagined. That designer was sure that everybody understood a particular word or symbol. Yet the failure investigation shows that almost nobody knew what that word or symbol was intended to indicate.

Investigations often must analyze a damaged or failed component. If the system failure (crash) is dramatic enough, it can be very difficult to know whether the stress that ruined a part was a significant cause or just an effect of the crash. Investigators become expert at asking questions like, “Can we tell if the engines were still running when the plane hit the ground?”

You need to become an expert at asking the right questions about your system failures.

Once you have enough information, whether from data recorders, component studies, code reviews, or human-factors studies, you can finally start to trace back the chain of events that created your system failure. Your final challenge is to implement improvements that break that chain of events without introducing new modes of failure.

To be a great designer, you need to be a great debugger, which is just another word for problem-solver. To be a good debugger, you need to learn how to be a good tester, and ultimately a good investigator. Work your way backwards through the evidence on hand, and make sure that your systems give you lots of good evidence.