What Software Engineers Can Learn From Air Crashes

Recently, I’ve been watching Air Crash Investigation, a documentary series about real air crashes throughout history. My favorite part is how investigators trace every factor that contributed to a crash, and how the aviation industry learns from it, whether by implementing new rules or improving aircraft design.

In most episodes, the crash is caused by human error. Yes, investigators and regulators could simply blame a specific individual, perhaps for not being smart enough or for lacking patience. But the aviation industry doesn’t stop there. Instead, it simplifies procedures or adds automated systems to increase tolerance for human error.

By comparison, I think software engineers and managers do not learn enough from crashes and failures. People in this industry are good at designing systems that tolerate machine errors: they build cloud architectures with redundancy and failover so that a single failing component doesn’t take down the whole service. But when it comes to managing engineers, they seem to forget the same principles. Managers often focus on finding out who is responsible for a delayed release or a service outage, instead of automating the development process or improving the tools available to engineers.
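
To make that mindset concrete, here is a minimal Python sketch of fault-tolerant thinking. The helper `call_with_retries` and the `flaky_service` it wraps are hypothetical, but the principle is the one aviation applies to pilots: assume any single attempt can fail, and design the surrounding system to absorb it.

```python
import random
import time

def call_with_retries(operation, max_attempts=3, base_delay=0.5):
    """Retry a flaky operation with exponential backoff.

    The design assumes any single call can fail, and builds
    tolerance into the system instead of blaming the component.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts:
                raise  # Out of retries; surface the failure to the caller.
            # Exponential backoff spreads retries out instead of hammering
            # a struggling service; jitter avoids synchronized retry storms.
            delay = base_delay * 2 ** (attempt - 1) + random.uniform(0, 0.1)
            time.sleep(delay)

# A stand-in for a service call that fails intermittently.
def flaky_service():
    if random.random() < 0.5:
        raise ConnectionError("transient network failure")
    return "ok"

print(call_with_retries(flaky_service, max_attempts=5))
```

Nobody reviewing this code would ask which machine deserves the blame for a dropped connection; the failure mode is anticipated and handled by design.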

Software engineering is itself a kind of software. Software production organizes the intellectual work of engineers, just as software itself organizes functions, objects, and procedures. Engineers distill their experience into design principles. But if managers respond to failure with nothing more than punishment, how is that any different from praying that every component in a cloud-based system will work properly?

Air crashes are severe, but the industry humbly learns from them, and airplanes have become much safer as a result. Software crashes, on the other hand, often just make the engineers’ jobs even harder.