chapter seven

7 Reliability

 

This chapter covers:

- Fundamentals: Core reliability concepts.
- Analysis: Concepts used when troubleshooting and debugging.

7.1 Fundamentals

7.1.1 Reliability Introduction

Reliability is a term that is widely used today. Yet, the more a term is used, the more ambiguous it can become.

In this section, I won’t formally define reliability. Instead, I’ll share a story that changed my whole career.

A few years ago, I joined a new company in a safety-critical domain: air traffic management. My first day there is one I will probably remember for the rest of my life. We were doing a training session with all the newcomers, seated in a large conference room and casually waiting for the session to begin. In front of us was the trainer.

After a brief introduction, the trainer asked us to go around the table and explain where we came from. People took turns describing their backgrounds. When it was my turn, I mentioned that I came from the insurance industry. Once everyone had shared their experience, the trainer paused for a moment and then said:

“There’s something important that you should all realize by joining our company: if we have a problem, we may not lose money, we may not lose customers, but we may lose lives.”

7.1.2 Graceful Degradation

7.2 Adaptive LIFO

7.3 Resilient, Fault-tolerant, Robust, or Reliable?

7.3.1 Resilient

7.3.2 Fault-Tolerant

7.3.3 Robust

7.3.4 Reliable

7.4 Fail Open vs. Fail Closed

7.5 Soft vs. Hard Dependency

7.5.1 Why It Matters

7.5.2 Soft or Hard Dependency?

7.5.3 Evolutions Over Time

7.5.4 Improving Reliability

7.6 Analysis

7.6.1 Post Hoc Ergo Propter Hoc

7.7 Lurking Variables

7.7.1 Scenario

7.7.2 Correlation vs. Causation

7.7.3 Lurking Variable

7.7.4 Avoiding Lurking Variables in Analysis

7.7.5 Detecting Lurking Variables with Data Segmentation

7.8 Summary