chapter seven

7 Reliability

This chapter covers

  • Distinguishing between Fail-safe (prioritizing safety) and Fail-closed (prioritizing security)
  • Classifying dependencies as Hard or Soft to manage graceful degradation
  • Identifying Lurking Variables that distort root cause analysis
  • Avoiding the "Post hoc ergo propter hoc" fallacy during troubleshooting
  • Overcoming confirmation bias when diagnosing incidents

In software development, we often define "done" as merged and deployed. But for the end user, "done" means available and working. Reliability is the discipline that bridges this gap. It's not just about preventing crashes; it's about ensuring that when things fail (and they will), they fail in a way that minimizes harm.

7.1 Fundamentals

7.1.1 Reliability Introduction

7.1.2 Graceful Degradation

7.2 Adaptive LIFO

7.3 Resilient, Fault-tolerant, Robust, or Reliable?

7.3.1 Resilient

7.3.2 Fault-Tolerant

7.3.3 Robust

7.3.4 Reliable

7.4 Fail Open vs. Fail Closed

7.5 Soft vs. Hard Dependency

7.5.1 Why It Matters

7.5.2 Soft or Hard Dependency?

7.5.3 Evolutions Over Time

7.5.4 Improving Reliability

7.6 Analysis

7.6.1 Post Hoc Ergo Propter Hoc

7.7 Lurking Variables

7.7.1 Scenario

7.7.2 Correlation vs. Causation

7.7.3 Lurking Variable

7.7.4 Avoiding Lurking Variables in Analysis

7.7.5 Detecting Lurking Variables with Data Segmentation

7.8 Summary