chapter seven
7 Reliability
This chapter covers
- Distinguishing between Fail-safe (prioritizing safety) and Fail-closed (prioritizing security)
- Classifying dependencies as Hard or Soft to manage graceful degradation
- Identifying Lurking Variables that distort root cause analysis
- Avoiding the "Post hoc ergo propter hoc" fallacy during troubleshooting
- Overcoming confirmation bias when diagnosing incidents
In software development, we often define "done" as merged and deployed. But for the end user, "done" means available and working. Reliability is the discipline that bridges this gap. It's not just about preventing crashes; it's about ensuring that when things fail (and they will), they fail in a way that minimizes harm.