Appendix B. More about the workings of HDFS
The Hadoop Distributed File System (HDFS) is the distributed file system most commonly used to run HBase. Many HBase features depend on the semantics of HDFS to function properly, so it's important to understand a little about how HDFS works. To understand the inner workings of HDFS, you first need to understand what a distributed file system is. Ordinarily, the concepts at play in the inner workings of a distributed file system could consume an entire semester of graduate study. But in the context of this appendix, we'll briefly introduce the concept and then discuss the details you need to know about HDFS.

B.1. Distributed file systems

Traditionally, an individual computer could handle all the data that an application needed to store and process. The computer might have multiple disks, and for the most part that sufficed, until the recent explosion of data. With more data to store and process than a single computer can handle, we need to combine the power of multiple computers to solve these new storage and compute problems. Systems in which a network of computers (also sometimes referred to as a cluster) works together as a single system to solve a problem are called distributed systems; as the name suggests, the work is distributed across the participating computers. A distributed file system applies this idea to storage: files are spread across the disks of many machines, but clients see a single file system namespace and interact with it much as they would with a local file system, as the sketch below illustrates.
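
To make the single-namespace idea concrete, here is a minimal sketch using Hadoop's org.apache.hadoop.fs.FileSystem API. It assumes a reachable HDFS cluster (fs.defaultFS configured in core-site.xml on the classpath) and the Hadoop client libraries; the path used is hypothetical. The point is that the client code reads like ordinary file I/O even though the bytes live on many machines.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsHelloWorld {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS from core-site.xml; on a real cluster this
        // points at the NameNode, for example hdfs://namenode:8020
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical path; any location in your HDFS namespace works.
        Path path = new Path("/tmp/hello.txt");

        // Write a file. The client sees one file system, even though the
        // data is stored on the disks of multiple machines in the cluster.
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.writeUTF("Hello, HDFS");
        }

        // Read it back through the same single-namespace view.
        try (FSDataInputStream in = fs.open(path)) {
            System.out.println(in.readUTF());
        }

        fs.close();
    }
}

Note that nothing in this code says where the data physically resides; deciding that, and hiding it from the client, is exactly the job of the distributed file system.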

B.2. Separating metadata and data: NameNode and DataNode

B.3. HDFS write path

B.4. HDFS read path

B.5. Resilience to hardware failures via replication

B.6. Splitting files across multiple DataNodes