Chapter 2. Starting Hadoop

 

This chapter covers

  • The architectural components of Hadoop
  • Setting up Hadoop and its three operating modes: standalone, pseudo-distributed, and fully distributed
  • Web-based tools to monitor your Hadoop setup

This chapter serves as a roadmap for setting up Hadoop. If you work in an environment where someone else sets up the Hadoop cluster for you, you may want to skim this chapter. You’ll want to understand enough to set up your personal development machine, but you can skip over the details of configuring communication and coordination among the various nodes.

After discussing the physical components of Hadoop in section 2.1, we’ll progress to setting up your cluster in sections 2.2 and 2.3. Section 2.3 will focus on the three operational modes of Hadoop and how to set them up. You’ll read about web-based tools that help you monitor your cluster in section 2.4.

2.1. The building blocks of Hadoop

We discussed the concepts of distributed storage and distributed computation in the previous chapter. Now let’s see how Hadoop implements those ideas. On a fully configured cluster, “running Hadoop” means running a set of daemons, or resident programs, on the different servers in your network. These daemons have specific roles; some run on only one server, others run across multiple servers. The daemons include

  • NameNode
  • DataNode
  • Secondary NameNode
  • JobTracker
  • TaskTracker

We’ll discuss each one and its role within Hadoop.
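
To make the split between single-server and multi-server daemons concrete, here is a minimal Python sketch of how the five daemons might be laid out on a small cluster. It is an illustration only, not part of Hadoop: the function `plan_daemons` and the host names (`master`, `backup`, `worker1`, `worker2`) are hypothetical, and the placement assumes the common arrangement in which the NameNode, Secondary NameNode, and JobTracker each run once per cluster while every worker node runs a DataNode and a TaskTracker.

```python
from collections import defaultdict

# Assumption: one DataNode and one TaskTracker run on every worker node.
PER_WORKER_DAEMONS = ["DataNode", "TaskTracker"]


def plan_daemons(master, secondary, workers):
    """Return a host -> daemons mapping for a hypothetical small cluster.

    Assumption: the NameNode and JobTracker share the master host, and the
    Secondary NameNode gets a host of its own. Larger clusters may place
    the NameNode and JobTracker on separate machines.
    """
    layout = defaultdict(list)
    layout[master] += ["NameNode", "JobTracker"]
    layout[secondary].append("Secondary NameNode")
    for host in workers:
        layout[host] += PER_WORKER_DAEMONS
    return dict(layout)


# Hypothetical host names, chosen only for illustration.
layout = plan_daemons("master", "backup", ["worker1", "worker2"])
for host, daemons in layout.items():
    print(f"{host}: {', '.join(daemons)}")
```

Adding more worker hosts to the list only adds DataNode and TaskTracker entries; the three single-instance daemons stay where they are, which mirrors the distinction drawn above.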

2.2. Setting up SSH for a Hadoop cluster

2.3. Running Hadoop

2.4. Web-based cluster UI

2.5. Summary