Chapter 6. Programming Practices

This chapter covers

  • Best practices unique to developing Hadoop programs
  • Debugging programs in local, pseudo-distributed, and fully distributed modes
  • Sanity checking and regression testing program outputs
  • Logging and monitoring
  • Performance tuning

Now that you’ve gone through various programming techniques in MapReduce, we’ll step back in this chapter and cover programming practices.

Programming on Hadoop differs from traditional programming in two main ways. First, Hadoop programs are primarily about processing data. Second, Hadoop programs run across a distributed set of computers. These two differences change some aspects of your development and debugging processes, which we cover in sections 6.1 and 6.2.

Performance tuning techniques tend to be specific to the programming platform, and Hadoop is no different. We cover tools and approaches to optimizing Hadoop programs in section 6.3.

Let’s start with the development techniques applicable to Hadoop. We assume you’re already familiar with standard Java software engineering techniques, so we focus on practices unique to data-centric programming within Hadoop.

6.1. Developing MapReduce programs

6.2. Monitoring and debugging on a production cluster

6.3. Tuning for performance

6.4. Summary