Chapter 6. Programming Practices

This chapter covers

Best practices unique to developing Hadoop programs
Debugging programs in local, pseudo-distributed, and fully distributed modes
Sanity checking and regression testing program outputs
Logging and monitoring
Performance tuning

Now that you’ve gone through various programming techniques in MapReduce, this chapter steps back to cover programming practices.

Programming on Hadoop differs from traditional programming in two main ways. First, Hadoop programs are primarily about processing data. Second, Hadoop programs run over a distributed set of computers. These two differences change some aspects of your development and debugging processes, which we cover in sections 6.1 and 6.2.

Performance tuning techniques tend to be specific to the programming platform, and Hadoop is no different. We cover tools and approaches for optimizing Hadoop programs in section 6.3.

Let’s start with the development techniques applicable to Hadoop. Presumably you’re already familiar with standard Java software engineering techniques. We focus on practices unique to data-centric programming within Hadoop.

6.1. Developing MapReduce programs

6.2. Monitoring and debugging on a production cluster

6.3. Tuning for performance

6.4. Summary