Chapter 1. Hadoop in a heartbeat

 

This chapter covers

  • Examining how the core Hadoop system works
  • Understanding the Hadoop ecosystem
  • Running a MapReduce job

We live in the age of big data, in which the data volumes we work with on a day-to-day basis have outgrown the storage and processing capabilities of a single host. Big data brings two fundamental challenges: how to store and work with large volumes of data, and, more importantly, how to make sense of that data and turn it into a competitive advantage.

Hadoop fills a gap in the market by effectively storing and providing computational capabilities for substantial amounts of data. It’s a distributed system built around two components: a distributed filesystem (HDFS) for storage, and a framework (MapReduce) for parallelizing and executing programs across a cluster of machines (see figure 1.1). You’ve most likely come across Hadoop because it’s been adopted by technology giants like Yahoo!, Facebook, and Twitter to address their big data needs, and it’s making inroads across all industrial sectors.

Figure 1.1. The Hadoop environment is a distributed system that runs on commodity hardware.
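To make the idea of parallelizing work across a cluster concrete, here’s a minimal sketch of the classic word-count job written against Hadoop’s Java MapReduce API: the map function emits a (word, 1) pair for every word it sees, and the reduce function sums the counts for each word. Treat this as an illustrative sketch (the class names are my own, not code from this book); we’ll install Hadoop and run a real MapReduce job in section 1.2.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Called once per input record; emits a (word, 1) pair for each word.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Receives all counts for a given word and sums them.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    // Configure the job: input and output paths come from the command line.
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);  // pre-aggregate on the map side
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Notice what the code doesn’t do: Hadoop itself takes care of splitting the input across mappers running on different machines, shuffling the intermediate pairs so all counts for the same word reach the same reducer, and writing the final results back to the distributed filesystem.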

Because you’ve come to this book to get some practical experience with Hadoop and Java,[1] I’ll start with a brief overview and then show you how to install Hadoop and run a MapReduce job. By the end of this chapter, you’ll have had a basic refresher on the nuts and bolts of Hadoop, preparing you to move on to its more challenging aspects.

1.1. What is Hadoop?

1.2. Getting your hands dirty with MapReduce

1.3. Chapter summary
