About this Book
Hadoop is an open source framework that implements MapReduce, the programming model behind Google's approach to querying the distributed data sets that make up the internet. This definition naturally leads to an obvious question: What are maps, and why do they need to be reduced? Massive data sets can be extremely difficult to analyze and query using traditional mechanisms, especially when the queries themselves are complicated. In effect, MapReduce breaks both the query and the data set into constituent parts; that is the mapping. The mapped pieces are then processed in parallel, and their intermediate results are combined, or reduced, into the final answer, allowing results to be returned rapidly.
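The classic word count program is a minimal sketch of this decomposition. The mapper emits each word with a count of 1, and the reducer sums the counts that the framework groups under each word. The class names and the omission of the job-configuration driver are illustrative here; they are not a definitive layout for a Hadoop program.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

  // Map step: each input line is split into words, and each word is
  // emitted with a count of 1. Many mappers run in parallel, one per
  // piece (split) of the data set.
  public static class TokenizerMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer tokens = new StringTokenizer(value.toString());
      while (tokens.hasMoreTokens()) {
        word.set(tokens.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce step: the framework groups all counts emitted for the same
  // word, and the reducer sums them into a final total per word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable count : values) {
        sum += count.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }
}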
This book teaches readers how to use Hadoop and write MapReduce programs. Its intended readers are programmers, architects, and project managers who need to process large amounts of data offline. The book guides the reader from obtaining a copy of Hadoop, through setting it up on a cluster, to writing data analysis programs.
The book begins by making the basic ideas of Hadoop and MapReduce easier to grasp, applying the default Hadoop installation to a few easy-to-follow tasks, such as analyzing changes in word frequency across a body of documents. It then works through the core concepts of MapReduce applications developed with Hadoop, including a close look at framework components, the use of Hadoop for a variety of data analysis tasks, and numerous examples of Hadoop in action.