Concept: Hadoop (category: machine learning)

This is an excerpt from Manning's book Introducing Data Science: Big data, machine learning, and more, using Python tools.
It's best, however, to avoid reading the log window for now; at this point, it's misleading. If this is your first query, it could take 30 seconds, because Hadoop is famous for its warm-up period. That discussion is for later, though.
Spark is a cluster computing framework similar to MapReduce. Spark, however, doesn't handle the storage of files on the (distributed) file system itself, nor does it handle resource management. For this it relies on systems such as the Hadoop Distributed File System (HDFS), YARN, or Apache Mesos. Hadoop and Spark are thus complementary systems. For testing and development, you can even run Spark on your local system.
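To show how little is needed for that, here is a minimal sketch of starting Spark in local mode; it assumes the PySpark package is installed, and the application name is an arbitrary placeholder:

    from pyspark.sql import SparkSession

    # Run Spark on the local machine using all available cores;
    # no HDFS, YARN, or Mesos cluster is required in this mode.
    spark = (SparkSession.builder
             .master("local[*]")
             .appName("local-spark-test")
             .getOrCreate())

    # A trivial job to confirm that the local "cluster" works.
    print(spark.sparkContext.parallelize(range(100)).sum())

    spark.stop()

In local mode, Spark still schedules tasks across cores the way it would across machines, which is what makes it useful for testing and development before deploying to a real cluster.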
We had already downloaded and unzipped the file in listing 5.1; in listing 5.2 we make a sub-selection of the data using Pandas and store it locally, then create a directory on Hadoop and transfer the local file to Hadoop. The downloaded data is in CSV format, and because it's rather small, we can use the Pandas library to remove the first line and the last two lines from the file; these contain comments and would only make working with this file cumbersome in a Hadoop environment. The first line of our code imports the Pandas package, the second parses the file into memory and removes the first and last two data lines, and the third saves the data to the local file system for later use and easy inspection.
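The listings themselves aren't reproduced here, but a sketch of the steps just described could look like the following; the file names and the HDFS path are hypothetical stand-ins for the ones used in listings 5.1 and 5.2:

    import subprocess
    import pandas as pd

    # Parse the file into memory, dropping the first line and the last
    # two lines, which contain comments. (skipfooter requires Pandas's
    # Python parsing engine.)
    data = pd.read_csv("downloaded_data.csv", skiprows=1, skipfooter=2,
                       engine="python")

    # Save the cleaned data to the local file system for later use
    # and easy inspection.
    data.to_csv("data_clean.csv", index=False)

    # Create a directory on Hadoop and transfer the local file to it.
    subprocess.run(["hadoop", "fs", "-mkdir", "-p", "/user/demo/data"],
                   check=True)
    subprocess.run(["hadoop", "fs", "-put", "data_clean.csv",
                    "/user/demo/data/"], check=True)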
At the heart of Hadoop we find a distributed file system (HDFS) and a framework to execute programs on a massive scale (MapReduce).

This is an excerpt from Manning's book Mahout in Action.
Sophisticated machine learning techniques, applied at scale, were until recently something only large, advanced technology companies could consider using. But today computing power is cheaper than ever and more accessible via open source frameworks like Apache's Hadoop. Mahout attempts to complete the puzzle by providing quality, open source implementations capable of solving problems at this scale with Hadoop, and by putting them into the hands of all technology organizations.
Parts of Mahout make use of Hadoop, which includes an open source, Java-based implementation of the MapReduce distributed computing framework popularized and used internally at Google (http://labs.google.com/papers/mapreduce.html). MapReduce is a programming paradigm that at first sounds odd, or too simple to be powerful. The MapReduce paradigm applies to problems where the input is a set of key-value pairs. A map function turns these key-value pairs into other, intermediate key-value pairs. A reduce function merges, in some way, all values for each intermediate key to produce the output. Many problems can be framed as MapReduce problems, or as a series of them. The paradigm also lends itself quite well to parallelization: each map call and each reduce call is independent of the others, so the processing can be split across many machines. Rather than reproduce a full explanation of MapReduce here, we refer you to tutorials such as the one provided by Hadoop (http://hadoop.apache.org/mapreduce/docs/current/mapred_tutorial.html).
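To make this concrete, the following plain-Python simulation runs the classic word count as a map-shuffle-reduce sequence on one machine; it's an illustration of the paradigm itself, not of Hadoop's API:

    from collections import defaultdict

    def map_fn(key, value):
        # Input pair: (line number, line of text).
        # Emit an intermediate (word, 1) pair for every word.
        for word in value.split():
            yield word, 1

    def reduce_fn(key, values):
        # Merge all values for one intermediate key into one output pair.
        yield key, sum(values)

    input_pairs = [(0, "to be or not to be"), (1, "to do is to be")]

    # Shuffle: group the intermediate pairs by key.
    groups = defaultdict(list)
    for k, v in input_pairs:
        for word, count in map_fn(k, v):
            groups[word].append(count)

    # Reduce: one independent call per intermediate key.
    for word in sorted(groups):
        for out_key, out_value in reduce_fn(word, groups[word]):
            print(out_key, out_value)   # e.g., "to 4"

Every call to map_fn depends only on its own input pair, and every call to reduce_fn depends only on its own key's values; that independence is exactly what lets the work be scattered across a cluster.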
Hadoop implements the MapReduce paradigm, which is no small feat, even given how simple MapReduce sounds. It manages storage of the input, intermediate key-value pairs, and output; this data could potentially be massive and must be available to many worker machines, not just stored locally on one. It also manages partitioning and data transfer between worker machines, as well as detection of and recovery from individual machine failures. Understanding how much work goes on behind the scenes will help prepare you for how relatively complex using Hadoop can seem. It’s not just a library you add to your project. It’s several components, each with libraries and (several) standalone server processes, which might be run on several machines. Operating processes based on Hadoop isn’t simple, but investing in a scalable, distributed implementation can pay dividends later: your data may quickly grow to great size, and this sort of scalable implementation is a way to future-proof your application.
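To give a feel for what a real job involves, here is a hedged sketch using Hadoop's Streaming facility, which lets any program that reads standard input and writes tab-separated key-value pairs to standard output act as a mapper or reducer; Mahout's own jobs, covered next, are instead written in Java against the native API:

    # mapper.py -- Hadoop feeds input lines to standard input.
    import sys

    for line in sys.stdin:
        for word in line.split():
            print(word + "\t1")   # emit intermediate (word, 1) pairs

    # reducer.py -- Hadoop delivers the mapper output sorted by key,
    # so all values for one key arrive as consecutive lines.
    import sys

    current_word, count = None, 0
    for line in sys.stdin:
        word, value = line.rstrip("\n").split("\t")
        if word != current_word:
            if current_word is not None:
                print(current_word + "\t" + str(count))
            current_word, count = word, 0
        count += int(value)
    if current_word is not None:
        print(current_word + "\t" + str(count))

Such a job is submitted with something along the lines of hadoop jar hadoop-streaming.jar -input in -output out -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py (the streaming JAR's location depends on your installation), at which point all of the storage, partitioning, transfer, and failure handling described above happens behind the scenes.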
In chapter 6, this book will try to cut through some of that complexity to get you running on Hadoop quickly, after which you can explore the finer points of operating full clusters and tuning the framework. Because this complex framework, which needs a great deal of computing power, is becoming so popular, it's not surprising that cloud computing providers are beginning to offer Hadoop-related services. For example, Amazon offers Elastic MapReduce (http://aws.amazon.com/elasticmapreduce/), a service that manages a Hadoop cluster, provides the computing power, and puts a friendlier interface on the otherwise complex task of operating and monitoring a large-scale job with Hadoop.
Now it’s possible to translate the algorithm into a form that can be implemented with MapReduce and Apache Hadoop. Hadoop, as noted before, is a popular distributed computing framework that includes two components of interest: the Hadoop Distributed Filesystem (HDFS), and an implementation of the MapReduce paradigm.
The subsections that follow will introduce, one by one, the several MapReduce stages that come together into a pipeline that makes recommendations. Each individually does a little bit of the work. We’ll look at the inputs, outputs, and purpose of each stage. Be prepared for plenty of reading; even this simple recommender algorithm will take five MapReduce stages, and this is a simplified form of how it exists in Mahout! By the end of this section you’ll have seen a complete end-to-end recommender system based on Hadoop.
Note that this chapter will use the Hadoop APIs found in version 0.20.2 of the framework. The code that follows can be found in its complete form within Mahout, and it should be runnable with Hadoop 0.20.2 or later versions of the 0.20.x branch. In particular, refer to org.apache.mahout.cf.taste.hadoop.item.RecommenderJob, which invokes the actual implementation of all of the following processes.
In order to run RecommenderJob, and allow Hadoop to run these jobs, you need to combine all of this code into one JAR file, along with all of the code it depends upon. This can be accomplished easily by running mvn clean package from the core/ directory in the Mahout distribution; this will produce a file like mahout-core-0.5-job.jar. Or you can use a precompiled job JAR from Mahout's distribution.
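Once the job JAR exists, the job is submitted with Hadoop's generic hadoop jar mechanism, along the lines of hadoop jar mahout-core-0.5-job.jar org.apache.mahout.cf.taste.hadoop.item.RecommenderJob followed by arguments naming the input and output locations on HDFS; the exact arguments vary between Mahout versions, so check RecommenderJob's help output for yours.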