Chapter 5. Moving data into and out of Hadoop

This chapter covers

  • Understanding key design considerations for data ingress and egress tools
  • Low-level methods for moving data into and out of Hadoop
  • Techniques for moving log files, relational and NoSQL data, and data in Kafka into and out of HDFS

Data movement is one of those things that you aren’t likely to think too much about until you’re fully committed to using Hadoop on a project, at which point it becomes a big, scary unknown that has to be tackled. How do you get log data that’s sitting across thousands of hosts into Hadoop? What’s the most efficient way to get your data out of your relational and No/NewSQL systems and into Hadoop? How do you get Lucene indexes generated in Hadoop out to your servers? And how can these processes be automated?

Welcome to chapter 5, where the goal is to answer these questions and set you on your path to worry-free data movement. In this chapter you’ll first see how data across a broad spectrum of locations and formats can be moved into Hadoop, and then you’ll see how data can be moved out of Hadoop.

5.1. Key elements of data movement

5.2. Moving data into Hadoop

5.3. Moving data out of Hadoop

5.4. Chapter summary