Chapter 5. First steps in big data


This chapter covers

  • Taking your first steps with two big data applications: Hadoop and Spark
  • Using Python to write big data jobs
  • Building an interactive dashboard that connects to data stored in a big data database

Over the last two chapters, we’ve steadily increased the size of the data. In chapter 3 we worked with data sets that could fit into the main memory of a computer. Chapter 4 introduced techniques to deal with data sets that were too large to fit in memory but could still be processed on a single computer. In this chapter you’ll learn to work with technologies that can handle data so large that a single node (computer) no longer suffices. In fact, it may not even fit on a hundred computers. Now that’s a challenge, isn’t it?

We’ll stay as close as possible to the way of working from the previous chapters; the focus is on giving you the confidence to work on a big data platform. To that end, the main part of this chapter is a case study. You’ll create a dashboard that allows you to explore data from lenders of a bank. By the end of this chapter you’ll have gone through the following steps:

  • Load data into Hadoop, the most common big data platform.
  • Transform and clean data with Spark.
  • Store the cleaned data in Hive, a big data database.
  • Interactively visualize this data with Qlik Sense, a visualization tool.

5.1. Distributing data storage and processing with frameworks

5.2. Case study: Assessing risk when loaning money

5.3. Summary