Chapter 5. First steps in big data
This chapter covers
- Taking your first steps with two big data applications: Hadoop and Spark
- Using Python to write big data jobs
- Building an interactive dashboard that connects to data stored in a big data database
Over the last two chapters, we’ve steadily increased the size of the data. In chapter 3 we worked with data sets that could fit into the main memory of a single computer. Chapter 4 introduced techniques for data sets that were too large to fit in memory but could still be processed on a single computer. In this chapter you’ll learn to work with technologies that can handle data so large that a single node (computer) no longer suffices; in fact, the data may not even fit on a hundred computers. Now that’s a challenge, isn’t it?
We’ll stay as close as possible to the way of working from the previous chapters; the focus is on giving you the confidence to work on a big data platform. To this end, the main part of this chapter is a case study: you’ll create a dashboard that lets you explore data from a bank’s lenders. By the end of this chapter you’ll have gone through the following steps:
- Load data into Hadoop, the most common big data platform.
- Transform and clean the data with Spark (a minimal sketch of this step follows the list).
- Store the result in Hive, a big data database.
- Interactively visualize the data with Qlik Sense, a visualization tool.
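To give you a feel for where we’re heading, here’s a minimal PySpark sketch of the middle two steps: reading a raw file from Hadoop’s file system, applying a simple cleaning pass, and saving the result as a Hive table. The file path, column name, and table name are hypothetical placeholders; the case study later in this chapter works through the real data.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Start a Spark session with Hive support, so tables we save
# become queryable from Hive (and, later, from a dashboard tool).
spark = (SparkSession.builder
         .appName("lender_dashboard_prep")
         .enableHiveSupport()
         .getOrCreate())

# Read a raw CSV file from HDFS; the path and schema are hypothetical.
raw = spark.read.csv("hdfs:///data/lenders.csv",
                     header=True, inferSchema=True)

# A simple cleaning pass: drop rows with missing values and
# normalize a hypothetical text column to lowercase.
clean = (raw.dropna()
            .withColumn("country", F.lower(F.col("country"))))

# Store the cleaned data as a Hive table for the dashboard to query.
clean.write.mode("overwrite").saveAsTable("lenders_clean")

spark.stop()
```

Note the last step: saving with `saveAsTable` rather than writing plain files is what registers the result in Hive’s metastore, which in turn is what lets a visualization tool such as Qlik Sense find and query the data.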