List of Figures

Chapter 1. Data science in a big data world

Figure 1.1. An Excel table is an example of structured data.

Figure 1.2. Email is simultaneously an example of unstructured data and natural language data.

Figure 1.3. Example of machine-generated data

Figure 1.4. Friends in a social network are an example of graph-based data.

Figure 1.5. The data science process

Figure 1.6. Big data technologies can be classified into a few main components.

Figure 1.7. The end result: the average salary by job description

Figure 1.8. Hortonworks Sandbox running within VirtualBox

Figure 1.9. The Hortonworks Sandbox welcome screen available at http://127.0.0.1:8000

Figure 1.10. A list of available tables in HCatalog

Figure 1.11. The contents of the table

Figure 1.12. You can execute a HiveQL command in the Beeswax HiveQL editor. Behind the scenes it’s translated into a MapReduce job.

Figure 1.13. The logging shows that your HiveQL is translated into a MapReduce job. Note: This log was from the February 2015 version of HDP, so the current version might look slightly different.

Figure 1.14. The end result: an overview of the average salary by profession

Chapter 2. The data science process

Figure 2.1. The six steps of the data science process

Figure 2.2. Step 1: Setting the research goal

Figure 2.3. Step 2: Retrieving data

Figure 2.4. Step 3: Data preparation