Chapter 1. Data science in a big data world
Figure 1.1. An Excel table is an example of structured data.
Figure 1.2. Email is simultaneously an example of unstructured data and natural language data.
Figure 1.3. Example of machine-generated data
Figure 1.4. Friends in a social network are an example of graph-based data.
Figure 1.5. The data science process
Figure 1.6. Big data technologies can be classified into a few main components.
Figure 1.7. The end result: the average salary by job description
Figure 1.8. Hortonworks Sandbox running within VirtualBox
Figure 1.9. The Hortonworks Sandbox welcome screen available at http://127.0.0.1:8000
Figure 1.10. A list of available tables in HCatalog
Figure 1.11. The contents of the table
Figure 1.12. You can execute a HiveQL command in the Beeswax HiveQL editor. Behind the scenes, it’s translated into a MapReduce job.
Figure 1.13. The log shows that your HiveQL is translated into a MapReduce job. Note: This log is from the February 2015 version of HDP, so the current version might look slightly different.
Figure 1.14. The end result: an overview of the average salary by profession
Chapter 2. The data science process
Figure 2.1. The six steps of the data science process
Figure 2.2. Step 1: Setting the research goal
Figure 2.3. Step 2: Retrieving data
Figure 2.4. Step 3: Data preparation