8 Gathering data at scale for real-world AI
This chapter covers

  • Selecting sources of data for AI applications
  • Building a serverless web crawler to find sources for large-scale data
  • Extracting data from websites using AWS Lambda
  • Understanding compliance, legal aspects, and politeness considerations for large-scale data gathering
  • Using CloudWatch Events as a bus for event-driven serverless systems
  • Performing service orchestration using AWS Step Functions

In chapter 7, we dealt with the application of natural language processing (NLP) techniques to product reviews. We showed how sentiment analysis and classification of text can be achieved with Amazon Comprehend using streaming data in a serverless architecture. In this chapter, we are concerned with data gathering.

According to some estimates, data scientists spend 50-80% of their time collecting and preparing data.1,2 Many data scientists and machine learning practitioners say that finding good-quality data and preparing it correctly are the biggest challenges in analytics and machine learning work. The value of applying machine learning is only as good as the quality of the data fed into the algorithm. Before we jump into developing any AI solution, there are a few key questions to answer about the data that will be used:

  • What data is required and in what format?
  • What sources of data are available?
  • How will the data be cleansed?

8.1 Scenario: Finding events and speakers

8.1.1 Identifying data required

8.1.2 Sources of data

8.1.3 Preparing data for training

8.2 Gathering data from the web

8.3 Introduction to web crawling

8.3.1 Typical web crawler process

8.3.2 Web crawler architecture

8.3.3 Serverless web crawler architecture

8.4 Implementing an item store