9 Extracting value from large data sets with AI

 

This chapter covers

  • Using Amazon Comprehend for named entity recognition (NER)
  • Understanding Comprehend’s modes of operation (asynchronous, batch, and synchronous)
  • Using asynchronous Comprehend services
  • Triggering Lambda functions using S3 notifications
  • Handling errors in Lambdas using a dead-letter queue
  • Processing results from Comprehend
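
To preview the kind of output we will be working with, the sketch below filters a Comprehend entity detection result, keeping only high-confidence entities. The sample data mirrors the shape of the `Entities` array that Comprehend's `DetectEntities` API returns; the entity text values and the 0.8 confidence threshold are illustrative assumptions, not values from this chapter's code.

```javascript
// Sample of the structure returned by Comprehend's DetectEntities API.
// The specific entities here are made up for illustration.
const sampleResult = {
  Entities: [
    { Text: 'AWS re:Invent', Type: 'EVENT', Score: 0.97, BeginOffset: 0, EndOffset: 13 },
    { Text: 'Las Vegas', Type: 'LOCATION', Score: 0.99, BeginOffset: 22, EndOffset: 31 },
    { Text: 'keynote', Type: 'OTHER', Score: 0.41, BeginOffset: 40, EndOffset: 47 }
  ]
}

// Keep only entities detected above a confidence threshold.
function highConfidenceEntities (result, threshold = 0.8) {
  return result.Entities.filter(entity => entity.Score >= threshold)
}

const entities = highConfidenceEntities(sampleResult)
console.log(entities.map(e => `${e.Type}: ${e.Text}`))
```

Filtering on the `Score` field like this is a common first step when processing Comprehend results, since low-confidence matches (here, the 0.41 `OTHER` entity) are usually noise.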

Chapter 8 dealt with the challenge of gathering unstructured data from websites for use in machine learning analysis. This chapter builds on chapter 8's serverless web crawler; this time, we are concerned with using machine learning to extract meaningful insights from the data we gathered. If you didn't work through chapter 8, go back and do so now before proceeding, as we will be building directly on top of the web crawler. If you are already comfortable with that content, we can dive right in and add the information extraction parts.

9.1 Using AI to extract significant information from web pages

Let's remind ourselves of the grand vision for our chapter 8 scenario--finding relevant developer conferences to attend. We want to build a system that allows people to search for conferences and speakers of interest to them. The web crawler we built in chapter 8 solved the first part of this scenario--gathering data on conferences.

9.1.1 Understanding the problem

9.1.2 Extending the architecture

9.2 Understanding Comprehend’s entity recognition APIs

9.3 Preparing data for information extraction

9.3.1 Getting the code

9.3.2 Creating an S3 event notification

9.3.3 Implementing the preparation handler

9.3.4 Adding resilience with a dead-letter queue (DLQ)

9.3.5 Creating the DLQ and retry handler

9.3.6 Deploying and testing the preparation service

9.4 Managing throughput with text batches

9.4.1 Getting the code

9.4.2 Retrieving batches of text for extraction