Chapter 4. Recognizing images and sounds using convnets

This chapter covers

How images and other perceptual data, such as audio, are represented as multidimensional tensors
What convnets are, how they work, and why they are especially suitable for machine-learning tasks involving images
How to write and train a convnet in TensorFlow.js to solve the task of classifying handwritten digits
How to train models in Node.js to achieve faster training speeds
How to use convnets on audio data for spoken-word recognition

The ongoing deep-learning revolution started with breakthroughs in image recognition, such as those in the ImageNet competition. Many useful and technically interesting problems involve images: recognizing the contents of an image, segmenting an image into meaningful parts, localizing objects in an image, and synthesizing new images. This subarea of machine learning is sometimes referred to as computer vision.^[1] Computer-vision techniques are often transplanted to areas that have nothing to do with vision or images (such as natural language processing), which is one more reason why it is important to study deep learning for computer vision.^[2] But before delving into computer-vision problems, we need to discuss the ways in which images are represented in deep learning.

¹ Note that computer vision is itself a broad field, some parts of which use non-machine-learning techniques beyond the scope of this book.

This chapter covers

4.1. From vectors to tensors: Representing images

4.2. Your first convnet

4.3. Beyond browsers: Training models faster using Node.js

4.4. Spoken-word recognition: Applying convnets on audio data

Exercises

Summary