This chapter covers
- How images and other perceptual data, such as audio, are represented as multidimensional tensors
- What convnets are, how they work, and why they are especially suitable for machine-learning tasks involving images
- How to write and train a convnet in TensorFlow.js to solve the task of classifying handwritten digits
- How to train models in Node.js to achieve faster training speeds
- How to use convnets on audio data for spoken-word recognition
The ongoing deep-learning revolution started with breakthroughs in image-recognition tasks such as the ImageNet competition. There is a wide range of useful and technically interesting problems that involve images, from recognizing the contents of images to segmenting images into meaningful parts, and from localizing objects in images to synthesizing images. This subarea of machine learning is sometimes referred to as computer vision.[1] Computer-vision techniques are often transplanted to areas that have nothing to do with vision or images (such as natural language processing), which is one more reason why it is important to study deep learning for computer vision.[2] But before delving into computer-vision problems, we need to discuss the ways in which images are represented in deep learning.
[1] Note that computer vision is itself a broad field, some parts of which use non-machine-learning techniques beyond the scope of this book.