chapter two

2 Pretrained networks

This chapter covers

Running pretrained image-recognition models
Working with pretrained transformers and diffusion models
Accessing models through Hugging Face
Captioning images with a pretrained model

In our first chapter, we hinted at the transformative potential of deep learning, and now it’s time to deliver. Computer vision is certainly one of the fields most affected by the advent of deep learning, for several reasons. The need to classify or interpret the content of natural images grew, very large datasets became available, and new constructs such as convolutional layers were invented that could run quickly on GPUs and reach unprecedented accuracy. Add to these factors the internet giants’ desire to understand the pictures that millions of users take with their mobile devices and manage on their platforms. Quite the perfect storm.

2.1 A pretrained network that recognizes the subject of an image

2.1.1 Obtaining a pretrained network for image recognition

2.1.2 AlexNet

2.1.3 The Vision Transformer

2.1.4 Ready, set, almost run

2.1.5 Run!

2.2 Generating and editing images

2.2.1 The inpainting process

2.2.2 A network that turns horses into zebras

2.3 Model Zoo: Hugging Face

2.4 A pretrained network that describes scenes

2.4.1 BLIP in action

2.5 Conclusion

2.6 Exercises

Summary