4 Using Hugging Face for Computer Vision Tasks

 

This chapter covers

  • The different types of computer vision models on Hugging Face
  • Various ways to use the models for object detection
  • How to perform image classification
  • How to perform image segmentation
  • How to perform video content classification

In the previous chapter, you learned about Hugging Face transformers and pipelines. You also learned how to use some of the pre-trained models for NLP tasks, such as sentiment analysis and text translation. Besides NLP tasks, Hugging Face also provides a vast collection of pre-trained models for computer vision tasks (see Figure 4.1; https://huggingface.co/models).

Figure 4.1 Computer Vision-related models on Hugging Face

Using these hosted pre-trained models, you can create interesting applications that detect objects in images, estimate a person's age, and more. In this chapter, you will use a number of these models for computer vision tasks.

4.1 Types of Computer Vision Models on Hugging Face

The computer vision models hosted on Hugging Face are grouped into the following tasks:

  • Object Detection
  • Image Classification
  • Image Segmentation
  • Video Classification
  • Depth Estimation
  • Image-to-Image
  • Unconditional Image Generation
  • Zero-Shot Image Classification

In this chapter, you will learn how to perform some of these tasks using the models hosted by Hugging Face. Specifically, you will work through the first four tasks in the list: object detection, image classification, image segmentation, and video classification.
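As a rough sketch of how the first four tasks in the list map onto code, each one corresponds to a task identifier string that the `pipeline()` function from the previous chapter accepts. The names `CHAPTER_TASKS` and `build_pipeline` below are hypothetical helpers introduced only for illustration. Calling `build_pipeline` without specifying a model is assumed to download whatever default model the library picks for that task, which may also require extra dependencies depending on the model.

```python
# Hypothetical helper names; only the task identifier strings come from
# the transformers library.

# Task names as listed in this chapter, mapped to the task identifier
# strings accepted by transformers.pipeline().
CHAPTER_TASKS = {
    "Object Detection": "object-detection",
    "Image Classification": "image-classification",
    "Image Segmentation": "image-segmentation",
    "Video Classification": "video-classification",
}


def build_pipeline(task_name):
    """Create a pipeline for one of the chapter's four tasks.

    Assumption: no model is named, so the library falls back to its
    default model for the task and downloads it on first use.
    """
    # Imported lazily so the mapping above can be inspected even when
    # transformers is not installed.
    from transformers import pipeline

    return pipeline(task=CHAPTER_TASKS[task_name])


if __name__ == "__main__":
    # Print the mapping without downloading any model.
    for name, task_id in CHAPTER_TASKS.items():
        print(f"{name}: {task_id}")
```

For example, `build_pipeline("Image Classification")` would return a pipeline object that you can call on an image path or URL. The sections that follow choose specific models rather than relying on the defaults.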

4.2 Object Detection

4.2.1 Using the Model Directly

4.2.2 Using Transformers Pipeline

4.2.3 Binding to Webcam

4.3 Image Classification

4.4 Image Segmentation

4.4.1 Using the Model Programmatically

4.4.2 Binding to Gradio

4.5 Video Content Classification

4.5.1 Installing the Prerequisites

4.5.2 Downloading the Videos for Testing

4.5.3 Using the Transformers Pipeline Object

4.6 Summary