4 Using Hugging Face for Computer Vision Tasks
This chapter covers
- The different types of computer vision models on Hugging Face
- Various ways to use the models for object detection
- How to perform image classification
- How to perform image segmentation
- How to perform video content classification
In the previous chapter, you learned about Hugging Face transformers and pipelines. You also learned how to use some of the pre-trained models for NLP tasks, such as sentiment analysis and text translation. Besides NLP tasks, Hugging Face also provides a vast collection of pre-trained models for computer vision tasks (see Figure 4.1; https://huggingface.co/models).
Figure 4.1 Computer Vision-related models on Hugging Face

Using these hosted pre-trained models, you can create interesting applications such as detecting objects in images, estimating the age of a person, and more. In this chapter, you will use a number of these models for computer vision tasks.
4.1 Types of Computer Vision Models on Hugging Face
The computer vision models hosted on Hugging Face are grouped into the following tasks:
- Object Detection
- Image Classification
- Image Segmentation
- Video Classification
- Depth Estimation
- Image-to-Image
- Unconditional Image Generation
- Zero-Shot Image Classification
In this chapter, you will learn how to perform some of these tasks using the models hosted by Hugging Face. Specifically, you will work through the first four tasks listed above: object detection, image classification, image segmentation, and video classification.
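To make the task grouping concrete, the following minimal sketch maps each task name in the list above to the lowercase, hyphenated tag that the Hugging Face model page appears to use as its `pipeline_tag` filter, and builds a browsable URL for each. The tag strings and the `pipeline_tag` query parameter reflect how the site behaves at the time of writing and should be treated as assumptions; `CV_TASK_TAGS` and `models_url` are hypothetical names introduced only for this illustration.

```python
from urllib.parse import urlencode

# Task names from the list above, mapped to the tag assumed to be used
# as the `pipeline_tag` filter on https://huggingface.co/models.
CV_TASK_TAGS = {
    "Object Detection": "object-detection",
    "Image Classification": "image-classification",
    "Image Segmentation": "image-segmentation",
    "Video Classification": "video-classification",
    "Depth Estimation": "depth-estimation",
    "Image-to-Image": "image-to-image",
    "Unconditional Image Generation": "unconditional-image-generation",
    "Zero-Shot Image Classification": "zero-shot-image-classification",
}


def models_url(task_name: str) -> str:
    """Return the model-listing URL filtered to a single task (hypothetical helper)."""
    query = urlencode({"pipeline_tag": CV_TASK_TAGS[task_name]})
    return f"https://huggingface.co/models?{query}"


# Print the filtered URLs for the four tasks covered in this chapter.
for name in list(CV_TASK_TAGS)[:4]:
    print(f"{name}: {models_url(name)}")
```

Opening any of the printed URLs in a browser should show the models grouped under that task, which is a quick way to see what is available before choosing a model for the sections that follow.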