10 Visual embeddings

This chapter covers

  • Expressing similarity between images via loss functions
  • Training CNNs to learn an accurate embedding function
  • Using visual embeddings in real-world applications

by Ratnesh Kumar

Measuring meaningful relationships between images is a vital building block for many applications that touch our lives every day, such as face recognition and image search. To tackle such problems, we need to build an algorithm that extracts relevant features from images and then compares the images using those features.
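To make the extract-then-compare idea concrete, here is a minimal sketch (an illustration, not code from this chapter; the toy feature vectors and helper names are invented) that assumes each image has already been mapped to a feature vector and compares two such vectors with Euclidean distance and cosine similarity, two common similarity measures:

```python
import numpy as np

def euclidean_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Smaller distance means the two images are more alike."""
    return float(np.linalg.norm(a - b))

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Values close to 1.0 mean the two images are more alike."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors standing in for the CNN features of two images.
features_a = np.array([0.9, 0.1, 0.4])
features_b = np.array([0.8, 0.2, 0.5])

print(euclidean_distance(features_a, features_b))  # ~0.17
print(cosine_similarity(features_a, features_b))   # ~0.98
```

The hard part, and the subject of this chapter, is learning a feature extractor for which these simple geometric comparisons actually reflect semantic similarity.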

Ratnesh Kumar obtained his PhD from the STARS team at Inria, France, in 2014, where he focused on problems in video understanding: video segmentation and multiple object tracking. He also holds a Bachelor of Engineering from Manipal University, India, and a Master of Science from the University of Florida in Gainesville. He has co-authored several scientific publications on learning visual embeddings for re-identifying objects in camera networks.

In the previous chapters, we learned that we can use convolutional neural networks (CNNs) to extract meaningful features from an image. This chapter builds on that understanding to train a visual embedding layer jointly with the CNN. In this chapter’s context, visual embedding refers to the last fully connected layer (prior to a loss layer) appended to a CNN, and joint training means that the embedding layer and the CNN parameters are optimized together rather than separately.
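The following PyTorch sketch shows one way to wire this up (the framework choice, layer sizes, and class name are assumptions for illustration, not code from this book). Because the backbone and the embedding layer live in one module, a single optimizer updates all of their parameters together, which is exactly what joint training means:

```python
import torch
import torch.nn as nn

class EmbeddingNet(nn.Module):
    """A small CNN backbone with a fully connected embedding layer appended."""

    def __init__(self, embedding_dim: int = 128):
        super().__init__()
        self.backbone = nn.Sequential(   # stand-in convolutional feature extractor
            nn.Conv2d(3, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),     # (N, 64, 1, 1) regardless of input size
            nn.Flatten(),                # (N, 64)
        )
        # The "visual embedding": the last fully connected layer before the loss.
        self.embedding = nn.Linear(64, embedding_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.embedding(self.backbone(x))

model = EmbeddingNet()
# One optimizer over all parameters trains the CNN and the embedding jointly.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

embeddings = model(torch.randn(4, 3, 64, 64))  # four random RGB images
print(embeddings.shape)                        # torch.Size([4, 128])
```

A loss layer such as the contrastive or triplet loss covered in section 10.3 would then be applied on top of these embeddings, so a single backward pass updates both the embedding layer and the backbone.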

10.1 Applications of visual embeddings

10.1.1 Face recognition

10.1.2 Image recommendation systems

10.1.3 Object re-identification

10.2 Learning embedding

10.3 Loss functions

10.3.1 Problem setup and formalization

10.3.2 Cross-entropy loss

10.3.3 Contrastive loss

10.3.4 Triplet loss

10.3.5 Naive implementation and runtime analysis of losses

10.4 Mining informative data