Region Proposal Network (RPN)

Overview

The Region Proposal Network (RPN) is the component of the Faster R-CNN architecture that generates candidate regions of interest (RoIs) likely to contain objects. It operates directly on the feature map produced by the backbone network and, for each of a set of predefined anchor boxes, predicts an objectness score and bounding-box coordinates relative to that anchor. The resulting proposals are then passed to the rest of the Faster R-CNN pipeline for further processing.
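
The predefined anchor boxes can be pictured as a fixed set of boxes tiled over every cell of the feature map. The sketch below shows one way such a grid might be built. The function name make_anchors_sketch, the stride of 16, and the three sizes and three aspect ratios (giving k = 9 anchors per location) are illustrative assumptions, not code from this chapter.

```python
import torch


def make_anchors_sketch(feat_h, feat_w, stride=16,
                        sizes=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Tile k = len(sizes) * len(ratios) anchors over a feat_h x feat_w grid.

    Hypothetical helper. Returns a (feat_h * feat_w * k, 4) tensor of
    (x1, y1, x2, y2) boxes in image coordinates.
    """
    base = []
    for size in sizes:
        for ratio in ratios:          # ratio is height / width
            w = size / ratio ** 0.5   # keeps the area close to size ** 2
            h = size * ratio ** 0.5
            base.append([-w / 2, -h / 2, w / 2, h / 2])
    base = torch.tensor(base)         # (k, 4), centered on the origin

    # Center of every feature-map cell, mapped back to image coordinates.
    xs = (torch.arange(feat_w) + 0.5) * stride
    ys = (torch.arange(feat_h) + 0.5) * stride
    cy, cx = torch.meshgrid(ys, xs, indexing="ij")
    centers = torch.stack([cx, cy, cx, cy], dim=-1).reshape(-1, 1, 4)

    # Broadcast (h*w, 1, 4) + (k, 4) -> (h*w, k, 4), then flatten.
    return (centers + base).reshape(-1, 4)


anchors = make_anchors_sketch(feat_h=2, feat_w=3)
print(anchors.shape)  # torch.Size([54, 4])
```

The anchors are ordered location by location, with the k anchors of one cell stored next to each other, so a 2 × 3 feature map with k = 9 yields 54 boxes.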

Architecture

The RPN is a fully convolutional network (FCN): it consists solely of convolutional layers and can therefore process inputs of arbitrary size. The network slides a small window over the convolutional feature map output by the backbone network. At each window location, a 3 × 3 convolutional layer produces a 512-dimensional feature vector. This vector is then fed into two sibling 1 × 1 convolutional layers: a classifier that outputs objectness scores and a regressor that outputs bounding-box coordinates, one set for each of the k anchor boxes at that location.

Figure 11.14 RPN architecture. From each sliding window, a 512-dimensional feature vector is generated using 3 × 3 convs. A 1 × 1 conv layer (classifier) takes the 512-dimensional feature vector and generates 2k objectness scores. Similarly, another 1 × 1 conv layer (regressor) generates 4k bounding-box coordinates from the 512-dimensional feature vector.

The RPN’s design ensures translation invariance, meaning that objects of similar size and shape are detected consistently regardless of their position in the image. This is achieved by sharing convolutional weights across different positions on the feature map.
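
The effect of weight sharing can be checked numerically. The sketch below uses a small stand-in for the RPN head (16 channels and k = 3 are illustrative choices, not the values used in this chapter), shifts a random feature map one cell to the right, and compares the objectness outputs away from the borders, where zero padding and the wrapped-around column do not interfere.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Small stand-in for the RPN head; 16 channels and k = 3 are illustrative only.
k = 3
head = nn.Sequential(
    nn.Conv2d(16, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv2d(16, 2 * k, kernel_size=1),
)

features = torch.randn(1, 16, 8, 8)
# Shift the feature map one cell to the right along the width axis.
# torch.roll wraps the last column around to the front; that column and
# the padded borders are excluded from the comparison below.
shifted = torch.roll(features, shifts=1, dims=3)

with torch.no_grad():
    scores = head(features)
    scores_shifted = head(shifted)

# Interior columns of the shifted output match the original output
# moved by the same one-cell shift.
same = torch.allclose(
    scores_shifted[..., 2:7], scores[..., 1:6], atol=1e-5)
print(same)
```

Because the same convolutional weights are applied at every position, a pattern that moves in the feature map produces the same scores at the moved location.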

Implementation

PyTorch Code for the RPN FCN

The following PyTorch code snippet demonstrates the implementation of the fully convolutional network for the RPN:

import torch
import torch.nn as nn


class RPN_FCN(nn.Module):

    def __init__(self, k, in_channels=512):  #1
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(
                in_channels, 512, kernel_size=3,
                stride=1, padding=1),
            nn.ReLU(True))
        self.cls = nn.Conv2d(512, 2*k, kernel_size=1)
        self.reg = nn.Conv2d(512, 4*k, kernel_size=1)

    def forward(self, x):
        out = self.conv(x)                   #2

        # Move channels last before reshaping so that each
        # output row holds the values of exactly one anchor.
        rpn_cls_scores = self.cls(out).permute(
            0, 2, 3, 1).contiguous().view(   #3
            x.shape[0], -1, 2)
        rpn_loc = self.reg(out).permute(
            0, 2, 3, 1).contiguous().view(   #4
            x.shape[0], -1, 4)

        return (
            rpn_cls_scores,                  #5
            rpn_loc)                         #6
#1 Instantiates the small network that is convolved over the output conv feature map. It consists of a 3 × 3 conv layer followed by a 1 × 1 conv layer for classification and another 1 × 1 conv layer for regression.
#2 Output of the backbone: a convolutional feature map of size (batch_size, in_channels, h, w)
#3 Converts (batch_size, h, w, 2k) to (batch_size, h*w*k, 2)
#4 Converts (batch_size, h, w, 4k) to (batch_size, h*w*k, 4)
#5 (batch_size, num_anchors, 2) tensor representing the classification score for each anchor box
#6 (batch_size, num_anchors, 4) tensor representing the box coordinates relative to the anchor box

Loss Function

The RPN loss function combines a classification loss and a regression loss to train the network. The classification loss is cross-entropy over the sampled anchors. The regression loss is Smooth L1, computed only over the positive anchors and weighted by lambda_ in the total. The following code snippet illustrates the RPN loss function:

def rpn_loss(
    rpn_cls_scores, rpn_loc, rpn_labels,
    rpn_loc_targets, lambda_=10):                    #1

    classification_criterion = nn.CrossEntropyLoss(
        ignore_index=-1)                             #2
    reg_criterion = nn.SmoothL1Loss(reduction="sum")

    cls_loss = classification_criterion(rpn_cls_scores, rpn_labels)

    positive_indices = torch.where(rpn_labels == 1)[0] #3
    pred_positive_anchor_offsets = rpn_loc[positive_indices]
    gt_positive_loc_targets = rpn_loc_targets[positive_indices]
    # Avoid dividing by zero when no anchor is positive.
    num_positives = max(len(positive_indices), 1)
    reg_loss = reg_criterion(
        pred_positive_anchor_offsets,
        gt_positive_loc_targets) / num_positives
    return {
        "rpn_cls_loss": cls_loss,
        "rpn_reg_loss": reg_loss,
        "rpn_total_loss": cls_loss + lambda_ * reg_loss
    }
#1 rpn_cls_scores: (num_anchors, 2) tensor of RPN classifier scores for each anchor. rpn_loc: (num_anchors, 4) tensor of RPN regressor predictions for each anchor. rpn_labels: (num_anchors,) tensor holding the label of each anchor: 1 for positive, 0 for negative, and -1 for anchors that are ignored. rpn_loc_targets: (num_anchors, 4) tensor of RPN regressor targets for each anchor. lambda_: weight of the regression loss in the total loss.
#2 Ignores anchors labeled -1, because they are not sampled
#3 Finds the positive anchors
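
To make the role of ignore_index=-1 concrete, the short check below compares the cross-entropy computed over all anchors with the one computed only over the sampled anchors. The scores and labels are toy values made up for illustration.

```python
import torch
import torch.nn as nn

# Toy classifier scores for four anchors (made-up values).
scores = torch.tensor([[2.0, 0.5],
                       [0.1, 1.5],
                       [0.3, 0.3],
                       [1.0, 3.0]])
# 0 = negative, 1 = positive, -1 = not sampled.
labels = torch.tensor([0, 1, -1, -1])

# Cross-entropy over all anchors, skipping the ones labeled -1.
loss_all = nn.CrossEntropyLoss(ignore_index=-1)(scores, labels)

# Cross-entropy computed only on the sampled anchors.
sampled = labels != -1
loss_sampled = nn.CrossEntropyLoss()(scores[sampled], labels[sampled])

print(torch.isclose(loss_all, loss_sampled))
```

With the default mean reduction, the ignored anchors contribute neither to the sum nor to the count, so the two values agree.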

Generating Region Proposals

The RPN generates region proposals by applying the predicted offsets to the anchor boxes, clipping the resulting boxes to the image, discarding boxes that are too small, and keeping the boxes with the highest objectness scores. The following code snippet demonstrates how to generate region proposals from the RPN output:

rois = generate_bboxes_from_offset(rpn_loc, anchors)

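# RoIs are (x1, y1, x2, y2): clip x to the image width and y to the
# image height (height is assumed to be defined alongside width).
rois[:, 0::2] = rois[:, 0::2].clamp(min=0, max=width)    #1
rois[:, 1::2] = rois[:, 1::2].clamp(min=0, max=height)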

roi_heights = rois[:, 3] - rois[:, 1]     #2
roi_widths = rois[:, 2] - rois[:, 0]
min_roi_threshold = 16

valid_idxes = torch.where((roi_heights > min_roi_threshold) &
    (roi_widths > min_roi_threshold))[0]
rois = rois[valid_idxes]
valid_cls_scores = rpn_cls_scores[valid_idxes]


objectness_scores = valid_cls_scores[:, 1]

sorted_idx = torch.argsort(               #3
    objectness_scores, descending=True)
n_train_pre_nms = 12000
n_val_pre_nms = 300

rois = rois[sorted_idx][:n_train_pre_nms] #4

objectness_scores = objectness_scores[ 
    sorted_idx][:n_train_pre_nms]         #5
#1 Clips the RoIs to the image boundaries
#2 Discards RoIs whose height or width does not exceed min_roi_threshold
#3 Sorts based on objectness
#4 Selects the top regions of interest. Shape: (n_train_pre_nms, 4).
#5 Selects the top objectness scores. Shape: (n_train_pre_nms,).
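
The helper generate_bboxes_from_offset is not shown in this section. As a rough sketch of what such a helper might do, the function below assumes the commonly used Faster R-CNN parameterization, in which the offsets (tx, ty, tw, th) shift the anchor center by a fraction of the anchor size and scale its width and height exponentially, and it assumes boxes in (x1, y1, x2, y2) format. The name decode_offsets_sketch is hypothetical, and the actual helper may differ.

```python
import torch


def decode_offsets_sketch(offsets, anchors):
    """Apply (tx, ty, tw, th) offsets to (x1, y1, x2, y2) anchors.

    Hypothetical stand-in for generate_bboxes_from_offset.
    Both inputs have shape (num_anchors, 4).
    """
    anchor_w = anchors[:, 2] - anchors[:, 0]
    anchor_h = anchors[:, 3] - anchors[:, 1]
    anchor_cx = anchors[:, 0] + 0.5 * anchor_w
    anchor_cy = anchors[:, 1] + 0.5 * anchor_h

    tx, ty, tw, th = offsets.unbind(dim=1)
    cx = anchor_cx + tx * anchor_w   # shift the center by a fraction of the anchor size
    cy = anchor_cy + ty * anchor_h
    w = anchor_w * torch.exp(tw)     # scale width and height multiplicatively
    h = anchor_h * torch.exp(th)

    return torch.stack(
        [cx - 0.5 * w, cy - 0.5 * h, cx + 0.5 * w, cy + 0.5 * h], dim=1)


anchors = torch.tensor([[0.0, 0.0, 10.0, 20.0]])
unchanged = decode_offsets_sketch(torch.zeros(1, 4), anchors)
moved = decode_offsets_sketch(
    torch.tensor([[0.5, 0.0, 0.0, 0.0]]), anchors)
print(unchanged)  # zero offsets reproduce the anchor
print(moved)      # center moves right by half the anchor width
```

Under this parameterization, all-zero offsets leave the anchor unchanged, which is why the regressor only has to learn small corrections for well-placed anchors.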

In summary, the RPN lets Faster R-CNN propose regions that are likely to contain objects directly from the backbone's feature map. Its fully convolutional design allows it to process images of varying sizes and maintain translation invariance.