Improving object detection quality with structural constraints

Recent research has revealed that object detection networks trained with the simple "classification loss + localization loss" objective are not effectively optimized in many cases, while additional constraints on network features can effectively improve object detection quality. Specifically, some works have used constraints on training sample relations to successfully learn discriminative network features. Based on these observations, we propose Structural Constraint for improving object detection quality. Structural constraint supervises feature learning in the classification and localization network branches with Fisher Loss and Equi-proportion Loss respectively, by requiring the feature similarities of training sample pairs to be consistent with the corresponding ground truth label similarities. Structural constraint can be applied to all object detection network architectures with the help of our Proxy Feature design. Our experiment results show that the structural constraint mechanism is able to optimize the distribution of object class instances in network feature space, and consequently the detection results. Evaluations on the MSCOCO2017 and KITTI datasets show that our structural constraint mechanism helps baseline networks outperform modern counterpart detectors in terms of object detection quality.


Introduction
Object detection is a fundamental computer vision technology with a broad range of application scenarios, such as autonomous driving. It is a compound task of object classification and localization. Modern object detectors are trained by matching their detection results with ground truth labels and then minimizing a loss that measures the differences within these label-prediction matches. Each match's loss consists of two terms, measuring classification error and localization error respectively, and the complete loss is the sum of these two terms over all matches. In such a loss, each detection result is evaluated independently and is only required to fit its matched ground truth label. Although this loss form is simple, recent studies have revealed that in many cases object detection networks cannot be trained effectively by directly minimizing it [1], while other studies have shown that object detection quality can be effectively improved with additional constraints on intermediate network features [2]. In particular, recent work on network-based clustering [3] showed that constraints on mutual relations between training samples can guide feature learning for the benefit of the main task. This indicates that it is possible to optimize object class distributions in network feature space for the benefit of object recognition. It is therefore reasonable to expect that complementing the basic loss form of modern detectors with additional constraints on training sample relations in intermediate feature space will improve object detection quality. This work presents such training-sample-relation-based constraints on object detection network training. We name them Structural Constraints, because they influence the structure of the training sample distribution in the network feature space (as shown later in Fig 3).
Structural constraints append two terms to the basic loss, Fisher Loss and Equi-proportion Loss, which constrain the relations of training samples in the classification branch space and the localization branch space respectively. For an arbitrary pair of training samples, Fisher loss measures the difference between pairwise sample feature similarity and pairwise classification target similarity, while equi-proportion loss measures the difference between pairwise sample feature similarity and pairwise localization target similarity. Under the constraint of these two terms, the training sample feature distributions in the classification branch and the localization branch more closely resemble the ground truth label distributions. As a result, the features of these network branches can be more easily transformed into accurate detection results. Structural constraints can be applied to object detection networks of various architectures, such as single-stage, two-stage and multi-stage networks, without changing the original network structures or affecting inference speed. In our experiments, we evaluated the effect of structural constraints on representative object detection networks of various architectures on different image datasets. The results demonstrate that structural constraints noticeably improve object detection quality on a broad range of detectors.
To summarize, the novel contributions of this work are:
• proposing the Fisher loss function as part of structural constraints, to constrain training sample feature relations for improving the classification performance of object detection networks;
• proposing the equi-proportion loss function as part of structural constraints, to constrain training sample feature relations for improving localization performance;
• a mechanism for applying structural constraints to various object detection network architectures.
The rest of this paper is organized as follows: Section 2 reviews related works, Section 3 describes in detail structural constraint and the mechanism of applying it to networks, Section 4 presents our experiment results and analysis, and finally Section 5 concludes this work.

Related works
In this section, we review previous works closely related to the structural constraints proposed in this work, confining the scope to works based on neural networks. First, we review deep learning models for image recognition that use feature learning constraints; then, we review representative object detection networks of various architectures.

Feature learning constraints
Feature learning constraints are widely adopted in deep-learning-based image recognition domain. Some works on object detection use feature learning constraints to improve object detection quality. RIFD-CNN [2] used two types of constraints on its network's intermediate layer features, one for rotation invariance and one for Fisher discrimination. Its rotation invariance constraint requires the intermediate feature representation of each training sample image to be similar to the average intermediate feature representation of the rotated versions of the image, so the subsequent classification based on this type of features will be robust against influence of object rotation. Its Fisher discrimination constraint requires each class's training sample intermediate features to lie close to the mean of the class, and each class's mean feature to lie distant from the global mean of all classes, so the subsequent classification layer could easily and accurately separate the classes from each other. Using these two constraints, RIFD-CNN achieved significant object detection accuracy improvement.
DETR [4] is another object detection network using feature learning constraints. DETR uses a transformer to process feature maps from its backbone into detection results. Its transformer decoder consists of multiple attention layers, and the detection results are produced by the last of these layers. However, the intermediate features of the other decoder layers are also required to be transformed into accurate detection results through the same detection head shared with the last layer. This deep supervision is in essence a type of feature learning constraint: the supervision on the intermediate attention layers constrains their output features to facilitate the subsequent inference for better detection accuracy.
Feature learning constraints have also been used to solve image clustering problems. The Deep Self-evolution Clustering (DSEC) [3] network and the Deep Adaptive Clustering (DAC) [5] network constrain the pairwise relationships of their output features so that these features directly express cluster identities. The constraints of these clustering networks require the dot products of arbitrary pairs of output features to be close to the corresponding pseudo labels, which reflect the cosine similarities of the feature pairs. As a result, training under this type of constraint gradually turns the output features into one-hot vectors that express cluster identities directly. In fact, constraints on pairwise feature relationships are the only content of these two clustering networks' training objective functions.
Compared with the feature learning constraints in the works described above, the structural constraints in this work exhibit both similarities and differences. Like RIFD-CNN, structural constraints are applied over intermediate layer features of object detection networks; like DSEC and DAC, structural constraints are based on pairwise training sample feature relations. However, the combination of these two characteristics is absent from all these works. Besides, as constraints for object detection networks, RIFD-CNN's constraints are applied to classification only, while structural constraints are applied to both classification and localization. Furthermore, RIFD-CNN did not constrain inter-class training sample relations, while the Fisher loss in our structural constraints constrains both inter- and intra-class relations over all training sample pairs.

Object detection network architectures
Until now, object detection networks exhibited two types of architectures: networks generating detections in a single stage, and networks generating detections through several stages of refinements. We review these architectures below.
2.2.1 Single-stage object detection networks. Single-stage object detection networks transform input images' backbone feature maps into detection results directly, through a single detection head. SSD [6] is the forerunner of this architecture. It scatters boxes of various sizes and aspect ratios over input images' feature maps and infers classes and adjustments for these boxes to form detection results. Detection results across feature map pyramid levels are combined to infer the final detection results. The initially scattered boxes have since become known as anchors.
YOLO [7] is another single-stage detection network that is fast at inference. It additionally infers a confidence value for each bounding box, which represents the probability that an object exists within the bounding box, and these confidence values participate in the decision of final detection results. However, YOLO's detection quality is unsatisfactory.
RetinaNet [1] is a single-stage object detection network with high detection quality. It focuses on dealing with the imbalance between foreground and background training samples, which is a crucial cause of the poor detection quality of many other single-stage networks. It proposes Focal Loss to replace the widely adopted cross entropy loss for the classification task. By using focal loss, RetinaNet assigns more weight to poorly classified hard samples during training, which makes the trained network generalize better to test data.
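To make the re-weighting effect of focal loss concrete, the following is a minimal pure-Python sketch of the binary form, FL(p_t) = -α_t (1 - p_t)^γ log(p_t). The function names are illustrative, and the default values γ = 2 and α = 0.25 are assumed here as the commonly cited settings; this is not the implementation used in the experiments of this work.

```python
import math

def cross_entropy(p, y):
    """Binary cross entropy for one sample.

    p: predicted foreground probability in (0, 1); y: 1 (foreground) or 0 (background).
    """
    p_t = p if y == 1 else 1.0 - p
    return -math.log(p_t)

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss: cross entropy scaled by alpha_t * (1 - p_t) ** gamma.

    gamma and alpha defaults are assumptions for illustration.
    """
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

# A well-classified sample (p_t = 0.9) is down-weighted far more strongly
# than a poorly classified one (p_t = 0.1).
easy_scale = focal_loss(0.9, 1) / cross_entropy(0.9, 1)
hard_scale = focal_loss(0.1, 1) / cross_entropy(0.1, 1)
```

With gamma set to 0 the modulating factor disappears and the loss reduces to an α-weighted cross entropy, which is a quick sanity check for any implementation.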
2.2.2 Object detection networks with several stages. Another kind of object detection network consists of more than one stage. These networks can be further divided into two groups according to the number of network stages: two-stage networks and multi-stage (more than two stages) networks. The first stage of all these networks is responsible for generating region proposals, also known as RoIs (regions of interest). Two-stage networks then refine the region proposals with a detection head to produce final detection results, while multi-stage networks refine the region proposals with several detection heads in sequence. We review representatives of these architectures below.
Two-stage object detection networks. Two-stage object detection networks appeared early among all architectures, and usually produce better detection quality than single-stage networks. Faster RCNN [8] is the forerunner of this architecture. Faster RCNN introduced the RPN (Region Proposal Network) on the basis of Fast RCNN [9]. The RPN takes backbone feature maps as input and infers RoIs and corresponding confidence values. These RoIs are then used to extract features from the backbone feature maps through the RoI pooling operation, and these features are passed into a fully connected detection head to infer detection results.
R-FCN [10] focuses on accelerating the inference of Faster RCNN by reducing redundant computation in the detection head. R-FCN's detection head consists of convolutional layers, and generates a special feature map whose different channels are sensitive to different parts of target objects. RoI pooling over this feature map can then easily decide whether an RoI accurately localizes an object, and the corresponding class, by filling RoI parts with features from the corresponding channels. Since most of the necessary computation is done by the convolutional detection head and the remaining RoI pooling operations cost little computation, R-FCN's inference is time-efficient.
Double-head RCNN [11] is another two-stage network whose second stage is composed of two detection heads in parallel, one fully connected head and one convolutional head. This design is based on the observation that fully connected layers are sensitive to spatial completeness of objects, while convolutional layers are robust against occlusion and deformation. Thus Double-head RCNN uses its fully connected head to infer classification scores which should reflect localization quality, and uses its convolutional head to infer bounding boxes to better generalize to various object appearances and influencing contents.
Multi-stage object detection networks. Multi-stage object detection networks extend the two-stage architecture by appending additional detection heads, refining RoIs with more stages of inference. Cascade RCNN [12] is a typical multi-stage object detection network. During Cascade RCNN training, each stage's detection head is trained on the detection results of its previous stage. At inference, each stage's detection head takes features from RoI pooling based on its previous stage's detection boxes, and generates new detection results. The final detection results take the output boxes of the last stage's detection head as localization results, and take the averages of all detection heads' class scores as classification results. The additional network stages improved detection quality noticeably, making Cascade RCNN one of the most accurate object detectors at the time.
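The final-result rule described above (boxes from the last stage, class scores averaged over all stages) can be sketched as follows. This is a minimal illustration for a single RoI; the function name and the plain-list data layout are assumptions made for this example, not the MMDetection API.

```python
def fuse_cascade_outputs(stage_boxes, stage_scores):
    """Combine per-stage outputs for one RoI.

    stage_boxes:  list over stages of [x1, y1, x2, y2] boxes.
    stage_scores: list over stages of per-class score lists (same length each).
    Returns the last stage's box and the class scores averaged over stages.
    """
    final_box = stage_boxes[-1]
    num_stages = len(stage_scores)
    num_classes = len(stage_scores[0])
    final_scores = [
        sum(scores[c] for scores in stage_scores) / num_stages
        for c in range(num_classes)
    ]
    return final_box, final_scores

# Three stages, two classes (hypothetical numbers).
boxes = [[0, 0, 10, 10], [1, 1, 11, 11], [2, 2, 12, 12]]
scores = [[0.6, 0.4], [0.8, 0.2], [0.7, 0.3]]
box, avg_scores = fuse_cascade_outputs(boxes, scores)
```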
Hybrid Task Cascade [13] is a multi-stage network capable of both object detection and instance segmentation. Hybrid Task Cascade inherited the network structure of Cascade RCNN, and introduced additional components and links. It introduced a semantic segmentation convolutional branch to provide helpful inputs to its detection heads and mask heads. The detection quality of Hybrid Task Cascade is outstanding in multi-stage group, but the whole of its network is cumbersome.
All the representative object detection networks mentioned above, and many others, lack constraints on relationships between training samples in feature space, so the structural constraints proposed in this work are able to complement them in this respect. In the next section, we show that structural constraints are applicable to all these architectures through a unified mechanism.

Structural constraint mechanism
In this section, we describe the structural constraint mechanism for object detection in detail. First, we explain the motivation of structural constraints. Then, we present their definition. Finally, we describe the mechanism of combining structural constraints with object detection networks.

Motivation
Structural constraints are motivated by two observations: first, modern object detection networks lack constraints on training sample relationships; second, feature learning has proved important in many other image recognition tasks. As described in Section 1, the loss functions of most modern object detection networks usually take the following form:

$$L = \sum_i \left[ L_{\mathrm{cls}}(p_i, p_i^{gt}) + L_{\mathrm{loc}}(b_i, b_i^{gt}) \right] \qquad (1)$$

where $L_{\mathrm{cls}}$ and $L_{\mathrm{loc}}$ are two loss terms measuring classification error and localization error respectively. For each match $i$, the difference between the estimated class probability vector $p_i$ and the corresponding ground truth vector $p_i^{gt}$ is calculated by $L_{\mathrm{cls}}$, and the difference between the estimated bounding box $b_i$ and the corresponding ground truth box $b_i^{gt}$ is measured by $L_{\mathrm{loc}}$. Loss functions like this only force each detection result to fit its matched ground truth. They are simple in form, but cannot be effectively minimized in many cases, since the supervision on object classification can be severely affected by the large number of background training samples [1]. We observed that additional supervision on one training sample can come from the other training samples, since a training sample can be represented by its relative differences from the others. This can be understood by looking at works on image clustering, such as DSEC [3], where the clustering network was effectively trained under supervision on the similarity of each pair of training samples. Thus, structural constraints are designed to supervise the differences between each pair of sample detections. Because object detection consists of classification and localization, structural constraints use two types of loss functions, namely Fisher loss and equi-proportion loss, to measure sample pairs' classification differences and localization differences respectively.
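As a concrete illustration of the basic objective in Eq (1), the sketch below sums a classification term and a localization term over all matches. The specific choices of cross entropy for the classification term and smooth L1 for the localization term are assumptions made for this example (the text leaves both terms generic), and all function names are hypothetical.

```python
import math

def cross_entropy(p, p_gt):
    """Assumed classification term: -sum_c p_gt[c] * log(p[c])."""
    return -sum(t * math.log(max(q, 1e-12)) for q, t in zip(p, p_gt))

def smooth_l1(b, b_gt, beta=1.0):
    """Assumed localization term: smooth L1 summed over box coordinates."""
    total = 0.0
    for x, t in zip(b, b_gt):
        d = abs(x - t)
        total += 0.5 * d * d / beta if d < beta else d - 0.5 * beta
    return total

def basic_detection_loss(matches):
    """Sum of both terms over all matches.

    matches: list of (p, p_gt, b, b_gt) tuples, one per label-prediction match.
    Each match is scored independently of all the others.
    """
    return sum(
        cross_entropy(p, p_gt) + smooth_l1(b, b_gt)
        for p, p_gt, b, b_gt in matches
    )

# One match with an uncertain class prediction and an imperfect box.
example = [([0.5, 0.5], [1, 0], [0.0, 0.0, 0.0, 0.0], [0.5, 0.0, 0.0, 2.0])]
loss = basic_detection_loss(example)
```

Note that no term in this loss relates one match to another, which is the gap that structural constraints are intended to fill.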
We also observed that proper supervision on object detection networks' intermediate features could effectively improve detection quality. Examples are RIFD-CNN [2]'s rotation invariance constraint and Fisher discrimination constraint on its backbone's intermediate layers, and DETR [4]'s auxiliary supervisions on multiple levels of transformer decoders. Apart from this, we try to avoid disrupting optimization of the main objective in Eq (1). Thus, instead of being applied over object detection networks' final outputs, structural constraints are applied over the networks' intermediate features to guide the feature learning.

Definition
Structural constraints take training samples' intermediate features as input. To supervise training samples' relations during classification and localization, structural constraints use Fisher loss and equi-proportion loss to constrain pairwise feature differences respectively. These losses in structural constraints and the basic object detection objective in Eq (1) altogether form the new training objective.
Fisher loss in structural constraints calculates the similarity between an arbitrary pair of intermediate features of training samples, and supervises it with the similarity of the corresponding pair of class labels. It is expressed as:

$$L_{\mathrm{Fisher}}(i, j) = \left( \sigma(f_i)^{\top} \sigma(f_j) - (p_i^{gt})^{\top} p_j^{gt} \right)^2 \qquad (2)$$

where $\sigma(\cdot)$ is the sigmoid function, $f_i$ is a transformed intermediate feature vector of training sample $i$, and $p_i^{gt} \in [0, 1]^C$ is the corresponding one-hot class label, with $C$ being the number of object classes. Fisher loss $L_{\mathrm{Fisher}}$ calculates the similarity between $f_i$ and $f_j$, and the similarity between $p_i^{gt}$ and $p_j^{gt}$, both in terms of dot product. The squared difference between these two similarities is used as the loss value. To make the comparison between these similarities fair, $f_i$ is obtained by linearly transforming the intermediate feature into the same dimensionality as $p_i^{gt}$. Since $f_i$ acts as a proxy of the intermediate feature, we name it Proxy Feature. Before calculating the similarity, the elements of the proxy feature vectors are transformed by $\sigma(\cdot)$ into the same range $[0, 1]$ as $p_i^{gt}$. By supervising the similarity between proxy feature vectors, Fisher loss drives the similarity between the underlying intermediate features to be consistent with the similarity of the corresponding class labels. As a result, Fisher loss reduces the intra-class variance and increases the inter-class separation of the training sample distribution, which benefits object classification.
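The pairwise Fisher loss described above can be sketched in a few lines of pure Python. This is a minimal illustration that follows the textual description (sigmoid on proxy features, dot-product similarities, squared difference); the function names are hypothetical and the proxy features are passed in as plain lists rather than produced by a network.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def fisher_loss(f_i, f_j, p_gt_i, p_gt_j):
    """Squared gap between proxy-feature similarity and label similarity.

    f_i, f_j:       proxy features, same length C as the class labels.
    p_gt_i, p_gt_j: one-hot class labels of the two training samples.
    """
    s_feat = dot([sigmoid(x) for x in f_i], [sigmoid(x) for x in f_j])
    s_label = dot(p_gt_i, p_gt_j)  # 1 if same class, 0 otherwise
    return (s_feat - s_label) ** 2

# Two samples of the same class whose proxy features agree: loss near 0.
same = fisher_loss([20, -20, -20], [20, -20, -20], [1, 0, 0], [1, 0, 0])
# Two samples of different classes with identical proxy features: loss near 1.
diff = fisher_loss([20, -20, -20], [20, -20, -20], [1, 0, 0], [0, 1, 0])
```

For one-hot labels the target similarity is 1 for same-class pairs and 0 otherwise, so minimizing this loss pulls same-class proxy features together and pushes different-class ones apart.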
Equi-proportion loss is the other loss term in structural constraints. It also measures the similarity between an arbitrary pair of intermediate features, but supervises it with the corresponding pair of localization labels. It is expressed as:

$$L_{\mathrm{equip}}(i, j) = \left\| f'_i \oslash f'_j - b_i^{gt} \oslash b_j^{gt} \right\|_2^2 \qquad (3)$$

where $\oslash$ denotes elementwise division, $f'_i$ is the proxy feature of training sample $i$, and $b_i^{gt} \in \mathbb{R}^4$ is the corresponding localization label. $f'_i$ is linearly transformed from the intermediate feature into the same dimensionality as $b_i^{gt}$, to facilitate the comparison between training sample difference and localization label difference. Since $b_i^{gt}$ and $b_j^{gt}$ are not bounded, we measure their relative difference in terms of elementwise ratios, and the difference between $f'_i$ and $f'_j$ is measured in the same way. The squared magnitude of the difference between these two sets of ratios is used as the value of $L_{\mathrm{equip}}$. Under the guidance of equi-proportion loss, the intermediate features of training samples tend to be sensitive enough to reflect the differences between their localization labels, which benefits bounding box regression.
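The equi-proportion loss described above can be sketched similarly. The guard against near-zero denominators is an assumption added so that the example runs safely (the text does not say how zero components are handled), and the function names are hypothetical.

```python
def elementwise_ratio(u, v, eps=1e-6):
    """Elementwise u / v, with denominators clamped away from zero (assumed guard)."""
    out = []
    for a, b in zip(u, v):
        if abs(b) < eps:
            b = eps if b >= 0 else -eps
        out.append(a / b)
    return out

def equi_proportion_loss(fp_i, fp_j, b_gt_i, b_gt_j):
    """Squared magnitude of the gap between feature ratios and label ratios.

    fp_i, fp_j:     4-dimensional proxy features of the two samples.
    b_gt_i, b_gt_j: 4-dimensional localization labels of the two samples.
    """
    feat_ratio = elementwise_ratio(fp_i, fp_j)
    label_ratio = elementwise_ratio(b_gt_i, b_gt_j)
    return sum((r - s) ** 2 for r, s in zip(feat_ratio, label_ratio))

# Proxy features and labels are in the same 2:1 proportion: loss is 0.
matched = equi_proportion_loss([2, 4, 6, 8], [1, 2, 3, 4],
                               [10, 20, 30, 40], [5, 10, 15, 20])
# Labels are identical (ratio 1) but features are in 2:1 proportion.
mismatched = equi_proportion_loss([2, 4, 6, 8], [1, 2, 3, 4],
                                  [10, 20, 30, 40], [10, 20, 30, 40])
```

Because only ratios are compared, the proxy features do not need to match the scale of the localization labels, only their proportions between samples.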
After applying the structural constraint consisting of Fisher loss and equi-proportion loss, the object detection network training objective is rewritten as:

$$L_{\mathrm{total}} = \sum_i \left[ L_{\mathrm{cls}}(p_i, p_i^{gt}) + L_{\mathrm{loc}}(b_i, b_i^{gt}) \right] + \sum_{i, j} \left[ L_{\mathrm{Fisher}}(i, j) + L_{\mathrm{equip}}(i, j) \right] \qquad (4)$$

where Fisher loss and equi-proportion loss are evaluated for all pairs of training samples. This sum of the original loss and the structural constraint terms is back-propagated during end-to-end training of the object detection network. Thus, training with this new objective not only optimizes the main objective of object detection, but also optimizes the structure of the training sample distribution in intermediate feature space, which benefits the main objective in return.
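The combined objective can be sketched as the basic loss plus the two pairwise terms accumulated over sample pairs. In this minimal illustration the pairwise losses are passed in as callables; summing over unordered pairs with unit weights is an assumption of the sketch, since the text only states that all pairs are evaluated.

```python
from itertools import combinations

def structural_objective(base_loss, samples, fisher_fn, equip_fn):
    """Basic detection loss plus pairwise structural terms.

    base_loss: value of the original classification + localization loss.
    samples:   training samples (any objects the pairwise functions accept).
    fisher_fn, equip_fn: callables taking two samples and returning a float.
    Assumes unordered pairs and unit weights on both terms.
    """
    total = base_loss
    for a, b in combinations(samples, 2):
        total += fisher_fn(a, b) + equip_fn(a, b)
    return total

# Dummy pairwise losses, only to show how the terms accumulate.
value = structural_objective(
    base_loss=2.0,
    samples=["s0", "s1", "s2"],
    fisher_fn=lambda a, b: 1.0,
    equip_fn=lambda a, b: 0.5,
)
```

For N samples the loop visits N(N-1)/2 pairs, so the cost of the structural terms grows quadratically with the number of training samples per batch.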

Combination with various object detection architectures
Structural constraints supervise the intermediate features of object detection networks, that is, they are applied over intermediate network layers. How they are combined with a network therefore depends on the form of these layers, which differs among object detection architectures. We describe below how structural constraints are combined with single-stage, two-stage and multi-stage object detection networks respectively.
Single-stage case. The detection heads of single-stage object detection networks transform backbone feature maps with two-dimensional convolution (Conv2D) to generate classification outputs and localization outputs. Because the dimensionality of the proxy features used in Fisher loss and equi-proportion loss must match the dimensionality of the classification outputs and localization outputs respectively, structural constraints in single-stage networks use Conv2D layers to transform the intermediate features of training samples into the needed proxy features. This can be expressed as:

$$\{f_i\}_i = \mathrm{Conv2D}_{\mathrm{Fisher}}(F), \qquad \{f'_i\}_i = \mathrm{Conv2D}_{\mathrm{equip}}(F) \qquad (5)$$

where $\mathrm{Conv2D}_{\mathrm{Fisher}}$ and $\mathrm{Conv2D}_{\mathrm{equip}}$ are convolution layers generating proxy features for Fisher loss and equi-proportion loss respectively, and $F$ is the intermediate feature collection. $\mathrm{Conv2D}_{\mathrm{Fisher}}$ and $\mathrm{Conv2D}_{\mathrm{equip}}$ take $F$ as input and generate the proxy feature collections $\{f_i\}_i$ and $\{f'_i\}_i$. Note that $F$, $\{f_i\}_i$ and $\{f'_i\}_i$ take the form of feature tensors in this case. With the proxy features obtained, the rest of the structural constraint evaluation is exactly as described in Section 3.2. The complete mechanism in the single-stage case is illustrated in Fig 1a.
Two-stage case. Two-stage object detection networks first generate RoIs with their RPNs, and then their detection heads infer detection results from these RoIs. Their detection heads usually consist of fully-connected (FC) layers. Thus, for the same reason as in the single-stage case, we set up special FC layers for transforming intermediate features into proxy features whose dimensionality matches the detection head outputs. This can be expressed as:

$$\{f_i\}_i = \mathrm{FC}_{\mathrm{Fisher}}(F), \qquad \{f'_i\}_i = \mathrm{FC}_{\mathrm{equip}}(F) \qquad (6)$$

where $\mathrm{FC}_{\mathrm{Fisher}}$ and $\mathrm{FC}_{\mathrm{equip}}$ are the FC layers that generate proxy features for Fisher loss and equi-proportion loss respectively. In this case, the intermediate feature collection $F$ comes from RoI pooling. The rest of the structural constraint evaluation is again as described in Section 3.2. Apart from the detection heads, structural constraints can also be applied to the RPNs of two-stage networks, because these RPNs are identical in form to single-stage networks' detection heads. This means the aforementioned mechanism for the single-stage case can be directly applied to these RPNs. The complete structural constraint mechanism for the two-stage case is illustrated in Fig 1b.
Multi-stage case. Multi-stage object detection networks extend the two-stage architecture by using multiple detection heads to refine detection results sequentially. Compared with two-stage networks, the constituent modules of multi-stage networks remain unchanged. This means structural constraints are applied to the detection heads and RPNs of multi-stage networks in exactly the same way as in the two-stage case. For structural constraints on detection heads, the proxy features are generated in the same manner as Eq (6); on RPNs, they are generated in the same manner as Eq (5). The rest of the structural constraint evaluation again follows Section 3.2. The complete mechanism for the multi-stage case is illustrated in Fig 1c.
In all cases above, the structural constraint mechanism exists only during the training of these object detection networks, and guides the intermediate feature learning by operating on proxy features. At inference time, all calculations related to structural constraints are absent, as are all layers exclusive to them ($\mathrm{Conv2D}_{\mathrm{Fisher/equip}}$, $\mathrm{FC}_{\mathrm{Fisher/equip}}$), so the detection speed and deployment size of these networks are not affected.
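The training-only role of the proxy layers can be sketched with a toy fully connected head, corresponding to the FC case. This is a minimal pure-Python illustration under stated assumptions: the class name, the single-layer heads and the random initialization are all hypothetical, and a real implementation would use a deep learning framework's layers instead.

```python
import random

def make_linear(in_dim, out_dim, rng):
    """Create (weight, bias) for a toy fully connected layer."""
    weight = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    return weight, [0.0] * out_dim

def linear(x, weight, bias):
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]

class DetectionHeadWithProxies:
    """Toy detection head with extra proxy layers used only in training."""

    def __init__(self, feat_dim, num_classes, seed=0):
        rng = random.Random(seed)
        self.cls = make_linear(feat_dim, num_classes, rng)
        self.loc = make_linear(feat_dim, 4, rng)
        # Proxy layers: output sizes match the class labels and box labels.
        self.proxy_fisher = make_linear(feat_dim, num_classes, rng)
        self.proxy_equip = make_linear(feat_dim, 4, rng)

    def forward(self, feature, training):
        out = {"cls": linear(feature, *self.cls), "loc": linear(feature, *self.loc)}
        if training and self.proxy_fisher is not None:
            out["f"] = linear(feature, *self.proxy_fisher)       # for Fisher loss
            out["f_prime"] = linear(feature, *self.proxy_equip)  # for equi-proportion loss
        return out

    def strip_for_inference(self):
        """Drop the proxy layers; the detection outputs are unaffected."""
        self.proxy_fisher = None
        self.proxy_equip = None

head = DetectionHeadWithProxies(feat_dim=8, num_classes=3)
feature = [0.1] * 8
train_out = head.forward(feature, training=True)
head.strip_for_inference()
infer_out = head.forward(feature, training=False)
```

Because the proxy layers branch off the shared intermediate feature and feed only the structural losses, removing them changes neither the detection outputs nor the size of the deployed head.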

Experiments
To verify the effectiveness of structural constraints, we experimented with multiple object detection networks over several image datasets, and examined the training processes and network behaviors. In this section, we present these experiment results.

Experiment settings
We first describe the settings of the experiments, which include the settings of networks, training and testing. All hyper-parameters listed below are set to the default values in the MMDetection [14] configuration files.
Networks. The default settings of object detection networks used in the experiments are: ResNet-101 [15] as backbone, FPN [16] as neck, and Greedy NMS [17] for post-processing. All multi-stage networks use 3 stages of detection heads. All object detection networks are implemented with MMDetection toolbox [14] and PyTorch deep learning library [18].

Experiment results
We present experiment results on structural constraint mechanism in this subsection. Firstly, we present ablation evaluation results to show influences of different factors in the mechanism. Then, we compare object detection quality of our structural-constraint-applied networks with other modern detectors. Finally, we analyze behaviors of structural constraint mechanism through visualization.

Ablation evaluation.
We performed ablation evaluations on the structural constraint mechanism to investigate the influence of different factors on object detection quality, including the constituent loss terms $L_{\mathrm{Fisher}}$ and $L_{\mathrm{equip}}$ as well as different ways of combining them with networks. We report our evaluation results on two widely used image datasets, MSCOCO2017 [19] and KITTI [20], respectively.
Evaluations on MSCOCO2017. For ablation on MSCOCO2017, all object detection networks are trained on the train2017 subset and tested on the val2017 subset. We choose RetinaNet as the evaluation subject for the single-stage architecture, Faster RCNN for two-stage, and Cascade RCNN for multi-stage. The ablation evaluation results are shown in Table 1. Network names containing "+ $L_{\mathrm{Fisher/equip}}$" indicate that Fisher loss or equi-proportion loss is applied to the detection heads of those networks, and names with "+ $L^2_{\mathrm{Fisher/equip}}$" indicate that Fisher loss or equi-proportion loss is applied to both the detection heads and the RPNs of those networks (in the two- or multi-stage case). It could be observed that the structural constraint mechanism is able to improve the object detection quality of all network subjects on this general object detection task. Specifically, the complete structural constraint mechanism that includes both Fisher loss and equi-proportion loss produced the most obvious improvement in some cases, such as Faster RCNN + $L^2_{\mathrm{Fisher}}$ + $L^2_{\mathrm{equip}}$. We also evaluated the influence of batch size: the networks marked with "*" are trained with smaller batch sizes (half). It could be observed that the structural constraint mechanism is robust against batch size changes.
Evaluations on KITTI. We use the 2D object detection subset of KITTI to perform ablation evaluations, which contains 7481 labeled driver-view images. For all evaluated network subjects, the first 6000 images are used for training and the remaining 1481 images for testing. We adopted Pascal-VOC-style metrics, which evaluate class-wise average precisions and the global mean average precision (MAP). We choose RetinaNet and SSD as evaluation subjects for the single-stage architecture, Faster RCNN for two-stage, and Cascade RCNN for multi-stage. The evaluation results are shown in Table 2. It could be observed that the structural constraint mechanism is able to improve object detection quality for all these network architectures. It is also observable that the improvement happened on multiple classes simultaneously, as in the case of Faster RCNN + $L^2_{\mathrm{Fisher}}$. Besides, the structural constraint mechanism still exhibits robustness against batch size settings, which could be observed from the evaluations on Cascade RCNN.
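The KITTI train/test split described above can be sketched as follows. Ordering the images by their identifiers and the zero-padded six-digit identifier format are assumptions of this example; the function name is hypothetical.

```python
def split_kitti(image_ids, num_train=6000):
    """Use the first num_train images (in sorted id order) for training, the rest for testing."""
    ordered = sorted(image_ids)
    return ordered[:num_train], ordered[num_train:]

# 7481 labeled images, identified here by zero-padded indices (assumed format).
all_ids = ["%06d" % i for i in range(7481)]
train_ids, test_ids = split_kitti(all_ids)
```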

Comparison with other object detectors.
We present object detection quality comparisons between modern object detectors and our networks with structural constraints in this subsection. These comparisons were carried out over MSCOCO2017 and KITTI. We give descriptions respectively in the following.
Comparison on MSCOCO2017. The training set and testing set for this comparison are the same as in the previous subsection. The evaluation results are presented in Table 3. SCM-Two and SCM-Multi are our two-stage and multi-stage object detection networks with structural constraint mechanisms. SCM-Two is configured as Faster RCNN + $L^2_{\mathrm{Fisher}}$ + $L^2_{\mathrm{equip}}$, and SCM-Multi as Cascade RCNN + $L^2_{\mathrm{Fisher}}$. SSD300 and SSD512 are SSD networks with input image sizes of 300 × 300 and 512 × 512 respectively. It could be observed that our SCM-Two [...] Table 4. Since KITTI's leaderboard publishes detection precisions on car, pedestrian and cyclist, we compare performances on these three classes and the global mean average precisions (MAP). It could be observed that our SCM-Multi network achieved top values on all these metrics. These ablation evaluations and comparisons with other modern detectors on different datasets show that the structural constraint mechanism is able to improve object detection quality on various network architectures, and is able to help some prototype networks achieve advanced performance.

Visualization analysis
We analyze behaviors of structural constraint mechanism during training and testing in this subsection. For this purpose, we visualized changing of the loss terms in structural constraint, their influences on feature space and some final detection results.
Changing of loss values. We plotted curves of Fisher loss and equi-proportion loss during training of object detection networks of different architectures. The observation subjects include RetinaNet, SSD, Faster RCNN and Cascade RCNN, all with structural constraints applied. These loss curves are shown in Fig 2. Both losses were obviously dropping during all these training processes. This observation indicates that the loss terms in structural constraints are effectively minimized, so they are indeed guiding networks' training.
Influence on network feature space. To observe the influence of the structural constraint mechanism on object detection networks' feature spaces, we adopted t-SNE [44] to project high-dimensional backbone features to 2D space for visualization. These backbone features were obtained by feeding the networks with images of object classes. The images are sampled from KITTI according to its bounding box labels and are of class Car or Pedestrian (Ped). The extracted backbone features are then resized to a uniform size for the convenience of the t-SNE transform. The visualization results are shown in Fig 3. The network subjects are Faster RCNN and Cascade RCNN. It could be observed that with a greater extent of structural constraint application, the distributions of Car and Ped are less mixed and easier to separate. This behavior benefits object classification, and is consistent with the intention of structural constraints.
Detection result visualization. In Fig 4, we visualized some detection results on MSCOCO2017 images (val2017). We compared the detection results of Faster RCNNs with and without structural constraints applied. It could be observed that applying structural constraints made the detector more accurate at localization and produced fewer false positives.

Conclusion
In this work, we introduced our structural constraint mechanism for improving object detection quality. Structural constraint mechanism supervises object detection networks' intermediate feature spaces, and guides the training processes to optimize object class instances' distributions within the spaces. It constrains feature similarities of training sample pairs to be consistent with corresponding ground truth label similarities. With the aid of proxy feature design, structural constraint could be applied to all types of object detection network architectures. Experiment results indicate our structural constraint mechanism is able to optimize networks' intermediate features and consequently final detection results. It should be pointed out that calculation of structural constraint is done for all possible pairs of training samples, which has high GPU memory demand. We will address this issue in our future work.