Fig 1.
(a) Causal intervention in a sample caused by long-range dependencies. Regions of the same class are shown in the same color; A ↔ D and A1 ↔ D1 denote false causal relations between regions of different classes. (b) Modeling of long-range dependencies between regional samples. Long-range dependencies between regions of the same class in different samples (A ↔ A1, D ↔ D1) are established, and the false causal relations between different classes are cut off. Each class captures long-range dependencies through pixel dependencies in a smaller area.
Fig 2.
We propose a two-step network structure.
The CAMs in the figure are visualizations over all classes. The generated by RSA is restored to the size of C by the Merge module after being corrected by the cross-image comparison module. C, generated by the two-step structure, provides equivariant constraint supervision for the merged version of
.
Fig 3.
The two modules applied in step 1 of Fig 2 are described in detail.
Reshape denotes a tensor size transformation, and ⊗ denotes the matrix dot product operation. RSA module: The computed similarity matrix of size MN × MN is normalized, and the refined CAM is obtained by weighting the original CAM
. CFC module: The foreground vector is formed by matrix multiplication of
and
.
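The RSA step described in the caption above (build an MN × MN similarity matrix, normalize it, then weight the original CAM) can be sketched as follows. This is a minimal, dependency-free illustration and not the authors' implementation: the use of dot-product similarity, the row-wise softmax normalization, and the names `softmax` and `refine_cam` are all assumptions made for the example.

```python
import math


def softmax(row):
    """Normalize one row of similarities into weights that sum to 1."""
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def refine_cam(features, cam):
    """Refine a single-class CAM over MN positions.

    features: list of MN feature vectors, one per spatial position
              (hypothetical input; the caption does not name it).
    cam:      list of MN activation scores for one class.
    Returns the normalized MN x MN weight matrix and the refined CAM.
    """
    n = len(features)
    # Pairwise similarity between all positions (assumed: dot product).
    sim = [
        [sum(a * b for a, b in zip(features[i], features[j])) for j in range(n)]
        for i in range(n)
    ]
    # Row-wise normalization (assumed: softmax).
    weights = [softmax(row) for row in sim]
    # Each refined score is a similarity-weighted sum of the original CAM.
    refined = [sum(w * c for w, c in zip(row, cam)) for row in weights]
    return weights, refined


# A 2 x 2 region flattened to MN = 4 positions.
features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
cam = [1.0, 0.2, 0.0, 0.1]
weights, refined = refine_cam(features, cam)
```

In this toy input, position 1 has features close to position 0, so its refined score is pulled toward the strong activation at position 0, which is the kind of within-region propagation the caption describes.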
Table 1.
Comparative experiments with different values of λ and β. The quality of the CAMs is assessed on the PASCAL VOC 2012 training set in mIoU (%).
Table 2.
Comparison with PuzzleCam: two different backbones are evaluated on the PASCAL VOC 2012 training set in mIoU (%).
Fig 4.
Pseudo masks on PASCAL VOC 2012 train dataset.
From top to bottom: original images, ground truth, the prediction results of PuzzleCam, and the prediction results of our method.
Fig 5.
Pseudo masks on PASCAL VOC 2012 val dataset.
From top to bottom: original images, ground truth, the prediction results of PuzzleCam, and the prediction results of our method.
Table 3.
Performance on the PASCAL VOC 2012 validation set, compared to weakly supervised approaches based only on image-level labels.
Table 4.
Performance on the PASCAL VOC 2012 test set, compared to weakly supervised approaches based only on image-level labels.
Table 5.
Comparison of our proposed method and existing state-of-the-art methods on the PASCAL VOC 2012 val and test sets.
I, image-level labels; S, saliency label.
Table 6.
Comparison of our proposed method and existing methods on the MS COCO 2014 val set.
Table 7.
The quality of the pseudo masks is evaluated on the PASCAL VOC 2012 training set. ResNet50 was used as the backbone.
Fig 6.
ResNet50 was used as the backbone network for analysis. The CAMs in the figure are generated on the PASCAL VOC 2012 training set.
Fig 7.
Ablation over the number of iterations on the PASCAL VOC 2012 training set, showing the mIoU, mFDR, and mFNR at different iterations; the backbone network is ResNest-101: (a) baseline + CFC; (b) baseline + RSA; (c) baseline + CFC + RSA; (d) the loss at different numbers of iterations for the final version (c).
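For readers unfamiliar with the three metrics named in this caption, the sketch below shows one common way to compute them from flattened label maps. The per-class definitions used here (IoU = TP / (TP + FP + FN), FDR = FP / (TP + FP), FNR = FN / (TP + FN), each averaged over the classes present) are the standard ones and are assumed rather than taken from the paper; the function names are hypothetical.

```python
def per_class_counts(pred, gt, num_classes):
    """Count true positives, false positives and false negatives per class."""
    tp = [0] * num_classes
    fp = [0] * num_classes
    fn = [0] * num_classes
    for p, g in zip(pred, gt):
        if p == g:
            tp[g] += 1
        else:
            fp[p] += 1  # pixel wrongly assigned to class p
            fn[g] += 1  # pixel of class g that was missed
    return tp, fp, fn


def mean_metrics(pred, gt, num_classes):
    """Return (mIoU, mFDR, mFNR) averaged over classes that occur at all."""
    tp, fp, fn = per_class_counts(pred, gt, num_classes)
    ious, fdrs, fnrs = [], [], []
    for c in range(num_classes):
        if tp[c] + fp[c] + fn[c] == 0:
            continue  # class appears in neither prediction nor ground truth
        ious.append(tp[c] / (tp[c] + fp[c] + fn[c]))
        fdrs.append(fp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0)
        fnrs.append(fn[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0)
    count = len(ious)
    return sum(ious) / count, sum(fdrs) / count, sum(fnrs) / count


# Six pixels, two classes: each class has 2 TP, 1 FP and 1 FN.
pred = [0, 0, 1, 1, 1, 0]
gt = [0, 0, 1, 1, 0, 1]
miou, mfdr, mfnr = mean_metrics(pred, gt, num_classes=2)
```

A high mFDR indicates over-activation (too many pixels assigned to a class), while a high mFNR indicates under-activation, which is why the two are often reported alongside mIoU.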
Fig 8.
Column 1 is the original image, and columns 2, 3, 4, and 5 show the CAMs generated at different training epochs.