View Author Feedback For Paper

Paper ID: 2709
Title: Soft-NMS -- Improving Object Detection With One Line of Code

Rebuttal

We thank the reviewers for their constructive comments. We are glad the reviewers agree that the paper provides a "simple method to improve non-maximum suppression", is "well written" and "widely applicable", and that "improvements are fairly consistent". We address the concerns of the reviewers in this rebuttal.

[R3] Small increase in mAP, equivalent to doing some parameter tuning:
Soft-NMS obtains gains of 1.7% and 1.3% mAP on PASCAL VOC and MS-COCO for state-of-the-art detection pipelines. We do not consider this to be insignificant. To address the concern about tuning, we re-trained Deformable R-FCN (larger image size, more anchors, ResNet-101-v1) and added multi-scale testing (MST). With Soft-NMS, we improve the state of the art in object detection from 39.8% to 40.9% mAP with a single model. This shows that Soft-NMS improves results even after significant tuning. Please note that Soft-NMS is applied after training, so any improvement from Soft-NMS comes on top of tuning (a short sketch of the rescoring step follows the tables below).

AP@[0.5:0.95] (%):
D-RFCN: 37.4
D-RFCN + SNMS: 38.4
D-RFCN + MST: 39.8
D-RFCN + MST + SNMS: 40.9

Recall@[0.5:0.95] (%):
D-RFCN + MST: 52.9 (top 100 detections), 50.5 (top 10)
D-RFCN + MST + SNMS: 60.4 (top 100 detections), 54.7 (top 10)
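
For clarity, a minimal NumPy sketch of the Gaussian Soft-NMS rescoring step is given below (function and parameter names are illustrative, not taken from our implementation). It shows why Soft-NMS is a drop-in replacement for NMS at test time: overlapping boxes have their scores decayed rather than being discarded, so the method stacks on top of any amount of training or tuning.

import numpy as np

def soft_nms(boxes, scores, sigma=0.5, score_thresh=0.001):
    # boxes: (N, 4) array of [x1, y1, x2, y2]; scores: (N,) detection confidences.
    # Returns indices of the kept boxes and their decayed scores.
    boxes = boxes.astype(np.float64)
    scores = scores.astype(np.float64).copy()
    idxs = np.arange(len(scores))
    keep, keep_scores = [], []
    while idxs.size > 0:
        # Select the remaining box with the highest (possibly decayed) score.
        top = int(np.argmax(scores[idxs]))
        m = idxs[top]
        keep.append(m)
        keep_scores.append(scores[m])
        idxs = np.delete(idxs, top)
        if idxs.size == 0:
            break
        # IoU of the selected box with all remaining boxes.
        x1 = np.maximum(boxes[m, 0], boxes[idxs, 0])
        y1 = np.maximum(boxes[m, 1], boxes[idxs, 1])
        x2 = np.minimum(boxes[m, 2], boxes[idxs, 2])
        y2 = np.minimum(boxes[m, 3], boxes[idxs, 3])
        inter = np.maximum(0.0, x2 - x1) * np.maximum(0.0, y2 - y1)
        area_m = (boxes[m, 2] - boxes[m, 0]) * (boxes[m, 3] - boxes[m, 1])
        area_r = (boxes[idxs, 2] - boxes[idxs, 0]) * (boxes[idxs, 3] - boxes[idxs, 1])
        iou = inter / (area_m + area_r - inter)
        # The "one line": decay overlapping scores instead of discarding the boxes.
        scores[idxs] *= np.exp(-(iou ** 2) / sigma)
        # Prune boxes whose decayed score fell below the threshold; this keeps
        # later iterations cheap.
        idxs = idxs[scores[idxs] >= score_thresh]
    return np.array(keep), np.array(keep_scores)

Traditional greedy NMS corresponds to zeroing the score of every box whose IoU with the selected box exceeds the overlap threshold; the per-iteration pruning against score_thresh is what keeps the runtime low (see the runtime discussion under experimental details below).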

[R3] Regarding Yolo-v2:
We thank R3 for taking the effort to implement Soft-NMS for YOLO. We are also happy to see that it gave a 0.5% improvement. However, please note that our paper only claims improvements for "state-of-the-art" object detectors, which are proposal based, as mentioned in the introduction and background sections. Proposal-based detectors generate cluttered bounding boxes, which improves recall, whereas SSD/YOLO generate a fixed set of sparse and (relatively) less well-localized detections, and hence have less potential to improve precision during NMS. SSD with ResNet-101 obtains 31.2%, YOLO-v2 (as mentioned in the review) obtains 25.4%, while our baseline and Mask R-CNN, which are proposal based, obtain 37.4% (39.8% with MST) and 39.8%, respectively.

[R3] Other papers have addressed adjustments to NMS in the past:
We request the reviewer to point us to theoretical or data-driven approaches which lead to improvements comparable to Soft-NMS on standard datasets. A month after this submission, we saw a learning-based solution for NMS, "Learning Detection with Diverse Proposals" (Azadi et al., CVPR 2017), which improves AP from 15.0% to 15.5% for Faster-RCNN. This shows that improving NMS is hard. While we agree that solving NMS with a learning-based approach is very interesting, Soft-NMS should be used as the baseline instead of NMS in subsequent papers.

[R1] We thank R1 for suggesting that we measure recall. For the top 100 detections, Soft-NMS improves recall by 7.5%, and even for the top 10 detections, by 4.2%. Table 4 will be updated.

[R1] Trade-off between FP and FN, double counting:
As mentioned above, recall for NMS and Soft-NMS differs. For example, for the dog class at 0.7 IoU, if recall for NMS is 60% and recall for Soft-NMS is 75%, then computing FDR (false discovery rate) beyond 60% recall is not possible for NMS. So, recall values will be invalid for many overlap thresholds (Ot) and classes. Hence, we cannot report the mean FDR (over Ots and classes) at a fixed recall. To give an indication, we present FDR for the person class.

In the tables below, columns represent recall from 10% to 70% and rows represent Ot from 0.5 to 0.8; entries of -1 indicate that the detector does not reach that recall. Note that FDR decreases in many cases. However, at recall values close to the maximum recall, such as 70% at Ot = 0.5 or 40% at Ot = 0.8, FDR is significantly lower for Soft-NMS. Results shown are for R-FCN on coco-minival.

NMS (FDR %; rows Ot = 0.5 to 0.8, columns recall = 10% to 70%):
Ot=0.5: 6.0, 1.3, 3.0, 5.0, 7.0, 10.4, 24.6
Ot=0.6: 1.7, 3.0, 5.3, 8.3, 11.5, 17.7, -1
Ot=0.7: 4.8, 7.6, 11.8, 16.2, 22.9, -1, -1
Ot=0.8: 13.2, 21.7, 32.7, 61.9, -1, -1, -1

Soft-NMS (FDR %; same rows and columns):
Ot=0.5: 6.0, 1.4, 3.0, 4.9, 6.9, 10.2, 21.8
Ot=0.6: 1.7, 3.1, 5.1, 7.8, 11.1, 17.8, 54.6
Ot=0.7: 4.7, 7.8, 11.7, 16.0, 22.7, 58.8, -1
Ot=0.8: 13.2, 22.0, 32.4, 50.7, 93.4, -1, -1

Based on FDR, Soft-NMS does not appear to increase the double-counting problem.
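
For reference, FDR is FP / (TP + FP), i.e. one minus precision. The following sketch (function and argument names are hypothetical) shows how an entry in the tables above can be computed from the detections of one class, ranked by score and labelled true/false positive at the chosen Ot; None plays the role of the -1 entries where the detector never reaches the requested recall.

import numpy as np

def fdr_at_recall(is_tp, num_gt, target_recall):
    # is_tp: boolean array over detections sorted by descending score;
    #        True where the detection matches a ground-truth box at IoU >= Ot.
    # num_gt: number of ground-truth boxes for this class.
    is_tp = np.asarray(is_tp, dtype=bool)
    tp = np.cumsum(is_tp)            # true positives among the top-k detections
    fp = np.cumsum(~is_tp)           # false positives among the top-k detections
    recall = tp / float(num_gt)
    reached = np.nonzero(recall >= target_recall)[0]
    if reached.size == 0:
        return None                  # recall never reached (the -1 entries above)
    k = reached[0]
    return fp[k] / float(tp[k] + fp[k])   # FDR = 1 - precision at that rank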

[R1] "Bells-and-whistles" :
We agree that most "bells and whistles" do not require retraining and will revise the first paragraph accordingly. However, Soft-NMS is different from these methods as it replaces an existing algorithm (NMS) rather than adding another post-processing step. Steps like iterative bounding-box regression (IBBR) and MST also increase the inference time of Faster-RCNN. OHEM, multi-scale training, and RoIAlign (from Mask R-CNN), which improve mAP by ~1% each, require re-training. Our implementation of IBBR improved our 39.8% baseline by only 0.1%.

[R1] Why [A, B] cannot be applied to generic object detection:

Thanks for mentioning A. We will expand the discussion of APC/MAPC in the related work section. In multi-class detection, such algorithms generate a fixed number of detections without assigning scores. This makes it very hard to obtain high precision, because less confident detections are assigned the same score.

Regarding B, the authors report that mAP drops by 6% at 0.5 overlap and mention that greedy NMS is hard to beat (Sec. 4.2 in B). Please note that Fig. 9 in B reports recall, not precision. Therefore, we are not sure whether minor changes would make it suitable for generic object detection. MAPC should suffer from the same problem and does not report mAP. Further, APC (which MAPC extends) takes 1 second per image, making it impractical for many applications.

[R1] Experimental details:
- L395: For NMS, we set the threshold to -inf. For Soft-NMS, it is 0.01 in the paper.

- L469-73 (runtime):
At a threshold of 10^-4, using 4 CPU threads, Soft-NMS takes 0.01s per image for 80 classes. This can be made faster on a GPU. After each iteration, detections which fall below the threshold are discarded, which reduces computation time. At 10^-2, the runtime is 0.005 seconds on a single core.

- L427-28: We use the detectors provided by the authors, available online. The Faster-RCNN detector was trained on VOC 2007 trainval, while R-FCN was trained on VOC 2007+2012 trainval.

[R1] Minor details:
Thanks.

[R2] Citation:
We will update the equation and references. We agree with all of R2's points and have nothing further to add.