PhD Defense: Object Detection and Instance Segmentation for Real-world Applications

Shiyi Lan
12.07.2021 10:00 to 12:00

IRB 4105

Object detection and instance segmentation are fundamental steps in computer vision that are required by many downstream tasks, e.g. scene parsing, video understanding, 3D reconstruction, and visual navigation. On the one hand, real-world applications are usually deployed on hardware with small memory and limited computation. On the other hand, obtaining fully annotated data is very expensive. Therefore, my PhD research mainly focuses on solving these two issues. First, designing efficient and high-speed models for object detection and instance segmentation. Second, designing models that learn instance segmentation and object detection from additional data with no labels or only weak labels.

To improve image-based object detection, we propose SaccadeNet, an efficient and high-performance anchor-free one-stage object detector. We combine the merits of both two-stage and one-stage object detectors and make the detector more accurate with little computation overhead.

To improve point-cloud-based object detection, we propose three approaches that improve efficiency and accuracy. First, we introduce Geo-CNN, which improves point cloud modeling by introducing a CNN-like operator for point clouds that emphasizes geometry modeling. Second, we introduce InfoFocus, which improves the accuracy of 3D object detection with little overhead by forcing the network to attend to the most informative parts of the point cloud. Third, we introduce M3DETR, which models the point cloud by using transformers to fuse multi-representation features efficiently.

To improve image-based instance segmentation, we propose FastMask, which generates instance segmentation candidates using a feature pyramid, as opposed to image pyramids, together with an attention mechanism. The feature pyramid module helps the network generate multi-scale instance segmentation candidates without an image pyramid, which makes FastMask very fast. The attention learned using box labels helps the network generate more accurate masks by reducing noise.

On the other hand, detection labels are more accessible than instance segmentation labels. To enable the learning of instance segmentation on more datasets with only box-level labels, we propose DiscoBox, which leverages box-level annotations for multi-task learning of instance segmentation and semantic correspondence. We propose a structured teacher to generate high-quality instance segmentation by leveraging intra-image and inter-image potentials. DiscoBox achieves 89% of the performance of its fully supervised counterpart.

Our future work includes continual learning, exploring transformers, and fusing multi-modality information, e.g. images, point clouds, and language. We believe these techniques will further improve the efficiency and generalization of object detection and instance segmentation.

Examining Committee:

Chair: Dr. Larry S. Davis
Co-Chair: Dr. Abhinav Shrivastava
Dean's Representative: Dr. Behtash Babadi
Members: Dr. Dinesh Manocha, Dr. Matthias Zwicker, Dr. Ming Lin, Dr. Tom Goldstein