From Sparse to Dense, and back to Sparse again?
IRB 4105 https://umd.zoom.us/j/7316339020
In the 1980s and 1990s, computer vision architectures were built on sparse samples of points. In the 2000s, dense models became popular for visual recognition, because heuristically defined sparse models do not cover all the important parts of an image. With deep learning and end-to-end training, however, dense models need not remain the default: sparse models may still have significant advantages, both in avoiding unnecessary computation and in being more flexible. In this talk, I will present the deep point cloud convolutional backbones that we have developed over the past few years, including our most recent work, PointConvFormer, which outperforms grid-based convolutional approaches. I will also present a recent work, AutoFocusFormer, which uses point cloud transformer backbones and decoders for 2D image recognition, with a novel module that enables adaptive downsampling to be learned end to end. Results show significant improvements in both 3D and 2D recognition tasks. In particular, on the Cityscapes benchmark, a model built with our approach with only 42 million parameters outperforms the state-of-the-art Mask2Former Large model with 197 million parameters.