PhD Proposal: Improved Training of Deep Networks for Computer Vision
Deep neural networks have become the state-of-the-art tool for solving many computer vision problems. However, these algorithms face significant computational and optimization challenges. For example, a) training deep networks is not only computationally intensive but also requires considerable manual effort, and b) for some particular use cases, such as adversarial and binary deep networks, it is difficult even to optimize the network well enough to achieve good performance. In this proposal, we address these challenges by targeting the following closely related problems.
First, we focus on the problem of automating the selection of step-size and decay parameters in the training of deep networks. Classical stochastic gradient methods for optimization rely on noisy gradient approximations that become progressively less accurate as iterates approach a solution. The large noise and small signal in the resulting gradients make it difficult to use them for adaptive step-size selection. We propose alternative “big batch” SGD schemes that adaptively grow the batch size over time to maintain a nearly constant signal-to-noise ratio in the gradient approximation. The resulting high-fidelity gradients enable automated learning rate selection and do not require step-size decay. In addition, big batches can be parallelized across many machines, reducing training time and using resources efficiently.
Second, we propose a stable training method for adversarial deep networks. Adversarial neural networks solve many important problems in data science, but are notoriously difficult to train. These difficulties come from the fact that optimal weights for adversarial nets correspond to saddle points, and not minimizers, of the loss function. The alternating stochastic gradient methods typically used for such problems do not reliably converge to saddle points, and when convergence does happen it is often highly sensitive to learning rates. We propose a simple modification of stochastic gradient descent, based on a prediction step, that stabilizes adversarial networks. We show, both in theory and in practice, that the proposed method reliably converges to saddle points and is stable over a wider range of training parameters than the same method without prediction. This makes adversarial networks less likely to “collapse” and enables faster training with larger learning rates.
Finally, as future work, we propose a new method to binarize both weights and activations in deep networks at run-time. Binarizing both weights and activations can drastically reduce memory size, power consumption, and inference time by replacing computationally intensive convolutions with bitwise operations. This makes binarized networks well suited to deployment on embedded devices, mobile phones, and wearable devices. There has been considerable work in this direction. However, most existing approaches either show a large performance degradation or have to increase the width of the network by a large margin. We believe that this degradation is due to inefficient optimization methods that require replacing the binarization function with a smooth approximation during the backward pass of gradient descent. Motivated by this observation, we propose a principled optimization formulation that takes into account the difference in the model between the forward and backward passes.
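The forward/backward mismatch mentioned in the binarization paragraph can be sketched in a few lines of Python. This illustrates the common practice that the paragraph identifies as a source of degradation, not the proposed formulation. The hard-tanh surrogate, the toy linear model, and all function names are assumptions made for this example.

```python
def binarize(w):
    """Forward pass: map each real-valued weight to -1.0 or +1.0."""
    return [1.0 if wi >= 0 else -1.0 for wi in w]


def surrogate_grad(w):
    """Backward pass: derivative of hard tanh, clip(w, -1, 1), used in place
    of the derivative of sign, which is zero almost everywhere.
    (Assumed surrogate; other smooth approximations are possible.)"""
    return [1.0 if abs(wi) <= 1.0 else 0.0 for wi in w]


def forward(w, x):
    """Prediction of a toy linear model that uses the binarized weights."""
    return sum(bi * xi for bi, xi in zip(binarize(w), x))


def backward(w, x, target):
    """Gradient of 0.5 * (forward(w, x) - target)**2 with respect to the
    real-valued weights, computed through the surrogate instead of sign."""
    err = forward(w, x) - target
    return [err * xi * si for xi, si in zip(x, surrogate_grad(w))]


def sgd_step(w, x, target, lr=0.1):
    """One gradient step on the real-valued weights kept during training."""
    g = backward(w, x, target)
    return [wi - lr * gi for wi, gi in zip(w, g)]


# Hypothetical data: three real-valued weights, one training example.
w = [0.3, -0.2, 1.5]
x = [1.0, 2.0, -1.0]
target = 2.0

print("prediction before step:", forward(w, x))
print("surrogate gradient:    ", backward(w, x, target))
w = sgd_step(w, x, target)
print("prediction after step: ", forward(w, x))
```

The model evaluated in the forward pass (a step function) is not the model differentiated in the backward pass (a clipped linear function). In this sketch, a weight with magnitude above 1 receives zero gradient regardless of how much it contributes to the error.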
Chair: Dr. David Jacobs
Dept. rep: Dr. Thomas Goldstein
Members: Dr. Rama Chellappa