PhD Defense: IMPROVED TRAINING OF DEEP NETWORKS FOR COMPUTER VISION

Abhay Yadav
Time: 12.10.2021, 10:00 to 12:00
Location: IRB 4105

Deep neural networks have become the state-of-the-art tool for solving many computer vision problems. However, these algorithms face significant computational and optimization challenges. For example, a) training deep networks is not only computationally intensive but also requires a lot of manual effort and parameter tuning, and b) for some particular use cases, such as adversarial deep networks, it is challenging even to optimize the networks to good or stable performance. In this dissertation, we address these challenges by targeting the following closely related problems.

First, we focus on the problem of automating the step-size and decay parameters in the training of deep networks. Classical stochastic gradient methods for optimization rely on noisy gradient approximations that become progressively less accurate as the iterates approach a solution. The large noise and small signal in the resulting gradients make it difficult to use them for adaptive step-size selection. We propose alternative “big batch” SGD schemes that adaptively grow the batch size over time to maintain a nearly constant signal-to-noise ratio in the gradient approximation. The resulting high-fidelity gradients enable automated learning-rate selection and do not require step-size decay. In addition, big batches can be parallelized across many machines, reducing training time and utilizing resources efficiently.
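As a rough, self-contained illustration of this adaptive batch-size idea (not the exact algorithm, test statistic, or hyperparameters from the dissertation), the sketch below grows the batch on a simple least-squares problem whenever the estimated gradient noise dominates the gradient signal; the threshold theta, the doubling rule, and the fixed step size are illustrative assumptions.

# Illustrative sketch of a "big batch" SGD loop on a least-squares problem.
# The noise/signal test and the batch-doubling rule are simplified assumptions,
# not the exact criteria from the dissertation.
import numpy as np

rng = np.random.default_rng(0)
n, d = 10_000, 20
A = rng.normal(size=(n, d))
x_true = rng.normal(size=d)
b = A @ x_true + 0.1 * rng.normal(size=n)

x = np.zeros(d)
batch = 32    # initial batch size (assumed)
lr = 0.1      # fixed step size; big batches are meant to avoid step-size decay
theta = 1.0   # noise-to-signal threshold (assumed)

for _ in range(300):
    idx = rng.choice(n, size=min(batch, n), replace=False)
    residual = A[idx] @ x - b[idx]
    grads = A[idx] * residual[:, None]          # one gradient per sample
    g = grads.mean(axis=0)                      # batch gradient estimate
    noise = grads.var(axis=0).sum() / len(idx)  # variance of the batch mean
    signal = g @ g
    if noise > theta * signal and batch < n:
        batch *= 2                              # grow the batch to restore the SNR
    x -= lr * g

print("relative error:", np.linalg.norm(x - x_true) / np.linalg.norm(x_true))

The dissertation pairs these high-fidelity gradients with automated learning-rate selection, which this sketch omits for brevity.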
Second, in a similar pursuit of automated and efficient training of deep networks, we explore the use of L-BFGS for large-scale machine learning applications. L-BFGS, a very successful second-order optimization method for convex problems, is typically not considered an algorithm of choice for these applications. Recent work has shown that a stochastic version of L-BFGS can perform comparably to current state-of-the-art solvers such as SGD or Adam on classification tasks. However, that work is limited to deep networks that do not use batch normalization. Since batch normalization is a de facto standard and is essential for good performance in practical, industrial-strength deep networks, this renders the approach less practical. To this end, we propose a new variant of stochastic L-BFGS that works for deep networks that use batch normalization. We demonstrate the effectiveness of the proposed method by providing both a convergence analysis and empirical results on standard deep networks for image classification. The proposed method outperforms Adam and existing L-BFGS approaches by a large margin (10% in some cases) and is comparable to carefully tuned SGD in some cases. Although we do not surpass the generalization performance of carefully tuned SGD, this work marks another significant step towards considering L-BFGS an effective algorithm for large-scale machine learning.

Third, we propose a stable training method for adversarial deep networks. Adversarial neural networks solve many important problems in data science, but they are notoriously difficult to train. These difficulties come from the fact that optimal weights for adversarial nets correspond to saddle points, not minimizers, of the loss function. The alternating stochastic gradient methods typically used for such problems do not reliably converge to saddle points, and when convergence does happen it is often highly sensitive to the learning rate. We propose a simple modification of stochastic gradient descent, based on a prediction step, that stabilizes adversarial networks. We show, both in theory and in practice, that the proposed method reliably converges to saddle points and is stable over a wider range of training parameters than non-prediction methods. This makes adversarial networks less likely to “collapse” and enables faster training with larger learning rates.

Finally, we propose to compute the Neural Tangent Kernel (NTK) efficiently by establishing, both theoretically and empirically, that for most practical use cases the NTK can be replaced by the well-known Laplace kernel, which is computationally much cheaper. The NTK is interesting and important because it approximates reasonably well the solution of a massively overparameterized neural network trained with SGD. Another advantage of this finding is that one can gain more insight into infinite-width neural networks by analyzing the Laplace kernel, which has a simple closed form (which the NTK does not).
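To make the last point concrete: the Laplace kernel has the simple closed form k(x, z) = exp(-||x - z|| / sigma), so it can serve as a computationally cheap stand-in wherever an NTK kernel matrix would otherwise be formed, for example in kernel ridge regression. The sketch below is only illustrative; the synthetic data, the bandwidth sigma, and the regularizer lam are assumptions, not values or experiments from the thesis.

# Minimal sketch: kernel ridge regression with the Laplace kernel used as a
# cheap stand-in for the NTK (illustrative data and hyperparameters only).
import numpy as np

def laplace_kernel(X, Z, sigma=1.0):
    # k(x, z) = exp(-||x - z|| / sigma); a simple closed form, unlike the NTK.
    dists = np.linalg.norm(X[:, None, :] - Z[None, :, :], axis=-1)
    return np.exp(-dists / sigma)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
X /= np.linalg.norm(X, axis=1, keepdims=True)    # unit-sphere inputs (a common setting for NTK comparisons)
y = np.sin(3.0 * X[:, 0]) + 0.1 * rng.normal(size=200)

X_test = rng.normal(size=(50, 10))
X_test /= np.linalg.norm(X_test, axis=1, keepdims=True)

lam = 1e-3                                       # ridge regularizer (assumed)
K = laplace_kernel(X, X)
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)
y_pred = laplace_kernel(X_test, X) @ alpha
print(y_pred[:5])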

Examining Committee:

Chair: Dr. David W. Jacobs
Dean's Representative: Dr. Behtash Babadi
Members: Dr. Abhinav Shrivastava, Dr. Ramani Duraiswami, Dr. Tom Goldstein