PhD Proposal: Navigating Neural Landscapes at Large Step Sizes and Scales
Modern large-scale neural networks are trained at large step sizes, a regime outside the scope of classical optimization theory, and at scales where hyperparameters cannot be tuned by exhaustive search. We develop a dynamical understanding of neural network training in the large-step-size regime, examine these dynamics at the scale of Large Language Models (LLMs), and study how hyperparameters can be transferred from small to large models.
We begin by studying the dynamics of the loss landscape through its curvature, measured by the largest eigenvalue of the Hessian (sharpness), which exhibits universal patterns throughout training, observed across architectures, optimizers, and scales. We show that a simple two-layer linear network trained on a single example captures all of these phenomena, and through fixed-point analysis of this model, we uncover the underlying mechanisms behind these universal trends. Building on this analysis, we examine learning rate warmup and show that its primary effect is to gradually reduce sharpness by raising the learning rate slowly enough to avoid large instabilities. We characterize distinct warmup regimes governed by the underlying sharpness dynamics, clarify the roles of warmup duration and peak learning rate, and propose improvements to existing warmup schedules.
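To make the objects in this paragraph concrete, the following is a minimal sketch of a two-layer linear network trained on a single example, with sharpness tracked along the gradient descent trajectory. The specific choices here are illustrative assumptions rather than the proposal's exact setup: the scalar input is fixed to 1, the output is scaled by 1/sqrt(n), the warmup is linear, and the names `loss_grad_hess`, `sharpness`, and `train` are hypothetical.

```python
import numpy as np

def loss_grad_hess(u, v, y):
    """Loss, gradient, and Hessian of 0.5 * (u.v / sqrt(n) - y)^2 w.r.t. (u, v)."""
    n = u.size
    r = u @ v / np.sqrt(n) - y                      # residual on the single example
    J = np.concatenate([v, u]) / np.sqrt(n)         # derivative of the output w.r.t. (u, v)
    I, Z = np.eye(n), np.zeros((n, n))
    H_f = np.block([[Z, I], [I, Z]]) / np.sqrt(n)   # Hessian of the network output
    return 0.5 * r**2, r * J, np.outer(J, J) + r * H_f

def sharpness(hess):
    """Largest Hessian eigenvalue (eigvalsh returns eigenvalues in ascending order)."""
    return float(np.linalg.eigvalsh(hess)[-1])

def train(eta_peak, steps, warmup_steps=0, n=4, y=1.0, seed=0):
    """Full-batch gradient descent; returns rows of (learning rate, loss, sharpness)."""
    rng = np.random.default_rng(seed)
    u, v = rng.normal(size=n), rng.normal(size=n)
    history = []
    for t in range(steps):
        # Assumed schedule: linear warmup to eta_peak, then constant.
        if warmup_steps > 0:
            eta = eta_peak * min(1.0, (t + 1) / warmup_steps)
        else:
            eta = eta_peak
        loss, grad, hess = loss_grad_hess(u, v, y)
        history.append((eta, loss, sharpness(hess)))
        u, v = u - eta * grad[:n], v - eta * grad[n:]
    return np.array(history)
```

One way to use the sketch is to call `train` with and without `warmup_steps` at the same `eta_peak` and compare the sharpness column against `2 / eta`, the classical stability threshold for gradient descent on a quadratic.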
A natural question is whether these sharpness phenomena persist at the scale of LLMs, where direct measurement of sharpness is computationally prohibitive. We propose critical sharpness, a computationally efficient curvature measure requiring roughly 5–6 forward passes, and use it to provide the first evidence of these sharpness phenomena at scale. We further introduce relative critical sharpness, which enables analysis of transitions between training stages (e.g., pre-training to mid-training) and informs data-mixing strategies.
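The abstract does not define critical sharpness, so the sketch below should not be read as the proposed measure. It only illustrates, under stated assumptions, how a curvature estimate can be obtained from forward passes alone: search for the smallest step size `eta_c` along an update direction at which the loss stops decreasing, and report `2 / eta_c`. For an exactly quadratic loss with the gradient as the direction, this equals the Rayleigh quotient of the Hessian along the gradient, which is at most the sharpness. The function name, the doubling-plus-bisection search, and its evaluation budget are all hypothetical choices.

```python
import numpy as np

def critical_sharpness_proxy(loss_fn, theta, direction, eta0=1e-3, n_bisect=20):
    """Estimate 2 / eta_c using only loss evaluations (no gradients or Hessians).

    eta_c is the step size along -direction at which the loss first exceeds
    its value at theta. Assumes `direction` is a descent direction.
    """
    base = loss_fn(theta)
    lo, hi = 0.0, eta0
    # Doubling phase: grow the step until the loss rises above its starting value.
    while loss_fn(theta - hi * direction) <= base:
        lo, hi = hi, 2.0 * hi
        if hi > 1e12:
            raise ValueError("loss never increased along this direction")
    # Bisection phase: shrink the bracket [lo, hi] around the crossing point.
    for _ in range(n_bisect):
        mid = 0.5 * (lo + hi)
        if loss_fn(theta - mid * direction) <= base:
            lo = mid
        else:
            hi = mid
    return 2.0 / (0.5 * (lo + hi))

# Toy check on a quadratic loss with a known Hessian (illustrative values).
H = np.diag([1.0, 4.0, 10.0])
quadratic = lambda th: 0.5 * th @ H @ th
theta0 = np.ones(3)
grad0 = H @ theta0
estimate = critical_sharpness_proxy(quadratic, theta0, grad0)
rayleigh = (grad0 @ H @ grad0) / (grad0 @ grad0)
```

On this toy problem `estimate` and `rayleigh` agree up to the bisection tolerance. A coarser search with fewer bisection steps trades accuracy for fewer forward passes.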
Finally, we study hyperparameter transfer, the practice of using optimal hyperparameters from small models to train large ones, which becomes essential when direct tuning at scale is infeasible. We develop a framework to quantify the reliability of hyperparameter transfer and use it to identify the ingredients of network parameterizations that enable reliable learning rate transfer across model width. We propose to extend this analysis to additional hyperparameters and scaling regimes that lie beyond the reach of current theory. We further propose to develop an automated numerical procedure for discovering parameterizations that yield reliable hyperparameter transfer for arbitrary architectures, optimizers, and scaling regimes.