PhD Proposal: Architectural Approaches to Reasoning in Language Models
Scaling laws are typically fit using a family of models with a narrow range of frozen hyperparameter choices. In the first part of this work, we study scaling laws using multiple architectural shapes and hyperparameter choices, highlighting their impact on the resulting prescriptions. Our checkpoints enable more complex studies of scaling, such as analyzing the relationship between width and depth; we find that increased depth improves both final loss and downstream benchmark accuracy.
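As a minimal illustration of how such laws are fit, the sketch below fits a simple power law L(N) = a * N^(-alpha) in log-log space to a set of hypothetical (model size, validation loss) pairs; the data values are invented, not results from this work.

```python
import numpy as np

# Hypothetical validation losses observed at several model sizes N
# (parameter counts). These numbers are illustrative only.
N = np.array([1e7, 3e7, 1e8, 3e8, 1e9])
L = np.array([4.2, 3.8, 3.4, 3.1, 2.8])

# Fit L(N) = a * N^(-alpha) by linear regression in log-log space,
# a common first-pass functional form for scaling-law fits.
slope, intercept = np.polyfit(np.log(N), np.log(L), 1)
alpha, a = -slope, np.exp(intercept)

# Extrapolate the fitted law to a larger model size.
predicted = a * (1e10) ** (-alpha)
print(f"alpha = {alpha:.3f}, predicted loss at 1e10 params = {predicted:.3f}")
```

In practice the fitted exponent (and hence any compute-optimal prescription derived from it) can shift when the underlying model family uses different architectural shapes, which is the sensitivity this part of the work examines.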
Next we explore how increasing model depth via depth recurrence may improve the arithmetic reasoning capabilities of transformers. We begin by studying addition, and find that the poor performance of transformers on such arithmetic tasks stems in large part from their inability to track the exact position of each digit within a long span of digits. We address this problem by adding an embedding to each digit that encodes its position relative to the start of the number. In addition to the boost these embeddings provide on their own, we show that other architectural modifications, such as input injection and recurrent layers, improve performance even further.
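The core idea behind these digit-position embeddings can be sketched as follows: assign each digit an index counting from the start of its number, resetting at every non-digit token, so that digits of the same significance can be aligned across operands. The function name and tokenization below are hypothetical simplifications; the actual indices would be looked up in a learned embedding table and added to the token embeddings.

```python
def digit_position_ids(tokens):
    """Assign each digit token its 1-based position within its number.

    Non-digit tokens (operators, '=') get index 0 and reset the counter,
    so positions are always relative to the start of the current number.
    """
    ids = []
    pos = 0
    for t in tokens:
        if t.isdigit():
            pos += 1
            ids.append(pos)
        else:
            pos = 0
            ids.append(0)
    return ids

print(digit_position_ids(list("123+4567=")))
# → [1, 2, 3, 0, 1, 2, 3, 4, 0]
```

Because the index restarts at each number boundary, the encoding stays informative even when operands are much longer than any example seen during training.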
Finally, we extend our studies of the relationship between transformer depth and reasoning capability to general language modeling settings and develop a procedure for converting existing pretrained non-recurrent language models into depth-recurrent models. In our experiments on mathematical tasks, we observe that converting pretrained models to depth-recurrent ones results in better performance at a given compute budget than simply post-training the original non-recurrent language model.
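The essential mechanics of a depth-recurrent forward pass can be sketched with a toy weight-tied block: the same set of weights is applied repeatedly, with the original input re-injected at each step, so effective depth can be varied at inference time without adding parameters. The block below is a deliberately simplified stand-in (a single tanh layer rather than a transformer block), and all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
W = rng.normal(scale=0.1, size=(d, d))  # one weight-tied "layer"

def recurrent_forward(x, n_steps=4):
    """Apply the same block n_steps times (depth recurrence).

    The original input x is re-added at every step (input injection),
    so deeper unrollings still see the raw embedding signal.
    """
    h = x
    for _ in range(n_steps):
        h = np.tanh((h + x) @ W)
    return h

x = rng.normal(size=(5, d))
out = recurrent_forward(x, n_steps=8)
print(out.shape)  # → (5, 8)
```

In the conversion procedure studied here, the recurring block would be initialized from layers of an existing pretrained non-recurrent model rather than from scratch, which is what makes the comparison against simply post-training the original model meaningful.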