PhD Defense: Adapting Next-Token Prediction for Practical Domains and Training Regimes
IRB-4105 https://umd.zoom.us/my/alexstein
Modern transformer-based machine learning has produced powerful systems, but most problems of practical interest are not natural-language problems. This dissertation studies what it takes to bring the autoregressive next-token prediction recipe out of its native habitat and into domains and training regimes that look different from conventional benchmark NLP. The first half concerns adapting the recipe to new practical domains, primarily through how data is represented to the model. We introduce STEP, which shows that a standard decoder-only transformer with a causal language modeling loss outperforms specialized architectures on tabular event prediction, given column-aware tokenization and simple training-time augmentations. The second half adapts the training regime itself. GATES is a self-distillation framework that derives supervision online from consensus among privileged-context rollouts, improving language models on math reasoning benchmarks without external supervision. Across these contributions, the recurring observation is that the autoregressive recipe is more general than its origins suggest, but only when it is carefully adapted with the right representations and explicit training signals.