PhD Proposal: Toward Efficient Language Models: Structural Insights and Scalable Designs
The growing scale of large models has enabled impressive generalization across language, vision, and multimodal tasks, but it has also introduced significant challenges in computation, memory, and deployment. This dissertation aims to improve the efficiency of large models by leveraging structural insights and designing scalable, adaptive architectures.

We begin by analyzing the internal redundancy of Transformer models, demonstrating that a substantial portion of attention and MLP components can be removed or sparsified with minimal impact on performance. These findings motivate the development of conditional computation techniques, such as Mixture-of-Experts (MoE) and dynamic depth routing, that skip unnecessary computation based on input characteristics. To support scalable inference, we further propose capacity-aware routing and token rescheduling strategies that mitigate straggler effects and improve hardware utilization.

Our methods are validated across multiple application domains, including natural language processing, representation learning, and vision-language understanding. Together, these contributions offer a principled framework for building large models that are both efficient and deployment-ready.
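
To make the redundancy analysis concrete, the sketch below scores attention heads by a simple activation-norm proxy and masks the lowest-scoring fraction. This is an illustrative assumption only; the head-norm criterion, the prune_low_norm_heads name, and the prune_fraction parameter are not drawn from the proposal itself.

    # Redundancy-analysis sketch (assumed head-norm proxy, not the proposal's
    # exact criterion): score each attention head by the average norm of its
    # output and mask the lowest-scoring fraction to probe how much structure
    # can be removed with little performance impact.
    import torch

    def prune_low_norm_heads(head_outputs, prune_fraction=0.3):
        # head_outputs: (num_heads, num_tokens, d_head) activations from one layer
        scores = head_outputs.norm(dim=-1).mean(dim=-1)        # one score per head
        num_prune = int(prune_fraction * head_outputs.shape[0])
        pruned = scores.argsort()[:num_prune]                  # lowest-importance heads
        mask = torch.ones(head_outputs.shape[0], dtype=torch.bool)
        mask[pruned] = False
        return mask                                            # True = keep this head

    # Usage: 12 heads, 16 tokens, 64-dimensional head outputs.
    mask = prune_low_norm_heads(torch.randn(12, 16, 64))
    print(mask.sum().item(), "of 12 heads kept")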
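
The conditional computation idea can be illustrated by a minimal top-k MoE layer: a gate scores each token, and only the top_k experts it selects are evaluated, so most MLP parameters stay inactive per token. The module name SimpleMoELayer and all hyperparameters below are assumptions for exposition, not the dissertation's implementation.

    # Minimal top-k Mixture-of-Experts layer (illustrative sketch).
    # Each token is routed to its top_k experts, so only a fraction of
    # the MLP parameters are active for any given token.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SimpleMoELayer(nn.Module):
        def __init__(self, d_model=512, d_hidden=2048, num_experts=8, top_k=2):
            super().__init__()
            self.top_k = top_k
            self.gate = nn.Linear(d_model, num_experts)          # routing scores
            self.experts = nn.ModuleList([
                nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                              nn.Linear(d_hidden, d_model))
                for _ in range(num_experts)
            ])

        def forward(self, x):                                    # x: (tokens, d_model)
            scores = self.gate(x)                                # (tokens, num_experts)
            weights, indices = scores.topk(self.top_k, dim=-1)   # keep top_k experts
            weights = F.softmax(weights, dim=-1)
            out = torch.zeros_like(x)
            for e, expert in enumerate(self.experts):
                token_idx, slot = (indices == e).nonzero(as_tuple=True)
                if token_idx.numel() == 0:
                    continue                                     # expert received no tokens
                out[token_idx] += weights[token_idx, slot].unsqueeze(-1) * expert(x[token_idx])
            return out

    # Usage: route a batch of 16 tokens through the sparse layer.
    layer = SimpleMoELayer()
    print(layer(torch.randn(16, 512)).shape)                     # torch.Size([16, 512])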
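
Finally, capacity-aware routing with token rescheduling can be sketched as follows: each expert accepts at most a fixed number of tokens, and overflow tokens are rescheduled to their next-best expert that still has room, which evens out load and limits straggler experts. The capacity_factor value and the reschedule-to-next-choice policy are assumptions chosen for clarity, not necessarily the policy studied in the dissertation.

    # Capacity-aware routing sketch (illustrative policy): each expert accepts
    # at most `capacity` tokens; overflow tokens are rescheduled to their
    # next-best expert with remaining room.
    import torch

    def capacity_aware_route(scores, capacity_factor=1.25):
        num_tokens, num_experts = scores.shape
        capacity = int(capacity_factor * num_tokens / num_experts)
        ranked = scores.argsort(dim=-1, descending=True)      # expert preference per token
        load = [0] * num_experts
        assignment = [-1] * num_tokens                        # -1 means the token was dropped
        for t in range(num_tokens):
            for e in ranked[t].tolist():                      # try the best expert first
                if load[e] < capacity:
                    assignment[t] = e                         # place token, or reschedule it
                    load[e] += 1                              # to a less-preferred expert
                    break
        return assignment, load

    # Usage: 8 tokens, 4 experts, deliberately skewed scores.
    scores = torch.randn(8, 4)
    scores[:, 0] += 3.0                                       # every token prefers expert 0
    assignment, load = capacity_aware_route(scores)
    print(assignment, load)                                   # load stays within capacity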