PhD Proposal: Improving Efficiency of Transformer Foundation Models
Transformers are the foundational deep learning architecture behind many recent successes in diverse fields such as natural language processing, speech, computer vision, and biology. We examine key computational inefficiencies within the Transformer architecture and explore potential remedies. Specifically, we address three primary challenges. First, the standard attention mechanism scales quadratically with input sequence length because its softmax-based exponential kernel requires materializing all pairwise query-key scores. We discuss how approximating this kernel with a linear feature-map estimator can reduce the complexity to linear time. Second, the all-to-all attention computation, while necessary during training, is largely redundant during inference because most attention weights are negligible. We review successful hierarchical strategies that combine coarse-grained token compression with fine-grained token selection to preserve both global context and local precision efficiently. Finally, the feed-forward networks (FFNs) in deep neural architectures, including Transformers, can typically be pruned to highly sparse weights with little loss in accuracy, a phenomenon captured by the "Lottery Ticket Hypothesis." We explore how efficient sparse matrix multiplication accelerators can exploit this sparsity to speed up both inference and fine-tuning.
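To make the first point concrete, the sketch below contrasts standard softmax attention with a kernelized variant in plain NumPy. The feature map phi(x) = elu(x) + 1 is one common choice from the linear-attention literature and is only an assumption here; the proposal does not commit to a particular kernel. The key step is reassociating the product so that a small (d x d) summary of the keys and values replaces the (n x n) score matrix.

```python
import numpy as np

def softmax_attention(Q, K, V):
    """Standard attention: the explicit (n x n) score matrix makes this O(n^2 * d)."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def feature_map(x):
    """One common positive feature map, phi(x) = elu(x) + 1 (an illustrative assumption)."""
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V):
    """Kernelized attention: phi(Q) (phi(K)^T V) avoids the n x n matrix, giving O(n * d^2)."""
    Qf, Kf = feature_map(Q), feature_map(K)
    kv = Kf.T @ V                       # (d x d) summary, size independent of n
    normalizer = Qf @ Kf.sum(axis=0)    # per-query normalization term
    return (Qf @ kv) / normalizer[:, None]

n, d = 1024, 64
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((n, d)) * 0.1 for _ in range(3))
out_exact = softmax_attention(Q, K, V)
out_linear = linear_attention(Q, K, V)   # approximates the softmax output in linear time
```

Because the summary statistics have size independent of the sequence length, the cost grows linearly with n, at the price of approximating the exact softmax weights.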
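As an illustration of the second direction, the following sketch implements a two-stage attention step for single-token decoding: keys are mean-pooled into coarse blocks, the query scores the pooled blocks, and exact attention is then computed only over tokens in the highest-scoring blocks. The block size, the number of selected blocks, and mean pooling as the compression operator are illustrative assumptions, not choices made in the proposal.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def hierarchical_attention(q, K, V, block_size=64, top_k=4):
    """Coarse stage: score mean-pooled key blocks against the query.
    Fine stage: exact attention restricted to tokens in the top_k blocks.
    block_size, top_k, and mean pooling are illustrative assumptions."""
    n, d = K.shape
    n_blocks = n // block_size
    # Coarse stage: compress each block of keys into a single pooled key.
    K_blocks = K[:n_blocks * block_size].reshape(n_blocks, block_size, d)
    pooled = K_blocks.mean(axis=1)                     # (n_blocks, d)
    block_scores = pooled @ q / np.sqrt(d)
    chosen = np.argsort(block_scores)[-top_k:]         # indices of the best blocks
    # Fine stage: attend only over the tokens inside the selected blocks.
    idx = np.concatenate([np.arange(b * block_size, (b + 1) * block_size) for b in chosen])
    scores = K[idx] @ q / np.sqrt(d)
    return softmax(scores) @ V[idx]

rng = np.random.default_rng(0)
n, d = 4096, 64
K, V = rng.standard_normal((n, d)), rng.standard_normal((n, d))
q = rng.standard_normal(d)
out = hierarchical_attention(q, K, V)   # touches top_k * block_size tokens instead of all n
```

The coarse pass keeps a compressed view of the whole context, while the fine pass spends exact computation only where the query attends strongly, which is the intuition behind the compression-plus-selection strategies discussed above.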
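Finally, a minimal sketch of the third idea, under the assumption of simple magnitude pruning at 90% sparsity on a single FFN up-projection: storing the pruned weight matrix in compressed sparse row (CSR) form means the multiplication only touches the surviving weights, which is the kind of workload a sparse matrix multiplication accelerator is built to exploit. SciPy on a CPU stands in here for such an accelerator purely for illustration.

```python
import numpy as np
from scipy import sparse

def prune_by_magnitude(W, sparsity=0.9):
    """Zero out the smallest-magnitude weights; a simple stand-in for the
    sparsity pattern the proposal aims to exploit (illustrative assumption)."""
    threshold = np.quantile(np.abs(W), sparsity)
    return np.where(np.abs(W) >= threshold, W, 0.0)

rng = np.random.default_rng(0)
d_model, d_ff, batch = 512, 2048, 32
W1 = prune_by_magnitude(rng.standard_normal((d_ff, d_model)))
x = rng.standard_normal((d_model, batch))

# Dense path vs. sparse path: the CSR format stores and multiplies only the
# surviving ~10% of weights, skipping the zeros entirely.
dense_out = np.maximum(W1 @ x, 0.0)        # FFN up-projection followed by ReLU
W1_csr = sparse.csr_matrix(W1)
sparse_out = np.maximum(W1_csr @ x, 0.0)
assert np.allclose(dense_out, sparse_out)  # same result, fewer multiply-accumulates
```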