Rethinking AI Model Storage with Model-aware and Tensor-centric Compression

Talk
Yue Cheng
Time: 04.13.2026, 11:00 to 12:00

Large-scale model hubs such as Hugging Face and ModelScope have become core infrastructure for modern AI. They host millions of pretrained and fine-tuned models, especially large language models, and support a broad ecosystem of downstream applications across industry and academia. But this infrastructure is under growing pressure: by late 2025, Hugging Face alone hosted more than 77 PB of model artifacts, and its storage footprint continues to grow exponentially, posing mounting cost and sustainability challenges.

In this talk, I will present a new perspective on AI model hubs: rather than storing model artifacts simply as collections of independent blob files, I will show how to uncover and exploit hidden structure across models at scale. At one level, models are related through fine-tuning and evolution. At another, models are composed of tensors, and tensors across different models exhibit subtle similarities. Together, these hidden patterns create new opportunities for rethinking how model artifacts are stored and managed at hub scale. I will first introduce ZipLLM, which redesigns storage reduction around model lineage. I will then show why model-level lineage alone is not enough: substantial reduction opportunities remain hidden at tensor granularity. To address this, I will present TensorDex, a tensor-centric model compression system that achieves significant lossless storage reduction for large-scale model hubs. Finally, I will argue that these results open up a new AI+data systems direction: tensor-centric AI infrastructure. I will conclude by outlining this vision and discussing future research directions.
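To make the tensor-level opportunity concrete, the sketch below shows one simple form of it: content-addressed deduplication, where bit-identical tensors shared between a base model and its fine-tuned derivative are stored only once. This is a minimal illustration of the general idea, not a description of how ZipLLM or TensorDex actually work; the function names and the dict-based "model" representation are assumptions for the example.

```python
import hashlib
import numpy as np


def tensor_digest(t: np.ndarray) -> str:
    # Hash dtype, shape, and raw bytes so only bit-identical tensors collide.
    h = hashlib.sha256()
    h.update(str(t.dtype).encode())
    h.update(str(t.shape).encode())
    h.update(np.ascontiguousarray(t).tobytes())
    return h.hexdigest()


def dedup_store(models: dict[str, dict[str, np.ndarray]]):
    """Store each distinct tensor once; each model keeps a manifest of digests.

    (Hypothetical helper for illustration -- not an API of any real system.)
    """
    blobs = {}      # digest -> tensor bytes, stored exactly once
    manifests = {}  # model name -> {tensor name -> digest}
    for model_name, tensors in models.items():
        manifest = {}
        for name, t in tensors.items():
            d = tensor_digest(t)
            blobs.setdefault(d, np.ascontiguousarray(t).tobytes())
            manifest[name] = d
        manifests[model_name] = manifest
    return blobs, manifests
```

For example, if a fine-tuned model reuses its base model's embedding tensor unchanged, the two models share one blob for it, so four logical tensors need only three stored blobs. Real systems go further than exact matches, which is why lineage-aware and tensor-similarity-aware techniques unlock reductions this sketch cannot.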