Rethinking AI Systems Through Efficient Model Communication
For decades, AI models interacted with humans directly through human-readable inputs and outputs (e.g., text, images). Today, they are deployed far more ubiquitously and often operate inside complex software systems, interacting with other models or with software rather than directly with humans. This paradigm shift raises a natural question: can models interact with other models and with software using model-native languages?
In this talk, I will present my work on facilitating model-native interactions among models and between models and software. To enable more efficient and practical model interactions using model-native states (i.e., the KV cache) in LLM systems, I built CacheGen, the first system to share the KV cache across different user queries by compressing it into compact bitstreams, and DroidSpeak, the first system to share the KV cache across different models. This research has had real-world impact through the open-source project LMCache, which is widely used in production by top-tier AI companies. Together, these works make LLM inference 5–10x faster than state-of-the-art inference engines. To enable more accurate model-to-software communication, my work ChameleonAPI encodes the structure of application code into model-native loss functions, allowing models to be retrained for up to 43% higher application-level accuracy in vision applications.