Speaker
Tianle Cai
Affiliation
Princeton University
Details
Event Description
Recent breakthroughs in AI have been predominantly driven by scaling neural networks to unprecedented sizes. However, as these models grow larger, efficient inference becomes crucial for democratizing AI access. This talk explores systematic approaches to co-designing the model inference stack for enhanced efficiency. I will present our recent advances in three key areas: (1) accelerating language model inference through multi-head decoding architectures, (2) optimizing vision model performance via parallel computation and low-precision quantization techniques, and (3) developing coordinated multi-model inference systems.
Advisers: Jason Lee and Kai Li
Zoom Meeting: https://princeton.zoom.us/my/tianle