The rapid advancements in large vision models have significantly improved performance across various tasks, such as classification, segmentation, and generation. However, their efficiency remains a major bottleneck, limiting their deployment in resource-constrained environments. Enhancing the efficiency of these models is crucial to enable broader accessibility, reduce computational costs, and promote sustainability in AI research.
I will mainly introduce two training-free methods designed for large vision models: Zero-TPrune and AT-EDM. Zero-TPrune applies token pruning to Vision Transformers by leveraging both semantic importance and token similarity. Semantic importance is determined with a graph-based Weighted PageRank (WPR) algorithm, while token similarity is computed within groups of tokens formed according to their semantic importance. AT-EDM generalizes the Weighted PageRank algorithm to cross-attention and deploys it in Diffusion Models for text-to-image generation: it prunes less important tokens based on attention maps and strategically recovers them based on similarity. AT-EDM achieves up to a 50% speed-up without any training or fine-tuning and can be combined with sampling distillation for further efficiency gains. Additionally, I will give a brief overview of LinGen, our latest work, which achieves linear computational complexity for text-to-video generation, enabling high-resolution, minute-length video generation on a single GPU at 15x lower cost.
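To give a flavor of the graph-based importance idea, the sketch below shows how a PageRank-style score could be computed over a self-attention map, treating tokens as nodes and attention weights as edge weights. This is a minimal illustration under assumed conventions (row-stochastic attention, a standard damping factor of 0.85, fixed iteration count); the function name and details are hypothetical and not the papers' actual implementation.

```python
import numpy as np

def wpr_importance(attn: np.ndarray, damping: float = 0.85, iters: int = 50) -> np.ndarray:
    """Illustrative PageRank-style token importance over an attention map.

    attn: (N, N) attention matrix with rows summing to 1 (token i attends to token j).
    Returns an (N,) importance score per token, summing to 1.
    """
    n = attn.shape[0]
    scores = np.full(n, 1.0 / n)  # uniform initialization
    for _ in range(iters):
        # Token j passes its current score to token i in proportion to attn[j, i];
        # damping mixes in a uniform "teleport" term, as in classic PageRank.
        scores = (1.0 - damping) / n + damping * (attn.T @ scores)
    return scores

# Example: scores from a random row-stochastic attention map
rng = np.random.default_rng(0)
attn = np.exp(rng.random((6, 6)))
attn /= attn.sum(axis=1, keepdims=True)
importance = wpr_importance(attn)  # tokens with low scores would be pruning candidates
```

In a pruning pipeline, one would then keep the top-k tokens by this score and drop (or later recover) the rest.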
Adviser: Niraj Jha