PrefillOnly: An Inference Engine for Prefill-only Workloads in Large Language Model Applications
Best AI papers explained - A podcast by Enoch H. Kang

The research introduces PrefillOnly, a novel inference engine designed specifically for Large Language Models (LLMs) used in discriminative tasks, where only a single output token is generated. Unlike traditional LLM engines optimized for variable-length outputs, PrefillOnly sharply reduces GPU memory consumption by storing only the Key-Value (KV) cache of the last computed layer and by using hybrid prefilling to keep intermediate tensor sizes manageable. Its Job Completion Time (JCT)-aware scheduling continuously recalibrates its estimates based on prefix cache hits, improving throughput and reducing latency and outperforming existing solutions on these workloads. This approach paves the way for more efficient deployment of LLMs in applications such as recommendations and credit verification.
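
To make the memory argument concrete, here is a minimal sketch (not the paper's implementation) of why prefill-only workloads can avoid retaining the full KV cache: since exactly one token is emitted, no decode step ever re-reads the cache, and each layer's keys/values are consumed only by that layer's own attention, so at most one layer's KV needs to be alive at a time. All names (TinyLayer, prefill_only_forward, the toy dimensions) are illustrative, and the attention is simplified (single head, no causal mask).

```python
import torch

class TinyLayer(torch.nn.Module):
    """Toy single-head attention block standing in for a transformer layer."""
    def __init__(self, d_model: int):
        super().__init__()
        self.q = torch.nn.Linear(d_model, d_model)
        self.k = torch.nn.Linear(d_model, d_model)
        self.v = torch.nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        q, k, v = self.q(x), self.k(x), self.v(x)
        attn = torch.softmax(q @ k.transpose(-2, -1) / x.shape[-1] ** 0.5, dim=-1)
        out = attn @ v
        # k and v go out of scope here: no decode step will follow in a
        # prefill-only request, so this layer's KV is never needed again.
        return x + out

def prefill_only_forward(layers: torch.nn.ModuleList, x: torch.Tensor) -> torch.Tensor:
    # Only the currently executing layer's KV tensors are alive at any point;
    # the previous layer's k/v are freed once its local references are dropped.
    # Peak KV memory is therefore one layer's worth, not num_layers' worth.
    for layer in layers:
        x = layer(x)
    return x[:, -1, :]  # hidden state that yields the single output token

layers = torch.nn.ModuleList([TinyLayer(64) for _ in range(4)])
hidden = torch.randn(1, 128, 64)              # (batch, prompt_len, d_model)
last_hidden = prefill_only_forward(layers, hidden)
print(last_hidden.shape)                      # torch.Size([1, 64])
```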
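
The scheduling idea can also be sketched. Because a prefill-only request's runtime is roughly proportional to the number of prompt tokens that must actually be computed (those not covered by a prefix-cache hit), its JCT can be estimated up front and recalibrated from observed completions. The sketch below is a hypothetical illustration of that idea using a shortest-estimated-JCT policy; the class and method names, and the specific policy and calibration rule, are assumptions, not the paper's API.

```python
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class Request:
    est_jct_ms: float                      # heap orders by estimated JCT
    req_id: int = field(compare=False)
    prompt_len: int = field(compare=False)

class JCTAwareScheduler:
    def __init__(self, ms_per_token: float = 0.05):
        self.ms_per_token = ms_per_token   # per-token cost, calibrated online
        self.queue: list[Request] = []

    def admit(self, req_id: int, prompt_len: int, cached_prefix_len: int) -> None:
        # Tokens covered by a prefix-cache hit cost ~nothing to re-prefill,
        # so only the uncached suffix drives the JCT estimate.
        new_tokens = max(prompt_len - cached_prefix_len, 0)
        heapq.heappush(self.queue, Request(new_tokens * self.ms_per_token,
                                           req_id, prompt_len))

    def next_request(self) -> Request | None:
        # Running the shortest estimated job first keeps mean latency low.
        return heapq.heappop(self.queue) if self.queue else None

    def calibrate(self, computed_tokens: int, observed_ms: float) -> None:
        # Continuously re-fit the per-token cost from completed requests so
        # estimates track actual cache behavior and hardware conditions.
        if computed_tokens > 0:
            obs = observed_ms / computed_tokens
            self.ms_per_token = 0.9 * self.ms_per_token + 0.1 * obs
```

The key property this captures from the summary is that scheduling decisions depend on prefix cache hits and are continuously recalibrated, rather than assuming a fixed cost per prompt token.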