vLLM is a fast, easy-to-use library for LLM inference and serving, originally developed in UC Berkeley's Sky Computing Lab. It maximizes throughput by combining PagedAttention, which manages the attention key-value (KV) cache in blocks, with continuous batching of incoming requests to keep GPUs highly utilized. It achieves up to 24× higher throughput than HuggingFace Transformers and up to 3.5× higher than Text Generation Inference (TGI), while wasting far less KV cache memory. In January 2025, the vLLM V1 alpha was released with a 1.7× speedup, zero-overhead prefix caching, and enhanced multimodal support. In May 2025, vLLM became a PyTorch Foundation hosted project. It supports NVIDIA, AMD, Intel, Arm, PowerPC, and TPU hardware, and is used in llm-d, a Kubernetes-native serving stack from Red Hat, Google Cloud, IBM, NVIDIA, and CoreWeave.
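
To make the block-based idea behind PagedAttention more concrete, the following toy sketch allocates KV cache space in fixed-size blocks on demand, so a sequence never reserves more than one partially filled block. It is an illustration under stated assumptions, not vLLM's implementation: the class name `ToyPagedCache`, its methods, and the block size of 16 tokens are all hypothetical choices made for this example.

```python
class ToyPagedCache:
    """Toy block allocator (hypothetical, not vLLM code).

    Each sequence owns a block table: a list of physical block ids.
    A new block is taken from the free list only when the last one is full.
    """

    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.seq_lens = {}      # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve room for one more token of the given sequence."""
        n = self.seq_lens.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        # Allocate only when every block already held is completely full.
        if n == len(table) * self.block_size:
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = n + 1

    def free(self, seq_id):
        """Return all blocks of a finished sequence to the free list."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)

    def wasted_slots(self):
        """Token slots reserved but not yet filled, across all sequences."""
        reserved = sum(len(t) for t in self.block_tables.values()) * self.block_size
        return reserved - sum(self.seq_lens.values())


cache = ToyPagedCache(num_blocks=8, block_size=16)
for _ in range(20):
    cache.append_token("a")  # 20 tokens span two blocks
for _ in range(5):
    cache.append_token("b")  # 5 tokens fit in one block
print(cache.block_tables, cache.wasted_slots())
```

In this sketch the unused space per sequence is bounded by one block, whereas reserving a contiguous region sized for the maximum sequence length would leave most of that region idle for short requests. Blocks freed by a finished sequence become immediately available to others, which is the property that lets a scheduler keep admitting new requests into a running batch.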

vLLM
