llm.info

vLLM

Open Source
Library

High-throughput LLM inference engine

About

vLLM is a fast, easy-to-use library for LLM inference and serving, originally developed in UC Berkeley's Sky Computing Lab. It maximizes throughput with PagedAttention, which manages the KV cache in blocks to cut memory waste, and with continuous batching of incoming requests to keep GPU utilization high. The project's launch benchmarks reported up to 24× higher throughput than HuggingFace Transformers and up to 3.5× higher than TGI. In January 2025, the vLLM V1 alpha was released with up to a 1.7× speedup, zero-overhead prefix caching, and enhanced multimodal support. In May 2025, vLLM became a PyTorch Foundation hosted project. It supports NVIDIA, AMD, Intel, Arm, PowerPC, and TPU hardware, and it is used in llm-d, a Kubernetes-native serving stack backed by Red Hat, Google Cloud, IBM, NVIDIA, and CoreWeave.
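
The KV cache savings attributed to PagedAttention can be illustrated with a toy calculation in Python. This is a rough sketch, not vLLM's implementation: the 16-token block size, the 2048-token reservation, the sample sequence lengths, and the function names are all assumptions chosen for illustration. It compares reserving a full maximum-length buffer per sequence against reserving only the fixed-size blocks each sequence actually needs.

```python
import math

BLOCK_SIZE = 16  # tokens per KV-cache block (illustrative assumption)


def contiguous_slots(seq_lens, max_len):
    """Token slots reserved when every sequence pre-allocates max_len."""
    return len(seq_lens) * max_len


def paged_slots(seq_lens, block_size=BLOCK_SIZE):
    """Token slots reserved when each sequence holds only whole blocks it needs."""
    return sum(math.ceil(n / block_size) * block_size for n in seq_lens)


def waste_fraction(reserved, seq_lens):
    """Share of reserved slots that hold no token."""
    return (reserved - sum(seq_lens)) / reserved


# Hypothetical batch: current length (in tokens) of four running sequences.
seq_lens = [37, 512, 120, 9]
max_len = 2048  # hypothetical per-sequence reservation for the contiguous case

contig = contiguous_slots(seq_lens, max_len)
paged = paged_slots(seq_lens)

print(f"contiguous: {contig} slots, waste {waste_fraction(contig, seq_lens):.1%}")
print(f"paged:      {paged} slots, waste {waste_fraction(paged, seq_lens):.1%}")
```

With these made-up lengths, the contiguous scheme reserves 8192 slots for 678 tokens, while the paged scheme reserves 704. In the paged case the waste is bounded by less than one block per sequence, which is why smaller per-sequence reservations leave room to batch more requests on the same GPU.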

Compatibility

Supported Languages

Python
C++

Details

Category
Library