vLLM is a fast and easy-to-use library for LLM inference and serving. It provides high-throughput, memory-efficient inference for large language models (LLMs) using state-of-the-art serving techniques, including:

- PagedAttention for efficient KV cache memory management
- Continuous batching of incoming requests
- Optimized CUDA kernels (on supported platforms)
- Hugging Face model compatibility
- Various decoding algorithms, including parallel sampling and beam search
- An OpenAI-compatible API server

On FreeBSD, vLLM runs in CPU/empty device mode (`VLLM_TARGET_DEVICE=empty`), providing pure Python inference without GPU acceleration.
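As a sketch, installing with the empty device target described above might look like the following. This is illustrative: the `VLLM_TARGET_DEVICE` environment variable is read by vLLM's build at install time, but the exact package versions and any FreeBSD-specific build prerequisites are assumptions here, not guarantees.

```shell
# Select the "empty" device target so no GPU kernels are compiled.
# vLLM's setup reads this environment variable during installation.
export VLLM_TARGET_DEVICE=empty

# Install from PyPI (or from a source checkout with `pip install .`).
pip install vllm
```

After installation, the usual Python entry points (`vllm.LLM`, the OpenAI-compatible server) are importable, but inference runs without GPU acceleration as noted above.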