LLM Deployment and Serving¶
Getting models from research to production. High-performance inference engines, serving frameworks, and optimization tools that make LLMs fast, cost-efficient, and scalable.
Inference Engines¶
| Name | Description | Links |
|---|---|---|
| vLLM | High-throughput and memory-efficient inference and serving engine for LLMs with PagedAttention. | Docs |
| TGI (Text Generation Inference) | Hugging Face's production-ready inference server for LLMs. | Docs |
| BentoML | Framework for building reliable, scalable, and cost-efficient AI applications with model serving. | bentoml.com |
| LMDeploy | Toolkit for compressing, deploying, and serving LLMs with efficient quantization and inference. | - |
| MLC LLM | Machine Learning Compilation for LLMs: enables native deployment of any LLM on diverse hardware. | mlc.ai |
| LightLLM | Lightweight, high-performance LLM inference framework. | - |
| FastLLM | Efficient and easy-to-use LLM inference library for CPU/GPU. | - |
| DeepSpeed-MII | Model Implementations for Inference by DeepSpeed. Low-latency, low-cost inference. | - |
| CTranslate2 | Fast inference engine for Transformer models with quantization, pruning, and optimized execution. | - |
| OpenLLM | Operating LLMs in production. Fine-tuning, serving, deploying, and monitoring. | - |
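Several of the engines above (vLLM most prominently) owe their memory efficiency to PagedAttention: instead of reserving one contiguous KV-cache region per sequence, the cache is carved into fixed-size blocks, and each sequence keeps a block table mapping its logical positions to physical blocks in a shared pool. A minimal toy sketch of that allocation scheme, in plain Python with illustrative names (this is not vLLM's actual API):

```python
class BlockManager:
    """Toy KV-cache block allocator in the spirit of PagedAttention:
    sequences grow block-by-block from a shared pool, so memory is
    committed on demand rather than reserved up front."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # shared physical pool
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.seq_lens = {}      # seq_id -> tokens cached so far

    def append_token(self, seq_id: str) -> None:
        """Record one more token's KV entries for a sequence,
        allocating a fresh block only on a block boundary."""
        table = self.block_tables.setdefault(seq_id, [])
        n = self.seq_lens.get(seq_id, 0)
        if n % self.block_size == 0:  # current block full (or first token)
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted; a real engine would preempt")
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = n + 1

    def free(self, seq_id: str) -> None:
        """Finished sequences return their blocks to the pool immediately."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)


# Example: 5 tokens with block_size=4 occupy only 2 blocks.
mgr = BlockManager(num_blocks=4, block_size=4)
for _ in range(5):
    mgr.append_token("req-a")
```

The internal fragmentation is bounded by one partially filled block per sequence, which is why vLLM can pack many more concurrent sequences into the same GPU memory than contiguous pre-allocation allows.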
Production AI Orchestration & Cloud Platforms¶
Training a frontier model is only half the battle; orchestrating it in production is the other. These platforms handle the end-to-end lifecycle: model hosting, autoscaling, monitoring, A/B testing, and support for massive context windows at enterprise scale.
| Name | Description | Links |
|---|---|---|
| Google Vertex AI | Google Cloud's unified AI platform. Model Garden (100+ models including Gemini), managed pipelines, AutoML, and Gemini API with native 1M+ token context support. The production backend for Gemini-scale deployments. | Docs |
| Amazon Bedrock | Fully managed service for building generative AI applications. Access Claude, Llama, Titan, and more via unified API. Guardrails, RAG (Knowledge Bases), fine-tuning, and Agents for multi-step tasks. | Docs |
| Amazon SageMaker | Complete ML platform for building, training, and deploying models at scale. SageMaker HyperPod for distributed training, JumpStart for model hub, and real-time + batch inference endpoints. | Docs |
| Azure AI Studio | Microsoft's platform for building and deploying enterprise AI. Unified model catalog (OpenAI, Meta, Mistral), prompt flow orchestration, content safety, and Azure OpenAI Service for GPT-4/o1 deployment. | Docs |
| NVIDIA Triton Inference Server | Production inference serving for any framework (TensorFlow, PyTorch, ONNX, vLLM, TensorRT-LLM). Dynamic batching, model ensembles, multi-GPU, and multi-node inference. The standard for high-throughput GPU inference. | Docs, GitHub |
| KServe | Kubernetes-native model serving. Autoscaling (including scale-to-zero), canary deployments, request batching, and GPU inference on Kubernetes. CNCF-backed. | Docs, GitHub |
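A core throughput technique shared by these serving stacks is server-side dynamic batching (Triton's dynamic batcher, and similar request batching in KServe): individual requests are queued and grouped into one model invocation when either a preferred batch size is reached or a maximum queuing delay expires. A minimal sketch of that scheduling policy, with a simulated clock and illustrative names rather than Triton's actual configuration or API:

```python
from collections import deque


class DynamicBatcher:
    """Toy dynamic batcher: trade a bounded wait for larger, more
    GPU-efficient batches. Time is an integer tick for determinism."""

    def __init__(self, max_batch_size: int, max_delay_ticks: int):
        self.max_batch_size = max_batch_size
        self.max_delay_ticks = max_delay_ticks
        self.queue = deque()  # (request, arrival_tick)

    def submit(self, request, tick: int) -> None:
        self.queue.append((request, tick))

    def poll(self, tick: int):
        """Return a batch to run now, or None if it pays to keep waiting."""
        if not self.queue:
            return None
        full = len(self.queue) >= self.max_batch_size
        expired = tick - self.queue[0][1] >= self.max_delay_ticks
        if full or expired:
            count = min(self.max_batch_size, len(self.queue))
            return [self.queue.popleft()[0] for _ in range(count)]
        return None


# Example: two requests wait until a third fills the batch.
batcher = DynamicBatcher(max_batch_size=3, max_delay_ticks=5)
batcher.submit("r1", tick=0)
batcher.submit("r2", tick=1)
```

Real servers expose these as tunables (e.g. preferred batch sizes and a max queue delay in Triton's model configuration); the trade-off is always added tail latency for the oldest queued request in exchange for higher aggregate throughput.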