LLM Deployment and Serving¶
Getting models from research to production. High-performance inference engines, serving frameworks, and optimization tools that make LLMs fast, cost-efficient, and scalable.
Inference Engines¶
| Name | Description | Links |
|---|---|---|
| vLLM | High-throughput and memory-efficient inference and serving engine for LLMs with PagedAttention. | Docs |
| TGI (Text Generation Inference) | Hugging Face's production-ready inference server for LLMs. | Docs |
| BentoML | Framework for building reliable, scalable, and cost-efficient AI applications with model serving. | bentoml.com |
| LMDeploy | Toolkit for compressing, deploying, and serving LLMs with efficient quantization and inference. | - |
| MLC LLM | Machine Learning Compilation for LLMs: enables native deployment of any LLM on diverse hardware. | mlc.ai |
| LightLLM | Lightweight, high-performance LLM inference framework. | - |
| FastLLM | Efficient and easy-to-use LLM inference library for CPU/GPU. | - |
| DeepSpeed-MII | Model Implementations for Inference by DeepSpeed. Low-latency, low-cost inference. | - |
| CTranslate2 | Fast inference engine for Transformer models with quantization, pruning, and optimized execution. | - |
| OpenLLM | Operating LLMs in production. Fine-tuning, serving, deploying, and monitoring. | - |
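Several of the engines above (vLLM most prominently) owe their memory efficiency to PagedAttention: instead of reserving one contiguous KV-cache region per sequence, the cache is carved into fixed-size blocks, and each sequence keeps a block table mapping its logical positions to physical blocks in a shared pool. A minimal toy sketch of that allocation scheme, in plain Python with illustrative names (this is not vLLM's actual API):

```python
class BlockManager:
    """Toy KV-cache block allocator in the spirit of PagedAttention:
    sequences grow block-by-block from a shared pool, so memory is
    committed on demand rather than reserved up front."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # shared physical pool
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.seq_lens = {}      # seq_id -> tokens cached so far

    def append_token(self, seq_id: str) -> None:
        """Record one more token's KV entries for a sequence,
        allocating a fresh block only on a block boundary."""
        table = self.block_tables.setdefault(seq_id, [])
        n = self.seq_lens.get(seq_id, 0)
        if n % self.block_size == 0:  # current block full (or first token)
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted; a real engine would preempt")
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = n + 1

    def free(self, seq_id: str) -> None:
        """Finished sequences return their blocks to the pool immediately."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)


# Example: 5 tokens with block_size=4 occupy only 2 blocks.
mgr = BlockManager(num_blocks=4, block_size=4)
for _ in range(5):
    mgr.append_token("req-a")
```

The internal fragmentation is bounded by one partially filled block per sequence, which is why vLLM can pack many more concurrent sequences into the same GPU memory than contiguous pre-allocation allows.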
Production AI Orchestration & Cloud Platforms¶
Training a frontier model is only half the battle; orchestrating it in production is the other. These platforms handle the end-to-end lifecycle: model hosting, autoscaling, monitoring, A/B testing, and support for massive context windows at enterprise scale.
| Name | Description | Links |
|---|---|---|
| Google Vertex AI | Google Cloud's unified AI platform. Model Garden (100+ models including Gemini), managed pipelines, AutoML, and Gemini API with native 1M+ token context support. The production backend for Gemini-scale deployments. | Docs |
| Amazon Bedrock | Fully managed service for building generative AI applications. Access Claude, Llama, Titan, and more via unified API. Guardrails, RAG (Knowledge Bases), fine-tuning, and Agents for multi-step tasks. | Docs |
| Amazon SageMaker | Complete ML platform for building, training, and deploying models at scale. SageMaker HyperPod for distributed training, JumpStart for model hub, and real-time + batch inference endpoints. | Docs |
| Azure AI Studio | Microsoft's platform for building and deploying enterprise AI. Unified model catalog (OpenAI, Meta, Mistral), prompt flow orchestration, content safety, and Azure OpenAI Service for GPT-4/o1 deployment. | Docs |
| NVIDIA Triton Inference Server | Production inference serving for any framework (TensorFlow, PyTorch, ONNX, vLLM, TensorRT-LLM). Dynamic batching, model ensembles, multi-GPU, and multi-node inference. The standard for high-throughput GPU inference. | Docs, GitHub |
| KServe | Kubernetes-native model serving. Autoscaling (including scale-to-zero), canary deployments, request batching, and GPU inference on Kubernetes. CNCF-backed. | Docs, GitHub |
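A core throughput technique shared by these serving stacks is server-side dynamic batching (Triton's dynamic batcher, and similar request batching in KServe): individual requests are queued and grouped into one model invocation when either a preferred batch size is reached or a maximum queuing delay expires. A minimal sketch of that scheduling policy, with a simulated clock and illustrative names rather than Triton's actual configuration or API:

```python
from collections import deque


class DynamicBatcher:
    """Toy dynamic batcher: trade a bounded wait for larger, more
    GPU-efficient batches. Time is an integer tick for determinism."""

    def __init__(self, max_batch_size: int, max_delay_ticks: int):
        self.max_batch_size = max_batch_size
        self.max_delay_ticks = max_delay_ticks
        self.queue = deque()  # (request, arrival_tick)

    def submit(self, request, tick: int) -> None:
        self.queue.append((request, tick))

    def poll(self, tick: int):
        """Return a batch to run now, or None if it pays to keep waiting."""
        if not self.queue:
            return None
        full = len(self.queue) >= self.max_batch_size
        expired = tick - self.queue[0][1] >= self.max_delay_ticks
        if full or expired:
            count = min(self.max_batch_size, len(self.queue))
            return [self.queue.popleft()[0] for _ in range(count)]
        return None


# Example: two requests wait until a third fills the batch.
batcher = DynamicBatcher(max_batch_size=3, max_delay_ticks=5)
batcher.submit("r1", tick=0)
batcher.submit("r2", tick=1)
```

Real servers expose these as tunables (e.g. preferred batch sizes and a max queue delay in Triton's model configuration); the trade-off is always added tail latency for the oldest queued request in exchange for higher aggregate throughput.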