Inference is the process of running a trained AI model to generate predictions or outputs, as opposed to training, which adjusts the model's weights. For LLMs, inference means autoregressive token generation: predicting one token at a time, left to right. Efficient inference is critical for cost and latency in production AI systems.
How LLM inference works
LLM inference has two distinct phases with very different computational characteristics — understanding them is key to optimizing serving cost and latency.
| Phase | What happens | Compute type | Bottleneck |
|---|---|---|---|
| Prefill | All input (prompt) tokens processed in a single parallel forward pass | Compute-bound: all tokens computed in parallel | FLOPS — add more GPUs to speed up |
| Decode | Generate output tokens one at a time, each requiring a full forward pass | Memory-bandwidth-bound: reads all KV cache per step | GPU memory bandwidth — hard to parallelize |
Autoregressive generation: at each step t, the model produces hidden state h_t from all preceding tokens, projects it to the vocabulary via the unembedding matrix W_U, and samples the next token. With a KV cache, each step computes queries, keys, and values only for the new token; attention reads the keys and values of past tokens from the cache instead of recomputing them.
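The two phases can be sketched in a few lines, using toy vectors instead of a real transformer (keys and values here are just the input vectors, and "sampling" is an argmax over dot-product scores; all names are illustrative):

```python
# Toy sketch of prefill + cached decode (NOT a real transformer).
def generate(prompt_vecs, steps):
    k_cache, v_cache = [], []

    # --- Prefill: every prompt token is processed in one pass,
    # populating the KV cache for all positions at once.
    for x in prompt_vecs:
        k_cache.append(x)
        v_cache.append(x)

    h = prompt_vecs[-1]      # hidden state at the last prompt position
    chosen = []
    for _ in range(steps):
        # --- Decode step: attention must READ the entire cache...
        scores = [sum(a * b for a, b in zip(h, k)) for k in k_cache]
        best = max(range(len(scores)), key=scores.__getitem__)
        chosen.append(best)  # stand-in for "sample next token"
        h = v_cache[best]
        # ...but only WRITES one new K/V pair (the new token's),
        # so per-step compute is O(n) rather than O(n^2).
        k_cache.append(h)
        v_cache.append(h)
    return chosen, len(k_cache)

toks, cache_len = generate([[1.0, 0.0], [0.0, 1.0]], steps=3)
# the cache ends with (prompt length + generated length) entries
```

The read/write asymmetry in the decode loop is the whole story: each step touches the full cache but appends only one entry.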
Why decode is memory-bandwidth-bound
During decode, generating each token requires reading all model weights (~140GB for a 70B model in FP16) from GPU HBM into on-chip SRAM, while performing only a tiny amount of compute (one token's worth). An H100 offers ~3.35TB/s of HBM bandwidth but roughly 1,000 TFLOPS of dense FP16 compute. A single-token forward pass does about 2 FLOPs per parameter while reading 2 bytes per parameter, so the GPU sits memory-bound by a factor of several hundred. This is why batching decode requests dramatically improves GPU utilization: it amortizes the weight-read cost across many concurrent tokens.
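A back-of-envelope version of this arithmetic (hardware numbers are rough datasheet values, not measurements):

```python
# Why single-token decode under-utilizes an H100: compare the time to
# stream the weights against the time to do the math.
params = 70e9
bytes_per_step = params * 2               # FP16 weights read from HBM: ~140 GB
flops_per_token = 2 * params              # ~2 FLOPs per parameter per token

hbm_bw = 3.35e12                          # H100 HBM bandwidth, bytes/s
peak_flops = 1.0e15                       # ~1,000 TFLOPS dense FP16

t_memory = bytes_per_step / hbm_bw        # ~42 ms just to read the weights
t_compute = flops_per_token / peak_flops  # ~0.14 ms of actual compute

imbalance = t_memory / t_compute          # memory slower by roughly 300x
# A decode batch of about this many concurrent tokens would balance the
# two, which is why batching recovers GPU utilization.
```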
KV cache and memory management
The KV cache is the single most important memory structure in LLM inference — understanding it explains why long contexts are expensive and why batching is complex.
KV cache memory formula: 2 (K and V) × layers × KV heads × head_dim × sequence_length (T) × batch_size (B) × bytes per element. For LLaMA 3 70B (80 layers, 8 KV heads via GQA, head_dim 128): 2 × 80 × 8 × 128 × T × B × 2 bytes (FP16) ≈ 0.33MB per token per batch element. At T=8K, B=32: ~86GB just for KV cache.
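The formula translates directly to code (dimensions are LLaMA 3 70B's published config; the helper name is ours):

```python
# KV cache size per the formula above. LLaMA 3 70B: 80 layers,
# 8 KV heads (GQA), head_dim 128; FP16 is 2 bytes per element.
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch, dtype_bytes=2):
    return 2 * layers * kv_heads * head_dim * seq_len * batch * dtype_bytes

per_token = kv_cache_bytes(80, 8, 128, seq_len=1, batch=1)   # 327,680 bytes
total = kv_cache_bytes(80, 8, 128, seq_len=8192, batch=32)   # ~86 GB (80 GiB)
```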
| Technique | What it does | Benefit |
|---|---|---|
| KV cache | Cache past key-value pairs; reuse in each decode step | O(n²) → O(n) compute; enables long contexts |
| PagedAttention (vLLM) | Pages KV cache like OS virtual memory; shares pages across requests | 20–40% higher GPU utilization; no memory fragmentation |
| Quantized KV cache | Store KV cache in INT8 or FP8 instead of FP16 | 2–4× memory reduction with minimal quality loss |
| Sliding window attention | Only keep KV cache for the last W tokens (e.g., W=4096) | O(W) memory instead of O(n), at the cost of long-range attention |
| Multi-Query / Grouped-Query Attention (MQA/GQA) | Share K and V across query heads: one KV head (MQA) or a few KV groups (GQA) | 8–32× smaller KV cache (GQA is used in LLaMA 3, Mistral, Gemma) |
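Sliding-window eviction from the table above fits in a few lines; a bounded deque makes the O(W) bound concrete (W and the key/value strings are made up for illustration):

```python
from collections import deque

# Sketch: a sliding-window KV cache keeping only the last W positions,
# so memory stays O(W) no matter how long generation runs.
W = 4
kv = deque(maxlen=W)                   # evicts the oldest entry automatically

for pos in range(10):                  # "generate" 10 tokens
    kv.append(("k%d" % pos, "v%d" % pos))

positions = [k for k, _ in kv]         # only the last W entries survive
```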
Inference optimization techniques
| Technique | Mechanism | Speedup | Tradeoff |
|---|---|---|---|
| Speculative decoding | Draft model generates k tokens fast; main model verifies all k in one parallel pass | 2–3× | Requires matching draft model; benefit varies with acceptance rate |
| Continuous batching | Process tokens from multiple requests in same batch; replace finished sequences immediately | 5–10× throughput | Higher latency for individual requests |
| FlashAttention 2/3 | Fused attention kernel keeps Q, K, V tiles in fast SRAM; avoids HBM round-trips | 2–4× attention speed, 5–20× less attention memory | GPU-specific kernels (CUDA-first; ROCm support lags) |
| Tensor parallelism | Split attention heads or FFN dimensions across GPUs; all-reduce each layer | Linear with # GPUs | Communication overhead; needs fast interconnect (NVLink) |
| Pipeline parallelism | Different model layers on different GPUs; micro-batching hides pipeline bubbles | Near-linear throughput with # GPUs | Bubble overhead; no per-request latency gain |
| AWQ / GPTQ quantization | Quantize weights to INT4/INT8; reduce memory bandwidth bottleneck | 1.5–4× throughput | Slight quality loss; calibration required |
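Continuous batching from the table above can be sketched as a toy step-counting scheduler (no real model; requests are just remaining token counts, and the function name is ours):

```python
# Toy continuous-batching scheduler: at every decode step, finished
# sequences are swapped out for waiting requests, so slots never idle.
def serve(request_lengths, batch_size):
    waiting = list(request_lengths)    # tokens each request still needs
    running = []
    steps = 0
    while waiting or running:
        # iteration-level scheduling: fill free slots immediately
        while waiting and len(running) < batch_size:
            running.append(waiting.pop(0))
        steps += 1                     # one decode step for the whole batch
        running = [r - 1 for r in running if r > 1]
    return steps

# With batch_size=2, the three 10-token requests ride along in slots
# that static batching would leave idle next to the 1000-token request.
steps = serve([10, 1000, 10, 10], batch_size=2)
```

Static batching on the same workload would run the batch [10, 1000] for 1000 steps (request 1's slot idle for 990 of them) and then [10, 10] for 10 more; continuous batching finishes in 1000 steps total.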
Speculative decoding in depth
Speculative decoding works because: (1) generating draft tokens with a small model (e.g., 3B) is much faster than with the main model (70B), and (2) verifying k tokens in parallel with the main model costs no more than generating one token, since the forward pass has the same shape. With a ~80% token acceptance rate, this yields roughly 2–3× speedup while producing output identical to running the main model alone. Frontier providers are widely reported to run speculative decoding in production; Medusa (self-speculative decoding with multiple extra heads) avoids the need for a separate draft model.
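The draft/verify loop can be shown with deterministic toy "models" (both functions are invented stand-ins; `target_next` plays the expensive model, `draft_next` a cheap approximation that agrees about 80% of the time):

```python
def target_next(seq):
    return (seq[-1] * 3 + 1) % 97          # stand-in for one big-model step

def draft_next(seq):
    # correct whenever seq[-1] % 5 != 0, i.e. roughly 80% acceptance
    return (seq[-1] * 3 + 1) % 97 if seq[-1] % 5 else 0

def speculative_generate(prompt, n_new, k=4):
    seq, target_calls = list(prompt), 0
    while len(seq) < len(prompt) + n_new:
        drafts = []
        for _ in range(k):                 # draft proposes k tokens (cheap)
            drafts.append(draft_next(seq + drafts))
        # One target "pass": in a real model the k verifications below
        # share a single parallel forward pass; here we just count passes.
        target_calls += 1
        for i in range(k):
            t = target_next(seq)           # target's token at this position
            seq.append(t)                  # always keep the target's choice
            if t != drafts[i]:
                break                      # first mismatch: discard the rest
        else:
            seq.append(target_next(seq))   # all k accepted: free bonus token
    return seq[len(prompt):len(prompt) + n_new], target_calls

def greedy(prompt, n_new):                 # reference: target model alone
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(target_next(seq))
    return seq[len(prompt):]

out, calls = speculative_generate([1], n_new=20)
# out matches greedy([1], 20) exactly, in far fewer target passes
```

Because the verifier always keeps the target's token at every position, the output is bit-identical to greedy decoding with the target model; the draft only decides how many tokens each expensive pass yields.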
LLM inference infrastructure in 2025
| Framework / Service | Type | Best for | Key feature |
|---|---|---|---|
| vLLM | Open-source server | Production throughput-optimized serving | PagedAttention, continuous batching, multi-LoRA |
| Ollama | Open-source local | Local dev, single-machine serving | One-command model download + serve; GGUF support |
| llama.cpp | Open-source library | CPU inference, low-VRAM GPU, edge deployment | Quantized GGUF; CPU+GPU split; runs on MacBooks |
| TensorRT-LLM | NVIDIA framework | Maximum performance on NVIDIA GPUs | FP8, kernel fusion, speculative decoding; H100 optimized |
| SGLang | Open-source server | Structured generation, complex multi-call workflows | RadixAttention (KV cache sharing across similar prefixes) |
| Groq LPU | Cloud inference | Fastest token generation speed | Custom LPU chip: 500+ tokens/sec on 70B; not cheapest |
| Together AI / Fireworks | Managed API | Cheap open-source model inference | Per-token pricing, open-source model access |
| AWS Bedrock / Vertex AI | Enterprise managed | Enterprise compliance + multi-provider access | SLA, VPC, audit logging, fine-tune hosting |
Cost benchmark (early 2025)
GPT-4o: ~$10–15/M output tokens. Claude 3.5 Sonnet: ~$15/M. Llama 3.1 70B via Together AI: ~$0.88/M. Self-hosted Llama 3.1 70B on vLLM (4× A100 80GB): ~$0.20/M at full utilization. The 50–75× cost gap between frontier closed models and self-hosted open-source explains why companies with high token volumes increasingly fine-tune open-source models for production.
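The gap, and the throughput assumption buried in the self-hosted figure, as simple arithmetic (the $2.50/GPU-hour A100 rate is an assumption; the other figures come from the text above):

```python
# Per-million-output-token prices quoted above.
api_cost = {"gpt-4o": 10.0, "claude-3.5-sonnet": 15.0,
            "llama-3.1-70b-together": 0.88}        # $/M output tokens
self_hosted = 0.20                                  # $/M, vLLM on 4x A100

gap = {name: round(c / self_hosted) for name, c in api_cost.items()}

# Aggregate throughput the $0.20/M figure implies at full utilization:
gpu_hourly = 4 * 2.50                               # assumed cluster $/hour
implied_tps = gpu_hourly * 1e6 / (self_hosted * 3600)   # ~13,900 tok/s
```

Note the implied ~14K aggregate tokens/second only holds with heavy continuous batching; a lightly loaded cluster costs the same per hour while producing far fewer tokens.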
Latency vs throughput tradeoffs
Latency and throughput are fundamentally in tension for LLM serving — optimizing one hurts the other. Choosing the right operating point depends on your use case.
| Metric | Definition | Typical target | Critical for |
|---|---|---|---|
| TTFT (Time To First Token) | Time from request sent to first token received | <500ms for interactive | Chatbots, coding assistants — perceived responsiveness |
| TPOT (Time Per Output Token) | Average time between consecutive output tokens | <50ms (~20 tok/s) | Streaming readability — faster than human reading speed |
| End-to-end latency | Total time from request to complete response | <5s for short responses | Non-streaming batch use cases |
| Throughput (tokens/sec) | Total tokens generated per second across all requests | Maximize for batch | Document processing, offline summarization pipelines |
| Requests per second (RPS) | Rate of completed requests per second | Varies with batch size and output length | API scaling, cost efficiency |
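The first three metrics fall out of per-token receive timestamps; a sketch of how a benchmark harness might compute them (function name and example timings are ours):

```python
# TTFT / TPOT / end-to-end latency from token timestamps (seconds).
def latency_metrics(t_request, token_times):
    ttft = token_times[0] - t_request
    e2e = token_times[-1] - t_request
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    tpot = sum(gaps) / len(gaps)           # mean inter-token gap
    return ttft, tpot, e2e

# e.g. first token 400 ms after the request, then one token every 40 ms
times = [0.4 + 0.04 * i for i in range(50)]
ttft, tpot, e2e = latency_metrics(0.0, times)
# ttft = 0.4s (borderline), tpot = 40 ms (25 tok/s, meets the target)
```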
Streaming and perceived latency
Streaming (Server-Sent Events, SSE) returns tokens as they are generated — the user sees text appearing word-by-word rather than waiting for the full response. This dramatically improves perceived responsiveness even if total generation time is identical. A response that takes 5s to complete feels fast if you see the first tokens in 200ms. All major LLM APIs (OpenAI, Anthropic, Groq) support streaming; always use it for interactive applications.
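The perceived-latency effect is easy to demonstrate with a simulated stream (the generator stands in for reading SSE chunks from an API; the TTFT and per-token delays are made-up small values):

```python
import time

def generate_tokens(n, ttft=0.10, tpot=0.02):
    time.sleep(ttft)                  # prefill + queueing before token 1
    for i in range(n):
        if i:
            time.sleep(tpot)
        yield f"tok{i}"

start = time.monotonic()
first_seen = None
tokens = []
for tok in generate_tokens(5):
    if first_seen is None:
        first_seen = time.monotonic() - start   # the UI can render here
    tokens.append(tok)
total = time.monotonic() - start
# first_seen is a fraction of total: identical generation time, but the
# user sees output well before the response is complete.
```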
Practice questions
- What is the difference between throughput and latency in LLM inference, and why is there a fundamental trade-off? (Answer: Latency: time from request to first token (TTFT) + time to generate full response. User-facing — measures how fast responses feel. Throughput: total tokens generated per second across all concurrent users. Server-side — measures capacity. Trade-off: batching requests improves throughput (processes many tokens simultaneously) but increases latency for individual users (must wait for batch). At batch_size=1: minimum latency. At batch_size=256: maximum throughput. Production serving optimises the Pareto frontier between these, using continuous batching to approach both simultaneously.)
- What is continuous batching (iteration-level scheduling) and why did it transform LLM serving? (Answer: Traditional static batching: group N requests together, wait until ALL finish generating, then start next batch. If request 1 finishes in 10 tokens and request 2 in 1000 tokens, request 1's GPU slot sits idle for 990 token-steps. Continuous batching (Orca, vLLM): as soon as a request finishes, its slot is immediately replaced with a new request. The batch changes composition at every token generation step. Result: GPU utilisation goes from ~20% (static) to ~80%+ (continuous). vLLM pioneered this; it is now the standard in all production LLM serving systems.)
- What is TTFT (Time to First Token) and why is it more important than total generation time for user experience? (Answer: TTFT: elapsed time from request submission until the first output token is generated. Covers: network latency + prompt processing (prefill) + scheduling queue wait. User experience: TTFT determines how quickly the UI can show 'something is happening.' A response that streams from token 1 in 500ms feels faster than a response that starts in 2000ms — even if both complete in 5 seconds. This is why streaming is universal in production LLM APIs: show the first token immediately rather than waiting for completion.)
- What hardware is used for LLM inference and what determines model serving cost? (Answer: Primary hardware: NVIDIA H100 (80GB, $30K), H200 (141GB, $40K), A100 (80GB, $10K). AMD MI300X: competitive, gaining traction. Google TPUv5: used for internal Google serving. Cost drivers: (1) GPU VRAM (must hold model weights + KV cache). (2) GPU compute (tokens/sec per GPU). (3) Memory bandwidth (memory-bandwidth-bound decoding phase). Pricing: H100 SXM: $2–3/GPU-hour on cloud. Serving a 70B model: ~4 H100s needed, ~$8-12/hour, ~100 tokens/second → $0.023–0.033/1K output tokens (similar to commercial API pricing).)
- What is PagedAttention (used in vLLM) and how does it reduce memory waste in LLM serving? (Answer: Standard KV cache: pre-allocated contiguously for max_sequence_length. For 2048-token max: 2048 positions reserved even if request only generates 100 tokens → 95% waste. PagedAttention (Kwon et al. 2023): divides KV cache into fixed-size pages (blocks), allocating pages on demand like virtual memory. Non-contiguous pages are accessed via a block table. Result: near-zero internal fragmentation, memory utilisation from 20–30% to 90%+, supports 2–4× more concurrent requests on same hardware. PagedAttention is the core innovation of vLLM.)
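The PagedAttention answer above can be made concrete with a toy block allocator (class and method names are ours; real vLLM additionally shares blocks across requests with copy-on-write):

```python
# Toy paged KV-cache allocator: fixed-size blocks handed out on demand,
# tracked in a per-request block table mapping logical -> physical blocks.
BLOCK = 16                                  # tokens per block

class PagedKVCache:
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))  # physical block ids
        self.tables = {}                     # request -> block table
        self.lens = {}                       # request -> tokens stored

    def append_token(self, req):
        n = self.lens.get(req, 0)
        if n % BLOCK == 0:                   # current block full (or none yet)
            self.tables.setdefault(req, []).append(self.free.pop())
        self.lens[req] = n + 1

    def release(self, req):                  # request finished: blocks reusable
        self.free.extend(self.tables.pop(req, []))
        self.lens.pop(req, None)

cache = PagedKVCache(num_blocks=8)
for _ in range(20):                          # a 20-token sequence needs only
    cache.append_token("req-A")              # ceil(20/16) = 2 blocks, not a
blocks_used = len(cache.tables["req-A"])     # max_seq_len-sized reservation
```

Contrast with contiguous pre-allocation: a 2048-token reservation for this request would pin 128 blocks up front; paging allocates 2 and returns them the moment the request finishes.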