Scaling Model Serving

Created: 2024-02-20 14:14
#quicknote

Horizontal scaling: add more replicas (n+1 deployments)
Vertical scaling (GPUs/TPUs)
- GPUs offer low computational latency, but they are expensive
- use them only when absolutely needed
- best suited to batched requests, where one pass over the hardware serves many inputs
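
Rough sketch of the batching point above: group pending inputs into fixed-size batches so each call to the accelerator handles several requests at once. `run_batched`, `predict_batch`, and `max_batch_size` are hypothetical names for illustration, not any particular serving framework's API.

```python
from typing import Callable, List, Sequence


def run_batched(
    inputs: Sequence[float],
    predict_batch: Callable[[Sequence[float]], List[float]],
    max_batch_size: int = 8,
) -> List[float]:
    """Run `predict_batch` over `inputs` in chunks of at most `max_batch_size`."""
    if max_batch_size < 1:
        raise ValueError("max_batch_size must be at least 1")
    outputs: List[float] = []
    for start in range(0, len(inputs), max_batch_size):
        # One model call per chunk instead of one call per input.
        batch = inputs[start:start + max_batch_size]
        outputs.extend(predict_batch(batch))
    return outputs
```

With 10 inputs and `max_batch_size=4`, the model is called 3 times (batches of 4, 4, and 2) instead of 10 times.
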
Autoscaling
Results caching
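
Minimal sketch of results caching, assuming inputs are hashable (here a tuple of floats) and the model is deterministic. `run_model` is a hypothetical stand-in for the real inference call; the cache itself is the standard library's `functools.lru_cache`.

```python
from functools import lru_cache

# Counts how often the (stand-in) model actually runs.
MODEL_CALLS = {"count": 0}


def run_model(features: tuple) -> float:
    """Hypothetical stand-in for an expensive inference call."""
    MODEL_CALLS["count"] += 1
    return sum(features) / len(features)


@lru_cache(maxsize=1024)
def cached_predict(features: tuple) -> float:
    """Return a cached result for repeated inputs; run the model otherwise."""
    return run_model(features)
```

Calling `cached_predict((1.0, 2.0, 3.0))` twice runs the model only once; the second call is answered from the cache. This only helps when identical requests actually repeat.
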