Scaling Model Serving
Created: 2024-02-20 14:14
#quicknote
- Horizontal scaling: run more replicas (n+1 deployments)
- Vertical scaling (GPUs/TPUs)
  - GPUs offer low compute latency but are expensive
    - use them only when absolutely needed
    - best suited to batched requests (cost is amortized across the batch)
- Autoscaling
- Results caching
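
Rough sketch of the last two bullets, to make them concrete. Assumptions: `desired_replicas` uses the proportional rule that Kubernetes' Horizontal Pod Autoscaler documents (replicas scaled by observed/target load), and `run_model` is a hypothetical stand-in for the real, expensive inference call. All names here are made up for illustration, not from any serving framework.

```python
import math
from functools import lru_cache
from typing import Tuple


def desired_replicas(current_replicas: int, current_load: float, target_load: float) -> int:
    """Autoscaling: scale replica count by the ratio of observed to target load."""
    if target_load <= 0:
        raise ValueError("target_load must be positive")
    # Round up so we never under-provision; keep at least one replica alive.
    return max(1, math.ceil(current_replicas * current_load / target_load))


def run_model(features: Tuple[float, ...]) -> float:
    """Hypothetical placeholder for an expensive model call."""
    return sum(features) / len(features)


@lru_cache(maxsize=1024)
def cached_predict(features: Tuple[float, ...]) -> float:
    """Results caching: identical inputs are answered from memory, not the model."""
    # Inputs must be hashable (hence a tuple, not a list) for lru_cache to work.
    return run_model(features)
```

Note to self: in-process `lru_cache` only helps per replica; with many replicas a shared cache would likely be needed, and caching only makes sense when identical requests actually repeat.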
Tags
#mlops #ml