Scaling Model Serving

Created: 2024-02-20 14:14
#quicknote

  • Horizontal scaling: add more replicas of the model server (n+1 deployments)
  • Vertical scaling (GPUs/TPUs)
    • GPUs offer low compute latency but are expensive
    • use them only when the latency or throughput target can't be met on CPU
    • best suited to batched requests, which keep the device busy
  • Autoscaling
  • Results caching
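
Sketch for the results caching bullet: a minimal in-process LRU cache keyed on a hash of the request payload, so repeated identical requests skip the model call. Everything here is an assumption for illustration: `PredictionCache`, `get_or_compute`, `max_entries`, and `fake_model` are hypothetical names, and a real deployment might use an external store instead of process memory.

```python
import hashlib
import json
from collections import OrderedDict


class PredictionCache:
    """Small LRU cache keyed on a hash of the request payload (illustrative only)."""

    def __init__(self, max_entries=1024):
        self.max_entries = max_entries
        self._store = OrderedDict()
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(payload):
        # Canonical JSON so equal payloads hash the same regardless of key order.
        blob = json.dumps(payload, sort_keys=True).encode("utf-8")
        return hashlib.sha256(blob).hexdigest()

    def get_or_compute(self, payload, predict_fn):
        key = self._key(payload)
        if key in self._store:
            # Cache hit: mark as most recently used and return the stored result.
            self._store.move_to_end(key)
            self.hits += 1
            return self._store[key]
        # Cache miss: call the model, store the result, evict the oldest if full.
        self.misses += 1
        result = predict_fn(payload)
        self._store[key] = result
        if len(self._store) > self.max_entries:
            self._store.popitem(last=False)
        return result


def fake_model(payload):
    # Hypothetical stand-in for an expensive model call.
    return sum(payload["features"])
```

Usage: create one `PredictionCache` per server process and route requests through `cache.get_or_compute(payload, model_fn)`. This only makes sense when the model is deterministic for a given input and the payload is JSON-serializable.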

Tags

#mlops #ml