Building Resiliency In Serving

Created: 2024-02-20 14:17
#quicknote

Without resiliency, ML solutions would suffer from inconsistency, customer concerns, and loss of value.

Model resiliency:

  • validate inputs during inference
  • track resources utilization to identify choke points
  • monitor operational metrics to identify degradation
  • measure model drift and look for decay
  • analyze model performance for bias

Service resiliency:

  • add redundant nodes for tasks and services
  • implement autoscaling to handle sudden changes in load
  • throttle incoming requests to alleviate sudden bursts
  • deploy in additional locations for geo-resiliency
  • use redundant storage schemes to handle disk outages

Solution resiliency:

  • impact on user experience in case of outages should be assessed, monitored, and alleviated
  • create multi-region deployments of the solution
  • load balance user requests across regions in case of service issues
  • use circuit breakers in clients to overcome broken connections
  • provide default/alternate functionality to users

Tags

#mlops #ml