You work for a small company that has deployed an ML model with autoscaling on Vertex AI to serve online predictions in a production environment. The current model receives about 20 prediction requests per hour with an average response time of one second. You have retrained the same model on a new batch of data, and now you are canary testing it, sending ~10% of production traffic to the new model. During this canary test, you notice that prediction requests for your new model are taking between 30 and 180 seconds to complete. What should you do?
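To put the traffic figures in perspective, here is a back-of-the-envelope sketch. It assumes the ~10% canary split applies uniformly to all incoming requests. Under that assumption the new model sees only about two requests per hour, so every latency observation for it comes from a very small sample.

```python
# Figures taken from the question; the ~10% split is treated as exact here.
TOTAL_REQUESTS_PER_HOUR = 20
CANARY_FRACTION = 0.10

def canary_requests_per_hour(total_per_hour: float, fraction: float) -> float:
    """Expected number of requests per hour routed to the canary model."""
    return total_per_hour * fraction

def mean_minutes_between_requests(requests_per_hour: float) -> float:
    """Average gap between consecutive canary requests, in minutes."""
    return 60.0 / requests_per_hour

rate = canary_requests_per_hour(TOTAL_REQUESTS_PER_HOUR, CANARY_FRACTION)
gap = mean_minutes_between_requests(rate)
print(f"canary receives ~{rate:.0f} requests/hour, one every ~{gap:.0f} minutes")
```

Running it prints "canary receives ~2 requests/hour, one every ~30 minutes".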


Accepted Answer

C. Remove your new model from the production environment. Compare the new model and existing …

The new model's prediction latency (30-180 seconds) is drastically higher than the existing model's (about one second), which points to a severe performance regression or another critical issue. The safest immediate action is to remove the problematic model from production to prevent further user impact, then investigate and compare the two models to identify the root cause.
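The "remove first, then compare" decision can be sketched as a simple latency gate. This is a minimal illustration, not a Vertex AI feature: the helper `should_roll_back`, the 2x median threshold, and the latency samples are all hypothetical. The baseline samples mirror the ~1 second average from the question, and the canary samples fall inside the observed 30-180 second range.

```python
import statistics

# Hypothetical latency samples in seconds (illustrative only).
baseline_latencies = [0.9, 1.0, 1.1, 1.0, 0.95]
canary_latencies = [30.0, 95.0, 180.0]

def should_roll_back(baseline, canary, max_ratio=2.0):
    """Return True if the canary's median latency exceeds max_ratio times the baseline's.

    max_ratio is an assumed tolerance; pick one that matches your latency budget.
    """
    baseline_median = statistics.median(baseline)
    canary_median = statistics.median(canary)
    return canary_median > max_ratio * baseline_median

if should_roll_back(baseline_latencies, canary_latencies):
    print("roll back: remove the canary from production, then investigate offline")
else:
    print("keep the canary running")
```

With these samples the canary median (95 s) is far above twice the baseline median (1 s), so the gate recommends rolling back. Comparing medians rather than means keeps one unusually slow request from dominating such a small sample.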

A. Submit a request to raise your project quota to ensure that multiple prediction services can run …

Quota issues typically show up as deployment failures or unavailable resources, not as extremely high per-request latency on a small share of traffic, especially when the existing model continues to respond normally.

B. Turn off auto-scaling for the online prediction service of your new model. Use manual scaling with …

Turning off autoscaling and using manual scaling with more instances might ease throughput issues, but it will not fix high per-request latency if the model itself is slow or inefficient, and it increases cost.
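A rough utilization check supports the point that this is a per-request latency problem rather than a capacity problem. The sketch assumes each replica serves one request at a time and that canary requests arrive evenly at ~10% of the 20 requests per hour. Even at the worst observed 180 seconds per request, a single replica would be busy only about 10% of the time, so extra instances would have very little queueing to absorb.

```python
# Figures from the question; 180 s is the worst observed canary latency.
TOTAL_REQUESTS_PER_HOUR = 20
CANARY_FRACTION = 0.10
WORST_CASE_SECONDS_PER_REQUEST = 180

def single_replica_utilization(requests_per_hour, seconds_per_request):
    """Offered load on one replica: arrival rate times service time (rho = lambda * S).

    Assumes a replica handles one request at a time (an assumption, not a Vertex AI default).
    """
    return requests_per_hour * seconds_per_request / 3600.0

canary_rate = TOTAL_REQUESTS_PER_HOUR * CANARY_FRACTION  # ~2 requests/hour
rho = single_replica_utilization(canary_rate, WORST_CASE_SECONDS_PER_REQUEST)
print(f"worst-case utilization of one replica: {rho:.0%}")
```

Running it prints "worst-case utilization of one replica: 10%". A utilization that low means requests rarely wait for one another, so adding replicas would not shorten the 30-180 second responses.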

D. Remove your new model from the production environment. For a short trial period, send all …

Sending all traffic to a model with 30-180 second latency would severely degrade the user experience; this is a high-risk action that would make the problem worse rather than solve it.