PROFESSIONAL-CLOUD-DEVOPS-ENGINEER Question #21: Real Exam Question with Answer & Explanation
The correct answer is A: Eliminate unactionable alerts. The unhealthy systems are restarted automatically within a minute, so these nighttime alerts need no human intervention; the SRE practice that prevents staff burnout is to stop alerting on them.
Question
You encounter a large number of outages in the production systems you support. You receive alerts for all the outages that wake you up at night. The alerts are due to unhealthy systems that are automatically restarted within a minute. You want to set up a process that would prevent staff burnout while following Site Reliability Engineering practices. What should you do?
Options
- A. Eliminate unactionable alerts.
- B. Create an incident report for each of the alerts.
- C. Distribute the alerts to engineers in different time zones.
- D. Redefine the related Service Level Objective so that the error budget is not exhausted.
Explanation
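In SRE practice, a page should require a prompt human response. These alerts fire for systems that are automatically restarted within a minute, so by the time an engineer is awake there is nothing left to do. Eliminating such unactionable alerts removes the cause of the burnout, and the outages themselves can still be recorded and reviewed during working hours.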
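One way to picture "eliminate unactionable alerts" is a routing rule that pages a human only when automation has not resolved the problem. The following is a minimal sketch, not a real alerting API: the `Outage` type, the `should_page` and `route` helpers, and the 60-second window are all hypothetical, with the window chosen to match the "restarted within a minute" behavior described in the question.

```python
from dataclasses import dataclass

# Assumed threshold: the question says unhealthy systems are restarted
# within a minute, so shorter outages are treated as self-healing.
AUTO_RECOVERY_WINDOW_S = 60


@dataclass
class Outage:
    service: str
    duration_s: float   # how long the system has been unhealthy so far
    auto_restart: bool  # whether automation is expected to restart it


def should_page(outage: Outage, window_s: float = AUTO_RECOVERY_WINDOW_S) -> bool:
    """Return True only when a human actually has something to do."""
    if not outage.auto_restart:
        # No automation covers this system, so the alert is actionable.
        return True
    # Automation exists: page only if it has failed to recover in time.
    return outage.duration_s > window_s


def route(outage: Outage) -> str:
    """Send actionable outages to the pager and record the rest for review."""
    return "page" if should_page(outage) else "log"
```

Under these assumptions, a 40-second outage on an auto-restarting service is only logged, while the same service staying unhealthy for five minutes still pages, because at that point the automation has evidently failed and a human is needed.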
Common mistakes.
- B. Creating an incident report for every self-resolving, non-actionable alert would significantly increase operational toil without adding value, exacerbating burnout.
- C. Distributing unactionable alerts to engineers in different time zones only spreads the burnout geographically without addressing the root cause of the excessive alerting.
- D. Redefining the SLO to accommodate the current error rate would simply mask the issue and might lower the quality of service, rather than addressing the problem of excessive, unactionable alerts.
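To see why option D only hides the problem, it can help to work out how much error budget short self-healing outages actually consume. The sketch below assumes a 99.9% availability SLO over a 30-day window; neither figure comes from the question, and the function names are hypothetical.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO over the window."""
    total_minutes = window_days * 24 * 60
    return (1.0 - slo) * total_minutes


def budget_consumed(outage_minutes: list[float], slo: float, window_days: int = 30) -> float:
    """Fraction of the error budget used by the listed outages."""
    return sum(outage_minutes) / error_budget_minutes(slo, window_days)


# Assumed example: twenty one-minute outages in a month against a 99.9% SLO.
budget = error_budget_minutes(0.999)
used = budget_consumed([1.0] * 20, 0.999)
print(f"budget: {budget:.1f} min, used: {used:.0%}")
```

With these assumed numbers the budget is roughly 43 minutes per 30 days, and twenty one-minute outages use a little under half of it. Loosening the SLO would enlarge the budget on paper, but the outages and the nighttime pages would continue unchanged.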
Concept tested. SRE alert management and toil reduction
Reference. https://sre.google/sre-book/monitoring-alerting/#alert-management