SUMMARY
On February 17, 2022, from 13:47 to 14:30 UTC, Explore customers hosted in the US region database experienced downtime during which queries and dashboards failed to load for most of them.
Timeline
14:30 UTC | 06:30 PT
We are aware of Explore US customers having issues loading their dashboards. Investigation is underway.
14:40 UTC | 06:40 PT
We appreciate your patience while we work through issues related to Explore dashboard loading for some US-hosted customers. The team is seeing some improvement. We will share more details as soon as we have them.
14:50 UTC | 06:50 PT
We are happy to report that the Explore US issue with dashboard loading has been resolved. Thank you for your patience!
POST-MORTEM
Root Cause Analysis
This incident was caused by insufficient system capacity at the time of an internal update. The cluster was in the middle of downscaling, an activity that can take some time. During that interval, an aggressive scale-up of engine-background-workers caused the cluster's scaling activity to pause.
Also during this time, the management service went down and failed to reschedule itself on the cluster due to a combination of two factors: the cluster did not have nodes with sufficient memory, and the cluster could not scale out because its scaling activity had been paused.
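The two factors above can be illustrated with a minimal sketch. It is not the actual scheduler involved in this incident; the `Node` type, the `placement_outcome` function, and the memory figures are hypothetical and exist only to show why the management service could not be placed while scaling was paused.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Node:
    """A hypothetical cluster node with some free memory."""
    name: str
    free_memory_mb: int


def placement_outcome(required_memory_mb: int,
                      nodes: List[Node],
                      scaling_paused: bool) -> str:
    """Describe what an illustrative scheduler would do with a service.

    Returns "scheduled" if an existing node has enough free memory,
    "scale_out" if no node fits but the cluster may add capacity,
    and "unschedulable" if no node fits and scaling is paused.
    """
    # Factor 1: is there any node with sufficient free memory?
    if any(node.free_memory_mb >= required_memory_mb for node in nodes):
        return "scheduled"
    # Factor 2: without a fitting node, the only option is to scale out,
    # which is impossible while the scaling activity is paused.
    if not scaling_paused:
        return "scale_out"
    return "unschedulable"


if __name__ == "__main__":
    # Assumed values for illustration only.
    small_nodes = [Node("node-a", 512), Node("node-b", 1024)]
    print(placement_outcome(2048, small_nodes, scaling_paused=True))
```

In this sketch, either a larger node or an unpaused scaling activity would have been enough to place the service; the failure requires both conditions at once.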
Resolution
The system eventually recovered on its own by unfreezing and scaling up all the other tasks. The team also made small manual changes to bump the number of row machines loaded, but overall the system recovered without further action on our end.
Remediation Items
- Increase the minimum size of the cluster to reduce the chance of scheduling failure [Done]
- Create additional alerts to inform of such errors [Done]
- Add connection limits on specific applications to reduce autoscaling and prevent further issues [To Do]
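As an illustration of the last item, a connection limit can be sketched as a fixed pool of slots that an application must acquire before opening a connection. This is only one possible approach under stated assumptions, not the planned implementation; the `ConnectionLimiter` class and the limit of 2 are hypothetical.

```python
import threading


class ConnectionLimiter:
    """Caps the number of concurrent connections for one application."""

    def __init__(self, max_connections: int) -> None:
        if max_connections < 1:
            raise ValueError("max_connections must be at least 1")
        # BoundedSemaphore raises ValueError on an extra release,
        # which helps catch bookkeeping mistakes early.
        self._slots = threading.BoundedSemaphore(max_connections)

    def try_acquire(self) -> bool:
        """Take a slot without blocking; False means the limit is reached."""
        return self._slots.acquire(blocking=False)

    def release(self) -> None:
        """Return a slot once the connection is closed."""
        self._slots.release()


if __name__ == "__main__":
    limiter = ConnectionLimiter(max_connections=2)  # assumed limit
    results = [limiter.try_acquire() for _ in range(3)]
    print(results)  # the third attempt is rejected
```

Rejecting connections above a fixed cap keeps one application's demand bounded, so a burst is less likely to trigger an aggressive scale-up of workers.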
FOR MORE INFORMATION
For current system status information about your Zendesk, check out our system status page. During an incident, you can also receive status updates by following @ZendeskOps on Twitter. The summary of our post-mortem investigation is usually posted here a few days after the incident has ended. If you have additional questions about this incident, please log a ticket with us.