SUMMARY
On June 4, 2021, from 17:29 UTC to 18:36 UTC, Zendesk saw increased error rates across Guide and Support on Pods 19 and 23. During the same period, some customers using Talk and Apps on all pods also saw increased error rates. After 18:36 UTC, errors dropped to healthy levels for all services except Talk, which recovered by 19:12 UTC. During the incident, customers on Pods 19 and 23 would have experienced bad gateway and timeout errors on Support and Guide, while some customers on all pods may have experienced dropped calls in Talk and Apps failing to load.
Timeline
20:36 UTC | 13:36 PT
The Pod 19 and 23 service degradation that also impacted Apps and Talk has been resolved and service has been restored.
20:09 UTC | 13:09 PT
The Pod 19 and 23 service degradation that also impacted Apps and Talk is stable and our team is currently monitoring. We'll update here again once an all-clear has been called.
19:16 UTC | 12:16 PT
We are happy to report that issues impacting Talk are in recovery and service is being restored. Our team will continue to monitor as they remediate a root cause.
18:49 UTC | 11:49 PT
We are seeing recovery in affected systems, but Talk remains intermittently unavailable and reports of inaccessibility and dropped calls are continuing. We are actively working on the issue to restore performance.
18:36 UTC | 11:36 PT
We have identified the issue and have made a configuration change. We are now seeing error rates drop and systems recover. We are continuing to monitor performance.
18:26 UTC | 11:26 PT
Customers on Pods 19 and 23 may experience 502 bad gateway errors and timeout errors. Customers on all pods may experience issues loading apps. We are actively working on the issue.
18:03 UTC | 11:03 PT
We are investigating errors for customers on Pods 19 and 23. More information to come.
Root Cause Analysis
As part of an ongoing effort to decommission unused infrastructure, some backend processes were terminated and removed, including a service discovery cluster that had been marked for decommissioning. Due to an unusual legacy configuration, the cluster was incorrectly thought to be isolated from other pods. In fact, Pods 19 and 23 and, to a lesser extent, all other pods had dependencies on it. Because of these dependencies, removing the cluster caused application errors and service degradation.
Resolution
To fix this issue, these backend processes were restored from a previous stable configuration.
Remediation Items
- Implement a rigorous procedure for decommissioning infrastructure: validate zero client traffic, stop instances and validate the result before removal, then remove gradually.
- Remove all remaining cross-region service discovery dependencies.
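The decommissioning procedure in the first remediation item can be illustrated with a small sketch. This is not Zendesk's tooling; it is a minimal Python example under stated assumptions. The names `Instance`, `decommission`, `client_connections`, and `healthy` are all hypothetical, and the two callbacks stand in for whatever traffic metrics and dependent-service health checks an operator actually has. The point it shows is the ordering of the gates: confirm zero client traffic, stop instances while they can still be restarted, validate, and only then remove in small batches.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Instance:
    """Hypothetical stand-in for one node of the cluster being retired."""
    name: str
    running: bool = True
    removed: bool = False


class DecommissionAborted(Exception):
    """Raised when a safety gate fails; nothing further is removed."""


def decommission(
    instances: List[Instance],
    client_connections: Callable[[], int],  # assumed: count of active clients
    healthy: Callable[[], bool],            # assumed: dependents' health check
    batch_size: int = 1,
) -> List[str]:
    log: List[str] = []

    # Gate 1: validate zero client traffic before touching anything.
    active = client_connections()
    if active > 0:
        raise DecommissionAborted(f"{active} client connection(s) still active")
    log.append("verified zero client traffic")

    # Gate 2: stop instances without removing them, so a restart is possible.
    for inst in instances:
        inst.running = False
    log.append("stopped all instances")

    # Validate before removal; roll back by restarting if dependents suffer.
    if not healthy():
        for inst in instances:
            inst.running = True
        raise DecommissionAborted("dependents unhealthy after stop; restarted")
    log.append("dependents healthy after stop")

    # Gate 3: gradual removal, re-checking health after every batch.
    for start in range(0, len(instances), batch_size):
        batch = instances[start:start + batch_size]
        for inst in batch:
            inst.removed = True
        log.append("removed " + ", ".join(i.name for i in batch))
        if not healthy():
            raise DecommissionAborted("dependents unhealthy during removal")

    return log


if __name__ == "__main__":
    nodes = [Instance("sd-1"), Instance("sd-2"), Instance("sd-3")]
    for line in decommission(nodes, lambda: 0, lambda: True, batch_size=2):
        print(line)
```

Stopping before removing is the key design choice in this sketch: a stopped instance can be restarted in seconds if a hidden dependency surfaces, whereas a removed one has to be restored from a previous configuration.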
FOR MORE INFORMATION
For current system status information about your Zendesk, check out our system status page. During an incident, you can also receive status updates by following @ZendeskOps on Twitter. The summary of our post-mortem investigation is usually posted here a few days after the incident has ended. If you have additional questions about this incident, please log a ticket with us.