SUMMARY
On July 7, 2020 from 06:00 UTC to 07:13 UTC, some Zendesk Support, Sunshine, Chat, Guide, Sell and Explore customers on Pod 25 experienced service degradation or server errors. In addition, Explore datasets were stale during this time and may not have been refreshed until 16:39 UTC.
Timeline
16:57 UTC | 09:57 PT
We have identified the source of the issue and the Support, Chat, Guide, and Explore products should now be functioning normally.
14:17 UTC | 07:17 PT
We are still monitoring Explore stability to ensure that all services have returned to normal.
12:55 UTC | 05:55 PT
We are observing improvements in Explore performance. More updates to follow.
11:09 UTC | 04:09 PT
We are observing stability on Support, Chat and Guide. We are continuing to monitor Explore closely and are still working towards a full resolution.
10:00 UTC | 03:00 PT
We are continuing to monitor to ensure all services have returned to normal.
09:00 UTC | 02:00 PT
We’ve identified the issue, and things are improving for Support, Chat and Guide. We’re continuing to monitor.
08:14 UTC | 01:14 PT
Our engineers are continuing to work towards a fix. We’re seeing improvements, but still monitoring.
07:20 UTC | 00:20 PT
Our team continues to investigate access issues in Support, Guide, and Chat across multiple Pods. We appreciate your patience while we work on this. More to come...
06:44 UTC | 23:44 PT
We're investigating issues accessing Support on Pod 25. Our team is monitoring as things improve; we will provide more information soon.
Root Cause Analysis
The root cause of this incident was a misconfiguration in a scheduled service migration to an optimized routing gateway, which caused an unexpected increase in traffic to the Zendesk authentication service on Pod 25. The misconfiguration, combined with traffic growth on Pod 25, pushed the authentication service to its scale limit, causing it to fail over and a percentage of requests to fail.
Resolution
This issue was fixed by scaling up infrastructure, redeploying the authentication service, and correcting the configuration of the routing gateway to reduce the traffic sent to the authentication service. Explore processing was paused and restarted to sync accounts and restore the expected data refresh.
Remediation Items
- Increased capacity for our authentication service on Pod 25.
- Simplify authorization configuration within our routing gateway.
- Investigate rate limiting for the authentication service to keep it from being overwhelmed.
- Created additional monitors and alerts for authentication service errors, logged responses, and incoming traffic volume.
- Investigate ways to increase the resiliency of Explore infrastructure so it can handle higher error rates.
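As an illustration of the rate-limiting item above, the following is a minimal sketch of a token-bucket limiter, one common way to cap the request rate reaching a service. It is not Zendesk's implementation: the class name TokenBucket, its parameters (rate, capacity, cost), and the injectable clock are all assumptions made for this example.

```python
import time
from typing import Callable


class TokenBucket:
    """Hypothetical token-bucket limiter (illustrative only).

    Allows bursts of up to `capacity` requests and refills at
    `rate` tokens per second.
    """

    def __init__(
        self,
        rate: float,
        capacity: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        if rate <= 0 or capacity <= 0:
            raise ValueError("rate and capacity must be positive")
        self.rate = rate
        self.capacity = capacity
        self._clock = clock          # injectable so the limiter can be tested deterministically
        self._tokens = capacity      # start with a full bucket
        self._last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        """Return True if a request costing `cost` tokens may proceed."""
        now = self._clock()
        elapsed = now - self._last
        self._last = now
        # Refill for the time elapsed, never exceeding the bucket capacity.
        self._tokens = min(self.capacity, self._tokens + elapsed * self.rate)
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False  # the caller would reject or queue the request
```

In a sketch like this, a gateway would call allow() before forwarding each request to the protected service and reject or queue the request when it returns False, so that a traffic spike is shed at the edge rather than pushing the service past its scale limit.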
FOR MORE INFORMATION
For current system status information about your Zendesk, check out our system status page. During an incident, you can also receive status updates by following @ZendeskOps on Twitter. The summary of our post-mortem investigation is usually posted here a few days after the incident has ended. If you have additional questions about this incident, please log a ticket with us.
Postmortem published July 13, 2020.