SUMMARY
On December 14, 2020 from 15:17 UTC to 23:30 UTC, Zendesk Support, Chat and Guide customers on Pod 19 experienced various issues such as incorrect View counts in Support, failed chat sessions in Agent Workspace and delays in processing Knowledge and ticket events in Guide. These symptoms were isolated to specific times within the incident window.
TIMELINE
17:40 UTC | 09:40 PT
We have identified an issue impacting both Views and Agent Workspace for customers on Pods 18 and 19. A fix is in place, and we are currently monitoring the results. More information to come as we learn more.
18:20 UTC | 10:20 PT
Our teams are still hard at work on resolving the issues impacting Pod 18 and 19 customers using Chat for Agent Workspace and Views. We are seeing some improvements, and we'll continue to provide updates as soon as we have them. We apologize for the inconvenience!
19:15 UTC | 11:15 PT
Our teams are still hard at work on resolving the issues impacting Pod 18 and 19 customers using Chat for Agent Workspace and Views. We are seeing some improvements, and we'll continue to provide updates as soon as we have them. We apologize for the inconvenience!
20:18 UTC | 12:18 PT
Our engineering team has deployed the fix for Views, which should now be loading. Our teams are continuing to work on the issue with accepting chats. We will post an update when we have more information.
21:24 UTC | 13:24 PT
We're continuing to work on the issue impacting customers' abilities to accept incoming chats. We've added more resources and optimizations and are seeing some improvements, but we're working toward a permanent fix. Thank you for your patience!
23:34 UTC | 15:34 PT
Our team is seeing that services related to accepting chats in the Agent Workspace are now being restored to normal functionality. Please let us know if you find this is not the case.
00:53 UTC | 16:53 PT
The issues with Views and Chat in Agent Workspace on Pods 18 and 19 have stabilized. We will provide a final update once the issue is fully resolved. Thanks again for your patience while we worked through today's issues.
03:25 UTC | 19:25 PT
We are happy to report that the issues impacting Views and Chat in Agent Workspace have been resolved. Further analysis reveals Pod 18 was not impacted. Please let us know if you still experience any issues. Thank you for your patience.
POST-MORTEM
Root Cause Analysis
This incident was caused by a failure in a messaging system used by multiple services on Pod 19. Because of a bug, an internal bootstrapping script failed to remount the existing storage volume when a node was replaced. As a result, an empty volume was attached to the new node and a large amount of data had to be synced across the cluster. As partitions finished syncing and the new node took back leadership of them, the cluster reached its I/O capacity. This led to some rejected client requests and to delays in processing the backlog of messages for downstream services.
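The failure mode above, a replacement node starting on an empty volume, can be illustrated with a pre-start guard. This is only a minimal sketch and not the bootstrapping script involved in the incident: the function names and the idea of an `expect_existing_data` flag are assumptions made for illustration.

```python
import os


def data_dir_is_populated(path: str) -> bool:
    """Return True if `path` is a directory containing at least one entry."""
    try:
        with os.scandir(path) as entries:
            return any(True for _ in entries)
    except (FileNotFoundError, NotADirectoryError):
        # A missing mount point is treated the same as an empty volume.
        return False


def safe_to_join_cluster(path: str, expect_existing_data: bool) -> bool:
    """Hypothetical guard run before the broker process is started.

    A replacement node (expect_existing_data=True) should find the data
    that lived on the previous node's volume. If the directory is empty,
    the remount most likely failed and the node should not join.
    """
    if not expect_existing_data:
        return True  # brand-new nodes legitimately start empty
    return data_dir_is_populated(path)
```

A guard like this would stop the node before it joins the cluster, so a failed remount shows up as a bootstrap error rather than as a full data sync.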
Resolution
To fix this issue, our engineers applied throttling to the new node and moved cluster leadership to another node until the syncing process was complete.
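The throttling half of this fix can be pictured as a token bucket that caps how many bytes per second a syncing node may pull. The sketch below is a generic illustration, not the mechanism the engineers used; the class name, parameters, and injectable clock are all assumptions. It does not cover moving cluster leadership.

```python
import time


class TokenBucket:
    """Generic token bucket limiting throughput in bytes per second."""

    def __init__(self, rate_bytes_per_s: float, burst_bytes: float,
                 clock=time.monotonic):
        self.rate = rate_bytes_per_s    # steady-state refill rate
        self.capacity = burst_bytes     # maximum tokens that can accumulate
        self.tokens = burst_bytes       # start with a full bucket
        self.clock = clock              # injectable so the logic is testable
        self.last = clock()

    def _refill(self) -> None:
        now = self.clock()
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now

    def try_consume(self, nbytes: float) -> bool:
        """Return True and deduct tokens if `nbytes` may be sent now."""
        self._refill()
        if nbytes <= self.tokens:
            self.tokens -= nbytes
            return True
        return False
```

A sync loop would call `try_consume` before each transfer and wait briefly when it returns False, which keeps replication traffic from using all of the cluster's I/O capacity.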
Remediation Items
- Create a monitor that alerts when a message broker is found to have “no data” and escalates immediately. [Completed]
- Improve the Chat event processor when restarting with a large backlog to shorten the downtime. [Scheduled]
- Implement improved error handling to remove failsafe mechanisms from the Chat system that result in duplicate missed chat tickets. [Scheduled]
- Increase Chat event processor observability to identify instances in a stale state or with error-prone configurations. [Scheduled]
- Implement broker rate limiting during node replacement. [Scheduled]
- Diagnose the fault in the EBS remounter and improve its functionality. [Scheduled]
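As an illustration of the first remediation item, a "no data" check could look like the sketch below. It assumes that "no data" means a broker either reports no metric at all or reports fewer stored bytes than a threshold. The function name, the input shape, and the threshold are hypothetical and not taken from the monitoring system actually used.

```python
from typing import List, Mapping, Optional


def brokers_with_no_data(reported_bytes: Mapping[str, Optional[int]],
                         min_bytes: int = 1) -> List[str]:
    """Return broker names that should trigger an immediate escalation.

    `reported_bytes` maps a broker name to the bytes it reports on disk,
    or None when the broker sent no metric at all (assumed input shape).
    """
    return sorted(
        name
        for name, size in reported_bytes.items()
        if size is None or size < min_bytes
    )
```

For example, a snapshot of `{"broker-1": 5000, "broker-2": 0, "broker-3": None}` flags `broker-2` and `broker-3`.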
FOR MORE INFORMATION
For current system status information about your Zendesk, check out our system status page. During an incident, you can also receive status updates by following @ZendeskOps on Twitter. The summary of our post-mortem investigation is usually posted here a few days after the incident has ended. If you have additional questions about this incident, please log a ticket with us.