Summary
On May 2, 2023, from 18:00 UTC to 18:30 UTC, a subset of Zendesk Chat customers experienced delays in sending and receiving chat messages, as well as delays in creating tickets after chat sessions were completed.
Timeline
19:01 UTC | 12:01 PT
Between 18:00 and 18:30 UTC on May 2, 2023, Chat customers may have seen delays sending and receiving messages, and delays creating tickets after finishing chats. Please let us know if you continue to experience any issues.
POST-MORTEM
Root Cause Analysis
This incident was caused by a node failure in our service provider’s caching infrastructure. Because the failure occurred during a peak period, the remaining healthy nodes were unable to absorb the overflow traffic, which led to CPU exhaustion on those nodes.
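The capacity reasoning above can be illustrated with a small headroom check. This is a minimal sketch, not a description of Zendesk's or the provider's tooling: the function names, the assumption that load redistributes evenly across surviving nodes, and the example figures are all hypothetical.

```python
def utilization_after_failures(node_count: int, utilization: float, failed: int = 1) -> float:
    """Per-node utilization on the surviving nodes.

    Assumption (hypothetical): every node carries the same load, and the
    failed nodes' share is spread evenly across the survivors.
    """
    if not 0 <= failed < node_count:
        raise ValueError("failed must be between 0 and node_count - 1")
    total_load = node_count * utilization
    return total_load / (node_count - failed)


def survives_failures(node_count: int, utilization: float,
                      failed: int = 1, ceiling: float = 1.0) -> bool:
    """True if the survivors stay at or below `ceiling` (1.0 = fully saturated)."""
    return utilization_after_failures(node_count, utilization, failed) <= ceiling


if __name__ == "__main__":
    # Illustrative numbers only; the postmortem does not state node counts or load.
    print(survives_failures(node_count=4, utilization=0.6))  # off-peak style load
    print(survives_failures(node_count=3, utilization=0.8))  # peak style load
```

Under these assumptions, a cluster that is comfortable off-peak can still saturate when it loses a node at peak, which matches the failure pattern described in the root cause.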
Resolution
To resolve this issue, a failover process was initiated at 18:02 UTC to bring a new, healthy node online. Shortly afterward, message workers began connecting to the new node, and service fully recovered by 18:30 UTC.
Remediation Items
- Improve resiliency of caching systems to limit future impact of vendor hardware failures.
- Increase monitoring and alerting around node failures.
- Explore alternative caching systems to understand their advantages.
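
As one illustration of the monitoring and alerting item above, the sketch below flags a cache node as failed after several consecutive unhealthy probes. It is a hypothetical example: the class name, the threshold, and the node label are assumptions, and the postmortem does not describe the monitoring that was actually put in place.

```python
from dataclasses import dataclass, field


@dataclass
class NodeFailureDetector:
    """Signals an alert after `threshold` consecutive failed health probes."""

    threshold: int = 3
    _misses: dict = field(default_factory=dict)  # node name -> consecutive failures

    def record(self, node: str, healthy: bool) -> bool:
        """Record one probe result; return True when an alert should fire."""
        if healthy:
            # Any healthy probe resets the streak for that node.
            self._misses[node] = 0
            return False
        self._misses[node] = self._misses.get(node, 0) + 1
        # Fire exactly once, on the probe that reaches the threshold.
        return self._misses[node] == self.threshold


if __name__ == "__main__":
    detector = NodeFailureDetector(threshold=2)
    for healthy in (True, False, False, False):
        if detector.record("cache-node-1", healthy):  # hypothetical node name
            print("ALERT: cache-node-1 failed consecutive health probes")
```

Requiring consecutive failures before alerting trades a short detection delay for fewer false alarms from a single dropped probe; the right threshold depends on the probe interval.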
FOR MORE INFORMATION
For current system status information about your Zendesk, check out our system status page. The summary of our post-mortem investigation is usually posted here a few days after the incident has ended. If you have additional questions about this incident, please log a ticket with us via ZBot Messaging within the Widget.
Postmortem published May 18, 2023.