SUMMARY
On September 22, 2022, from 14:00 UTC to 16:25 UTC, the majority of Zendesk customers using Sunshine Conversations may have experienced delays in messaging.
Timeline
14:31 UTC | 07:31 PT
We are aware of an issue that is causing delays in Messaging from Sunshine Conversations on Pod 23. We are working to understand the cause and will keep you updated as we learn more. Thank you for your patience, and we apologize for the inconvenience.
14:44 UTC | 07:44 PT
We’re continuing to investigate the cause of delays in Messaging from Sunshine Conversations on Pod 23. We will provide another update in 30 minutes or when we have more information to share.
15:14 UTC | 08:14 PT
Our team is still investigating the issue that is causing delays in Messaging from Sunshine Conversations on Pod 23. We will provide further updates when we have additional information to share.
15:35 UTC | 08:35 PT
We are happy to report that the delay in Messaging through Sunshine Conversations has been resolved. This incident impacted customers whose Sunshine Conversations account is hosted on Pod 23. We are continuing to monitor and will provide updates as they become available.
16:31 UTC | 09:31 PT
The incident that caused a delay to Messaging through Sunshine Conversations is now fully resolved. Thank you for your patience and collaboration.
POST-MORTEM
Root Cause Analysis
An automated system management process that applies critical patches from our cloud networking provider caused a degraded, out-of-memory state that intermittently prevented Sunshine Conversations from publishing messages to, or consuming messages from, its queues.
Resolution
An automated rolling restart of the system allowed Sunshine Conversations to begin processing the message backlog, and increased capacity raised the rate at which messages could be processed.
Remediation Items
- Investigate the abnormal memory usage upon rolling restarts of the queuing system cluster
- Follow up with our cloud provider on why the memory exhaustion occurred on one of the remaining nodes in the cluster
- Investigate the reconnection logic issue faced by a small subset of application workloads when the queuing system is degraded by out-of-memory events
- Adjust maintenance windows so that scheduled events occur outside local business hours
FOR MORE INFORMATION
For current system status information about your Zendesk, check out our system status page. The summary of our post-mortem investigation is usually posted here a few days after the incident has ended. If you have additional questions about this incident, please log a ticket with us via ZBot Messaging within the Widget.
Post-Mortem published September 29, 2022