SUMMARY
On December 29, 2022, from 14:53 UTC to 16:26 UTC, customers using Sunshine Conversations, Messaging in Agent Workspace, Answer Bot, and Support Widget across all Pods experienced delays with incoming and outgoing messages.
Timeline
15:58 UTC | 07:58 PT
We are aware of, and currently working on resolving, issues with delayed incoming and outgoing messages on all messaging channels, as well as delayed widget loading and Answer Bot failing to deliver and receive messages. We will share more information as we have it.
16:22 UTC | 08:22 PT
We appreciate your patience while we continue working to fully resolve the issue affecting customers across all Pods with delays in Messaging, Answer Bot, and Widget in Support and Sunshine Conversations. We are seeing improvements in the loading of messages.
16:54 UTC | 08:54 PT
We are happy to report that the issue causing delays in Messaging, Answer Bot, Support Widget, and Sunshine Conversations has been resolved. Our backlog has caught up and no messages were lost. Thank you for your patience.
POST-MORTEM
Root Cause Analysis
This incident was caused by degradation of the server management and monitoring cluster tool that hosts the WhatsApp message queue, which left it unable to process those messages. On December 27, the node reached high memory usage and was automatically restarted. The tool's support team later confirmed that the node had not fully recovered after that restart. The console on our end, however, showed it as fully recovered, which made the problem invisible to us at first. It became noticeable only when the jobs queue started growing and we received alerts for it.
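The gap described above, a console reporting a healthy node while the queue behind it keeps growing, can be illustrated with a small check that does not rely on the reported status alone. This is a minimal sketch under stated assumptions, not the monitoring that was actually in place: the `QueueSample` type, the `"running"` status string, and the three-sample window are all hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class QueueSample:
    """One periodic observation of a job queue (hypothetical shape)."""
    depth: int      # jobs waiting in the queue at sample time
    processed: int  # jobs completed since the previous sample


def is_effectively_healthy(
    reported_status: str,
    samples: Sequence[QueueSample],
    window: int = 3,
) -> bool:
    """Combine the console's reported status with observed queue behavior.

    A node is treated as unhealthy if the console says so, or if the
    backlog grew across every recent sample while fewer jobs were
    processed than were added to the backlog.
    """
    if reported_status != "running":
        return False
    if len(samples) < window:
        # Not enough evidence to contradict the reported status.
        return True

    recent = list(samples[-window:])
    growing = all(b.depth > a.depth for a, b in zip(recent, recent[1:]))
    growth = recent[-1].depth - recent[0].depth
    processed = sum(s.processed for s in recent[1:])
    return not (growing and processed < growth)
```

A check of this kind would flag a node whose status reads as running but whose backlog climbs steadily with almost no throughput, which is the situation that went unnoticed between December 27 and December 29.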
Resolution
To fix this issue, we first redeployed the node version. This helped somewhat, as workers were able to process a small number of message dispatch jobs, but their capacity was quickly filled again by faulty WhatsApp jobs. Another node was consuming high memory, so we restarted it as well, after which the console showed that node as not running. At that point, we contacted the server management and monitoring tool support team, who had to restart both nodes again on their end. We then started to see a full recovery, and the backlog was completely processed at 16:26 UTC.
Remediation Items
- Upgrade server versions.
- Review monitoring alerts.
- Prevent server management and monitoring cluster issues from impacting the whole platform.
- Improve logging for messaging clients in Sunshine Conversations when the messaging cluster is not reachable.
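One common way to keep a single faulty job type from filling all worker capacity is to cap how many workers each channel may occupy at once. The sketch below illustrates that general idea only. It is not a description of the planned remediation, and the class name, channel names, and limits are hypothetical.

```python
from collections import Counter
from typing import List, Optional, Tuple


class ChannelLimitedDispatcher:
    """Hands out jobs while capping in-flight jobs per channel."""

    def __init__(self, total_workers: int, max_per_channel: int) -> None:
        self.total_workers = total_workers
        self.max_per_channel = max_per_channel
        self.pending: List[Tuple[str, str]] = []  # (channel, job) in arrival order
        self.in_flight: Counter = Counter()       # channel -> jobs being processed

    def submit(self, channel: str, job: str) -> None:
        self.pending.append((channel, job))

    def next_job(self) -> Optional[Tuple[str, str]]:
        """Return the oldest job whose channel is under its cap, if any."""
        if sum(self.in_flight.values()) >= self.total_workers:
            return None
        for i, (channel, job) in enumerate(self.pending):
            if self.in_flight[channel] < self.max_per_channel:
                self.pending.pop(i)
                self.in_flight[channel] += 1
                return channel, job
        return None

    def done(self, channel: str) -> None:
        """Mark one job of the given channel as finished."""
        if self.in_flight[channel] > 0:
            self.in_flight[channel] -= 1
```

With four workers and a cap of two per channel, a burst of slow jobs on one channel can occupy at most two workers, leaving the remaining capacity available to other channels.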
FOR MORE INFORMATION
For current system status information about your Zendesk, check out our system status page. The summary of our post-mortem investigation is usually posted here a few days after the incident has ended. If you have additional questions about this incident, please log a ticket with us via ZBot Messaging within the Widget.
Post-mortem published January 16, 2023.