Summary
On December 22, 2022, from 04:19 UTC to 07:47 UTC, Zendesk customers may have experienced delays in message delivery across multiple Social Messaging channels.
Timeline
06:46 UTC | 22:46 PT
We are aware of latency issues causing slowness in sending and receiving messages in Chat and Social Messaging channels. Investigations are underway, and we will provide another update when we have more to share. Thank you for your patience in the meantime.
07:15 UTC | 23:15 PT
We are still investigating latency issues that are slowing the sending and receiving of messages in Social Messaging channels on all pods. We are working to identify the root cause. Next update in 30 minutes.
07:44 UTC | 23:44 PT
The root cause is still under investigation, and latency issues are ongoing. Customers may continue to experience delays with Social Messaging channels on all pods for now. We will provide another update in 30 minutes and appreciate your continued patience.
08:16 UTC | 00:16 PT
We are seeing improvements in backend processing of messages for Social Messaging channels. End-user messages will be delayed in appearing in Zendesk, and agents’ messages that failed to send will not be re-sent, so please check those where needed.
08:41 UTC | 00:41 PT
We’re happy to confirm that the latency and deliverability issues related to Social Messaging channels have been fully resolved. Please make sure to clear your cache and cookies. Messages should now be processed and delivered correctly. We apologize for any inconvenience caused.
Root Cause Analysis
This incident was caused by insufficient capacity in a database server on Pod 25. The impacted server did not fail over to a working server, leading to the degradation experienced by our customers. The investigation was prolonged by a lack of logs and by additional anomalies that diverted attention from the correct resolution path.
Resolution
To fix this issue, our team initiated a database cluster upgrade to increase capacity.
Remediation Items
- Review and tune messaging application circuit breaker.
- Investigate server capacity auto-scaling.
- Add monitors and alerts for database CPU capacity.
- Update runbook to ensure smoother handling of future incidents.
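For readers unfamiliar with the circuit breaker mentioned in the first remediation item, the general pattern can be sketched as follows. This is a minimal illustration only, not Zendesk's implementation: the class name `CircuitBreaker`, the exception `CircuitOpenError`, and the `failure_threshold` and `reset_timeout` parameters are all hypothetical. After a configurable number of consecutive failures, the breaker rejects calls immediately instead of sending more work to a struggling dependency, then allows a single trial call once a cool-down period has passed.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    """Hypothetical minimal circuit breaker (illustrative sketch only)."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_timeout = reset_timeout          # seconds to wait before a trial call
        self.clock = clock                          # injectable for testing
        self.failures = 0
        self.opened_at = None                       # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Still cooling down: fail fast without touching the dependency.
                raise CircuitOpenError("circuit open: call rejected")
            # Cool-down elapsed: fall through and allow one trial call (half-open).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed trial call, or too many consecutive failures, (re)opens the circuit.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Tuning such a breaker largely means choosing the failure threshold and cool-down so that it trips early enough to protect an overloaded database without tripping on brief, harmless errors.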
For More Information
For current system status information about your Zendesk, check out our system status page. The summary of our post-mortem investigation is usually posted here a few days after the incident has ended. If you have additional questions about this incident, please log a ticket with us via ZBot Messaging within the Widget.
Post-mortem published January 6, 2023.