SUMMARY
Between 02:56 UTC and 04:26 UTC, we detected WhatsApp message delivery delays for customers on Pods 19 and 23 following a service provider issue. The backlog of messages has now been delivered and services are back to normal.
Timeline
04:40 UTC | 21:40 PT
Between 02:56 UTC and 04:26 UTC, we detected WhatsApp message delivery delays for customers on Pods 19 and 23 following a service provider issue. The backlog of messages has now been delivered and services are back to normal. We apologize for any inconvenience caused by this issue today.
POST-MORTEM
Root Cause Analysis
Between 02:56 UTC and 04:26 UTC, our cloud provider experienced issues that prevented the SunCo workers from connecting to SQS and sending messages to the queue. Normally, the monolith would have reconnected to SQS once the SQS errors cleared (after a couple of minutes), but it was unable to do so because its prefetch slots had filled up.
Resolution
Once we found that the issue was with SQS and the SunCo workers, we tried restarting 2 of the workers. We immediately saw an improvement in the number of WhatsApp messages being processed and began restarting all of the remaining workers. A couple of minutes later, the entire message backlog had been processed and the additional latency for WhatsApp messages was gone.
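The staged restart described above (restart a small number of workers, confirm that throughput improves, then restart the rest) can be sketched as follows. This is an illustrative sketch only, not the tooling used during the incident: the function name, the restart and throughput callbacks, and the min_gain threshold are all hypothetical.

```python
from typing import Callable, List, Sequence


def staged_restart(
    workers: Sequence[str],
    restart: Callable[[str], None],      # hypothetical: restarts one worker
    throughput: Callable[[], float],     # hypothetical: messages processed per minute
    canary_count: int = 2,               # the incident used 2 workers as a first step
    min_gain: float = 1.2,               # assumed threshold for "an improvement"
) -> List[str]:
    """Restart a few workers first; restart the rest only if throughput improves.

    Returns the names of the workers that were restarted.
    """
    baseline = throughput()
    canaries = list(workers[:canary_count])
    remaining = list(workers[canary_count:])

    restarted: List[str] = []
    for name in canaries:
        restart(name)
        restarted.append(name)

    # A real rollout would wait for metrics to settle before sampling again.
    if not throughput() > baseline * min_gain:
        return restarted  # no clear improvement: stop and investigate further

    for name in remaining:
        restart(name)
        restarted.append(name)
    return restarted
```

Restarting a small subset first limits the impact if the restart does not help, while still confirming quickly whether the workers are the bottleneck.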
Remediation Items
1. Add additional error monitoring and logging
2. Improve runbook for restarting workers
3. Investigate cause of workers getting stuck
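As one possible shape for the additional monitoring in item 1, a check for workers that appear stuck might look like the sketch below. It assumes that each worker exposes its prefetch slot usage and the time since its last successful send. The WorkerStats fields, the function names, and the 120-second threshold are hypothetical and do not describe the actual SunCo metrics.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class WorkerStats:
    """Hypothetical per-worker metrics; the field names are assumptions."""
    name: str
    prefetch_slots: int             # total prefetch slots available to the worker
    slots_in_use: int               # slots currently holding unprocessed messages
    seconds_since_last_send: float  # time since the last successful send to the queue


def is_stuck(worker: WorkerStats, stall_after: float = 120.0) -> bool:
    """A worker looks stuck when every slot is full and nothing has been sent recently."""
    slots_full = worker.slots_in_use >= worker.prefetch_slots
    stalled = worker.seconds_since_last_send >= stall_after
    return slots_full and stalled


def find_stuck(workers: Iterable[WorkerStats], stall_after: float = 120.0) -> List[str]:
    """Return the names of workers that should be flagged for an alert or restart."""
    return [w.name for w in workers if is_stuck(w, stall_after)]
```

Requiring both conditions avoids flagging a busy but healthy worker whose slots are full while it is still sending messages.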
FOR MORE INFORMATION
For current system status information about your Zendesk, check out our system status page. The summary of our post-mortem investigation is usually posted here a few days after the incident has ended. If you have additional questions about this incident, please log a ticket with us via ZBot Messaging within the Widget.