SUMMARY
The data tier for the Side Conversations system was transitioned to use Spinnaker for deployments. This involved starting a duplicate set of containers with more consistent labels, initially provisioned with a small number of replicas. On August 23, all production services were reconfigured to serve data from the new containers, and the system was observed to be stable and behaving as expected. On August 27, customers began reporting delays in the delivery of side conversations.
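For illustration only: the report does not include the actual deployment configuration, but a duplicate container set with more consistent labels and a small initial replica count might be described by something like the following Kubernetes-style spec, sketched here as a Python dict. All names, labels, and values are hypothetical.

# Minimal sketch, assuming a Kubernetes-style Deployment spec; the names,
# labels, and replica count below are illustrative, not the real configuration.
duplicate_data_tier = {
    "apiVersion": "apps/v1",
    "kind": "Deployment",
    "metadata": {
        "name": "side-conversations-data-v2",  # hypothetical name for the new container set
        "labels": {
            "app": "side-conversations",
            "tier": "data",
            "managed-by": "spinnaker",  # the "more consistent labels" mentioned above
        },
    },
    "spec": {
        "replicas": 2,  # the small initial replica count
        "selector": {"matchLabels": {"app": "side-conversations", "tier": "data"}},
    },
}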
Timeline
14:36 UTC | 07:36 PT
We are happy to report that the issue causing messages to be delayed in Side Conversations for customers on POD 17 has been resolved. Thank you for your patience!
13:56 UTC | 06:56 PT
We continue to work to remedy the cause of delayed messages in Side Conversations for customers on POD 17. More updates in an hour.
13:00 UTC | 06:00 PT
Our teams have identified the root cause of delayed messages in Side Conversations for customers on POD 17 and are working toward a resolution. More updates as we have them.
12:32 UTC | 05:32 PT
Our teams are investigating delays impacting Side Conversations for Support customers on POD 17. More info shortly.
Post-Mortem
Root Cause Analysis
This incident was caused by a configuration error in which the data services were provisioned with an insufficient number of replicas.
Resolution
Once the root cause was identified, we immediately increased the number of replicas. This returned response rates to normal, stopped the pattern of failing jobs, and allowed the workers to quickly catch up. The incident was declared all clear at 14:09 UTC.
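The report does not name the orchestration layer, but as a rough illustration, the following sketch shows how a deployment's replica count could be raised using the official Python Kubernetes client, assuming a Kubernetes-backed data tier. The deployment name, namespace, and target replica count are hypothetical.

# Minimal sketch, assuming a Kubernetes-backed deployment; the name, namespace,
# and replica count are illustrative only.
from kubernetes import client, config

def scale_deployment(name: str, namespace: str, replicas: int) -> None:
    """Patch a deployment's replica count to the given value."""
    config.load_kube_config()  # use config.load_incluster_config() when running inside the cluster
    apps = client.AppsV1Api()
    apps.patch_namespaced_deployment_scale(
        name=name,
        namespace=namespace,
        body={"spec": {"replicas": replicas}},
    )

if __name__ == "__main__":
    # Hypothetical values; the incident report does not disclose the real ones.
    scale_deployment("side-conversations-data", "production", replicas=6)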
Remediation Items
- Increase the number of replicas for the data services
- Improve metrics and monitors for CPU usage in this service (see the sketch below)
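As a rough illustration of the second remediation item, the sketch below polls host CPU load and logs a warning when it crosses a threshold. A production monitor would more likely query the service's metrics backend; the threshold and polling interval here are assumptions.

# Minimal sketch of a threshold-based CPU-load check, assuming a Unix host;
# real monitoring would feed the service's metrics and alerting pipeline.
import logging
import os
import time

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")

CPU_LOAD_THRESHOLD = 0.8     # assumed: warn when 1-minute load exceeds 80% of core count
CHECK_INTERVAL_SECONDS = 60  # assumed polling interval

def check_cpu_load() -> None:
    """Compare the 1-minute load average against available cores and log the result."""
    one_minute_load, _, _ = os.getloadavg()
    cores = os.cpu_count() or 1
    utilization = one_minute_load / cores
    if utilization > CPU_LOAD_THRESHOLD:
        logging.warning("CPU load high: %.0f%% of capacity (load=%.2f, cores=%d)",
                        utilization * 100, one_minute_load, cores)
    else:
        logging.info("CPU load normal: %.0f%% of capacity", utilization * 100)

if __name__ == "__main__":
    while True:
        check_cpu_load()
        time.sleep(CHECK_INTERVAL_SECONDS)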
FOR MORE INFORMATION
For current system status information about your Zendesk, check out our system status page. During an incident, you can also receive status updates by following @ZendeskOps on Twitter. The summary of our post-mortem investigation is usually posted here a few days after the incident has ended. If you have additional questions about this incident, please log a ticket with us.