23:45 UTC | 16:45 PT
We are happy to announce the social media issues impacting POD 3 are now resolved.
22:56 UTC | 15:56 PT
We are continuing to dig into the root cause of the queue issue affecting social media tickets. You may already notice some improvement.
22:26 UTC | 15:26 PT
We’re still narrowing down the cause of the queue issue. You may see delays in social media ticket creation on POD 3 for the time being.
21:58 UTC | 14:58 PT
We’re still investigating the root cause affecting the queue delaying social media ticket creation on POD 3. More information to come.
21:33 UTC | 14:33 PT
We are investigating a recurrence of the queue backlog affecting the creation of social media tickets on POD 3.
During this incident, customers on POD 3 experienced a halt in the conversion of social media and channel framework tickets and comments. The channels queue grew at an increasing rate and backed up to the point that the conversion process stopped. As an initial precaution, we restarted the Resque pool, which did not improve performance. Once we confirmed that the restart was ineffective, we used kill switches on channels to stop polling. The queue then recovered and conversion processing resumed. Once the queue size returned to normal, the first service incident was marked as resolved.

About two and a half hours later, our operations team noticed the issue recurring, and a second service incident was called. We used the kill switches again, but this time recovery was slow. In response, our operations team increased the number of Resque workers, which resolved the queue size issue and allowed the system to recover. We also identified a few accounts with long-running queries, which we then blocked. Finally, we noticed delayed response times from one of our channel providers, which further contributed to the incident.

To prevent a recurrence, we will be working on more advanced logic to identify and mitigate channel queue bottlenecks through better monitoring of queues, rate limiting of integration services, and load balancing of queues.
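
As a rough illustration of the kind of queue monitoring described above, the following Python sketch stops polling when the queue depth grows at an increasing rate or crosses a ceiling. It is not the production implementation: the class name `QueueGrowthMonitor`, the `max_size` ceiling, and the four-sample window are all hypothetical values chosen for the example.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class QueueGrowthMonitor:
    """Hypothetical monitor that trips a kill switch on runaway queue growth."""

    max_size: int                 # assumed hard ceiling on queue depth
    window: int = 4               # assumed number of recent samples to inspect
    samples: deque = field(default_factory=deque)
    polling_enabled: bool = True  # False means the kill switch is engaged

    def record(self, size: int) -> bool:
        """Record one queue-depth sample and return whether polling stays enabled."""
        self.samples.append(size)
        if len(self.samples) > self.window:
            self.samples.popleft()
        if size >= self.max_size or self._accelerating():
            self.polling_enabled = False  # stop polling channels
        return self.polling_enabled

    def reset(self) -> None:
        """Re-enable polling once an operator confirms the queue has drained."""
        self.samples.clear()
        self.polling_enabled = True

    def _accelerating(self) -> bool:
        """True when a full window shows growth that increases with every sample."""
        if len(self.samples) < self.window:
            return False
        sizes = list(self.samples)
        deltas = [b - a for a, b in zip(sizes, sizes[1:])]
        return all(d > 0 for d in deltas) and all(
            later > earlier for earlier, later in zip(deltas, deltas[1:])
        )
```

In this sketch, steady growth alone does not trip the switch. Only accelerating growth or a breach of the ceiling does, and polling stays off until `reset()` is called, mirroring the manual recovery step in the incident.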
FOR MORE INFORMATION
For current system status information about your Zendesk, check out our system status page. During an incident, you can also receive status updates by following @ZendeskOps on Twitter. The summary of our post-mortem investigation is usually posted here a few days after the incident has ended. If you have additional questions about this incident, please log a ticket with us.