19:18 UTC | 12:18 PT
Tickets from social media are now caught up and we are not seeing any further queueing issues.
Additionally, queued Twitter tickets should now have been created.
We are making progress on ticket creation via social media on Pod 3. Queued Facebook and Google Play tickets should be created at this point.
17:02 UTC | 10:02 PT
We are continuing to investigate and monitor the service that processes social media queue messages. We will provide updates with any changes.
16:23 UTC | 9:23 PT
We are continuing to investigate and monitor the performance of the service that processes social media queue messages.
15:49 UTC | 8:49 PT
We've restarted the service responsible for queueing social media messages. We're continuing to investigate and monitor performance.
15:30 UTC | 8:30 PT
We are currently investigating reports of social media messages not converting into tickets on Pod 3.
POST-MORTEM

During this incident, customers on Pod 3 experienced a halt in social media and channel framework ticket and comment conversion. The channels queue grew at an increasing rate and backed up to the point that the conversion process stopped. As an initial precaution, we restarted the resque pool, which did not improve performance. Once we confirmed that the restart was ineffective, we used kill switches on channels to stop polling. The queue then recovered and conversion processing resumed. Once the queue size returned to normal, the first service incident was marked as resolved.

About two and a half hours later, our operations team noticed the issue recurring, and a second service incident was declared. We again used the kill switches, but even so, recovery was slow. Our operations team increased the number of resque workers in response, which resolved the queue size issue, and the system recovered. We also identified a few accounts with long-running queries, which we then blocked. Finally, we noticed delayed response times from one of our channel providers, which further contributed to the incident.

To prevent this from happening again, we will be working on more advanced logic to identify and mitigate channel queue bottlenecks, through better monitoring of queues, rate limiting of integration services, and load balancing of queues.
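The kill-switch and queue-monitoring behavior above is described only at a high level. As a rough illustration of the general pattern, and not Zendesk's actual implementation, the following Python sketch trips a per-channel kill switch when the queue depth crosses a threshold or grows steadily, then re-enables polling once the backlog drains. All class names, thresholds, and logic here are invented for illustration.

```python
from collections import deque


class ChannelQueueGuard:
    """A minimal sketch of channel kill-switch logic: stop polling a
    channel provider when the conversion queue backs up, and resume
    once the backlog drains. Names and thresholds are hypothetical."""

    def __init__(self, max_depth=10_000, resume_depth=1_000, window=5):
        self.max_depth = max_depth            # engage the kill switch above this depth
        self.resume_depth = resume_depth      # re-enable polling below this depth
        self.samples = deque(maxlen=window)   # recent queue-depth samples
        self.polling_enabled = True

    def observe(self, queue_depth):
        """Record a depth sample and flip the kill switch as needed.
        Returns True if polling should continue."""
        self.samples.append(queue_depth)
        recent = list(self.samples)
        # A full window of strictly increasing samples suggests the queue
        # is growing faster than the workers can drain it.
        growing = (len(recent) == self.samples.maxlen
                   and all(b > a for a, b in zip(recent, recent[1:])))
        if self.polling_enabled and (queue_depth > self.max_depth or
                                     (growing and queue_depth > self.resume_depth)):
            self.polling_enabled = False   # stop pulling new messages; let workers drain
        elif not self.polling_enabled and queue_depth < self.resume_depth:
            self.polling_enabled = True    # backlog drained; resume polling
        return self.polling_enabled


guard = ChannelQueueGuard()
for depth in (200, 900, 4_000, 12_000, 8_000, 600):
    state = "polling provider" if guard.observe(depth) else "kill switch engaged"
    print(f"queue depth {depth}: {state}")
```

Note the two separate thresholds: tripping at one depth and resuming at a much lower one adds hysteresis, so the switch does not flap on and off while the queue hovers near a single limit.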
FOR MORE INFORMATION
For current system status information about your Zendesk, check out our system status page. During an incident, you can also receive status updates by following @ZendeskOps on Twitter. The summary of our post-mortem investigation is usually posted here a few days after the incident has ended. If you have additional questions about this incident, please log a ticket with us.