Update (04:36 UTC | 21:36 PT)
We have restored our Chat service for all customers and have re-enabled the previous chat features that were disabled.
Update (21:00 UTC | 14:00 PT)
We have restored our core Chat service for all customers but have not yet identified the root cause of this incident. To improve stability while our investigation continues, the following chat features have been disabled for all customers:
- Real Time Dashboard
- Real Time APIs
- Conversion Tracking
- Skills-based routing features
We have implemented additional monitoring, staffing, and failover criteria to mitigate the risk of service instability until the root cause is identified. We will provide additional status updates on progress toward full resolution and re-enablement of features in this Help Center article (http://zdsk.co/2t3I0fy). We apologize for this disruption in service and are committed to restoring all chat features as quickly as possible.
04:36 UTC | 21:36 PT
We’re happy to report that the issues affecting the chat services have been resolved.
03:01 UTC | 20:01 PT
We are still investigating the issue affecting the Zendesk Chat Widget. More updates to follow. Thank you for your patience.
02:04 UTC | 19:04 PT
We are currently experiencing issues with the Zendesk Chat Widget not appearing. We are investigating further. Updates to follow.
19:24 UTC | 12:24 PT
We are seeing improvement across all customers for the Chat issues. We're continuing to monitor stability. If the issue persists, log out and back in.
18:50 UTC | 11:50 PT
We are seeing positive movement in the number of accounts that can access chat. We continue to work towards stabilization for all customers.
18:19 UTC | 11:19 PT
We are actively working to bring infrastructure components back online. Currently some customers are active but may not be stable.
17:41 UTC | 10:41 PT
We are still working to narrow down the root cause. A small segment of customers may see improvements while we troubleshoot and monitor.
17:04 UTC | 10:04 PT
We are narrowing in on the root cause and working to mitigate. Please stay tuned for further updates. We thank you kindly for your patience.
16:17 UTC | 09:17 PT
We continue to work to mitigate the current Zendesk Chat concerns. Please stay tuned for further updates. We appreciate your patience!
15:34 UTC | 08:34 PT
We are in the process of re-enabling the processes responsible for logging into Chat. We will update in 30m or if there are any changes.
14:58 UTC | 07:58 PT
Our Operations team is working hard to bring the Zendesk Chat platform back online. We apologize for the inconvenience.
14:38 UTC | 07:38 PT
Due to the mitigation efforts, we will be initiating unscheduled maintenance on Zendesk Chat, which will affect performance.
14:24 UTC | 07:24 PT
We have identified the problem with Zendesk Chat and are working on deploying a fix. Thank you for your continued patience.
13:53 UTC | 06:53 PT
We are investigating increased traffic on our Zendesk Chat connection servers and are continuing to mitigate. More information to follow.
13:18 UTC | 06:18 PT
We are working on a resolution for the Zendesk Chat issues. More info to follow. Thank you for your patience.
12:56 UTC | 05:56 PT
We are progressing with our investigation on the issues with the Zendesk Chat Widget. More info to follow.
12:40 UTC | 05:40 PT
We are currently experiencing issues with the Zendesk Chat Widget. We are investigating further. Updates to follow.
A software data store relied on by Zendesk Chat experienced spikes in latency, which resulted in a significant disruption of service. Investigation during the incident and rollback of recent changes did not identify a root cause or resolve the issue, so the team executed a controlled shutdown and restart of the service in batches of accounts while monitoring the latency to the data store. While the root cause of the problem wasn't identified, additional monitoring was put in place to assist further investigation.
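For illustration only, a batched restart gated on data store latency could be sketched as follows. This is not the tooling used during the incident: the function names, the `restart` and `read_latency_ms` callbacks, and the latency threshold are all hypothetical, and a real procedure would also need backoff, retries, and operator confirmation.

```python
from typing import Callable, Iterable, List, Sequence


def batches(items: Sequence[str], size: int) -> Iterable[Sequence[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def restart_in_batches(
    accounts: Sequence[str],
    batch_size: int,
    restart: Callable[[str], None],          # hypothetical: restarts one account
    read_latency_ms: Callable[[], float],    # hypothetical: current data store latency
    max_latency_ms: float,                   # hypothetical threshold
) -> List[str]:
    """Restart accounts batch by batch, stopping if latency exceeds the limit.

    Latency is checked before each batch. Returns the accounts that were
    restarted, in order.
    """
    restarted: List[str] = []
    for batch in batches(accounts, batch_size):
        if read_latency_ms() > max_latency_ms:
            break  # leave remaining accounts offline until latency recovers
        for account in batch:
            restart(account)
            restarted.append(account)
    return restarted
```

Checking latency before each batch, rather than once up front, is what keeps the restart "controlled": a batch that pushes latency over the threshold prevents the next one from starting.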
During the second event, the team was able to determine that the software data store was the source of the latency. The reason for the latency within this system is still not known, but we are actively working with AWS to address it.
Multiple remediation actions are in progress. Most importantly, a procedure to alter the membership of a problem data store cluster is now well understood. In addition, a parallel self-managed memcache cluster has been implemented and is available for failover. Changes to our application are also being made to prevent a similar issue from having such a large impact on core chat functions.
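As a rough sketch of what failing over to a parallel cache cluster might look like in application code, the example below switches reads to a fallback after repeated slow or failed calls to the primary. It is an assumption-laden illustration, not a description of the actual change: the class name, the `primary_get` and `fallback_get` callables, and both thresholds are hypothetical, and no real memcache client API is used.

```python
import time
from typing import Callable, Optional


class FailoverCache:
    """Read-through wrapper that fails over after repeated slow primary calls."""

    def __init__(
        self,
        primary_get: Callable[[str], Optional[str]],   # hypothetical primary lookup
        fallback_get: Callable[[str], Optional[str]],  # hypothetical fallback lookup
        max_latency_s: float,
        max_slow_calls: int = 3,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._primary_get = primary_get
        self._fallback_get = fallback_get
        self._max_latency_s = max_latency_s
        self._max_slow_calls = max_slow_calls
        self._clock = clock
        self._slow_calls = 0
        self.failed_over = False

    def _record_slow(self) -> None:
        # Count consecutive slow or failed calls; fail over at the limit.
        self._slow_calls += 1
        if self._slow_calls >= self._max_slow_calls:
            self.failed_over = True

    def get(self, key: str) -> Optional[str]:
        if self.failed_over:
            return self._fallback_get(key)
        start = self._clock()
        try:
            value = self._primary_get(key)
        except Exception:
            # Any primary error counts as a slow call; serve from the fallback.
            self._record_slow()
            return self._fallback_get(key)
        if self._clock() - start > self._max_latency_s:
            self._record_slow()
        else:
            self._slow_calls = 0  # a fast call resets the streak
        return value
```

Requiring several consecutive slow calls before switching avoids failing over on a single latency spike. Failing back to the primary is deliberately left out and would be a manual decision in this sketch.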
FOR MORE INFORMATION
For current system status information about your Zendesk, check out our system status page. During an incident, you can also receive status updates by following @ZendeskOps on Twitter. The summary of our post-mortem investigation is usually posted here a few days after the incident has ended. If you have additional questions about this incident, please log a ticket with us.