15:55 UTC | 08:55 PT
We have resolved the issues that caused loss of access and error messages for some Chat users, and access has been restored. Thank you for your patience!
15:17 UTC | 08:17 PT
We have identified the issue resulting in access problems and error messages for some Chat users. Our team is currently working to remediate the overall issue and will have an update with more info as it becomes available.
14:33 UTC | 07:33 PT
We are currently working through a Chat service incident. Impacted users might be experiencing issues logging in and receiving error messages. Please bear with us!
14:12 UTC | 07:12 PT
We received reports of issues with our Chat. Our Engineering team is working on investigating the cause. Apologies for the inconvenience.
On October 23, 2019, from 13:39 UTC to 15:14 UTC, some customers using Zendesk Chat were unable to log in, and agents who were already logged in experienced Chat service degradation. This incident was a recurrence of the previous day’s Chat login and performance issues. Unfortunately, this second incident occurred before the proposed fix had been thoroughly tested and deployed. However, it did confirm our findings and increase our confidence in the fix.
Root Cause Analysis
This incident was caused by a code defect in Zendesk Chat that, under specific circumstances, caused certain operations to enter an infinite loop for a short time. The loop progressively amplified the number of queries to a master MySQL database, eventually overwhelming it. The resulting drop in database performance led to login and performance issues for customers.
To fix this issue, we first attempted to kill the offending database queries. When those attempts failed to resolve the issue, our DBA team restarted the master MySQL cluster, which resulted in full service recovery.
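For readers unfamiliar with the first mitigation step, the following is a minimal sketch of how long-running statements might be selected for termination. It is not the procedure our DBA team used. The `ProcessRow` type, the `kill_statements` helper, the 30-second threshold, and the text filter are all hypothetical; only the `information_schema.PROCESSLIST` view and the `KILL QUERY` statement are standard MySQL.

```python
from dataclasses import dataclass
from typing import Iterable, List

# Standard MySQL view listing active sessions; an operator would run this
# and feed the rows into kill_statements() below.
PROCESSLIST_SQL = (
    "SELECT ID, TIME, COMMAND, INFO "
    "FROM information_schema.PROCESSLIST "
    "WHERE COMMAND = 'Query'"
)


@dataclass
class ProcessRow:
    """One row of the process list (hypothetical container type)."""
    id: int        # connection ID
    time_s: int    # seconds the statement has been running
    command: str   # e.g. 'Query', 'Sleep'
    info: str      # statement text, may be empty


def kill_statements(rows: Iterable[ProcessRow],
                    max_seconds: int = 30,
                    pattern: str = "") -> List[str]:
    """Build KILL QUERY statements for statements running too long.

    max_seconds and pattern are illustrative assumptions, not values
    taken from the incident.
    """
    statements = []
    for row in rows:
        if row.command != "Query":
            continue  # ignore idle or non-query sessions
        if row.time_s < max_seconds:
            continue  # still within the allowed runtime
        if pattern and pattern not in (row.info or ""):
            continue  # not the statement shape being targeted
        # KILL QUERY ends the statement but keeps the connection open.
        statements.append(f"KILL QUERY {row.id}")
    return statements
```

Killing statements treats only the symptom: if the application keeps issuing the same queries, new ones replace those that were terminated.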
Remediation Items
- [Completed] Application fix that addresses the race condition which resulted in high CPU on the database.
- [In Progress] Investigate further partitioning of Chat’s infrastructure to reduce blast radius from database slowness.
- [Not Started] Application fix which further improves the inter-service communication and error handling.
- [Not Started] Centralize circuit-breaking mechanisms to ensure uniform circuit open/close behaviour.
- [Not Started] Improve observability of MySQL queries hitting the Chat database.
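To illustrate the circuit-breaking item above, here is a minimal sketch of a circuit breaker with uniform open/close behaviour. It is a generic illustration of the pattern, not Zendesk Chat's implementation. The class name, the thresholds, and the injectable clock are assumptions made for the example.

```python
import time
from typing import Any, Callable, Optional


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    """Hypothetical breaker: opens after N consecutive failures,
    allows a trial call after a cool-down, closes again on success."""

    def __init__(self,
                 failure_threshold: int = 3,
                 reset_timeout_s: float = 10.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.reset_timeout_s:
            return "half-open"  # cool-down elapsed, one trial call allowed
        return "open"

    def call(self, fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Any:
        if self.state == "open":
            # Fail fast instead of adding load to a struggling dependency.
            raise CircuitOpenError("circuit open; call rejected")
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._failures += 1
            # A failed trial call, or too many consecutive failures,
            # (re)opens the circuit and restarts the cool-down.
            if (self.state == "half-open"
                    or self._failures >= self.failure_threshold):
                self._opened_at = self._clock()
            raise
        # Any success resets the breaker to the closed state.
        self._failures = 0
        self._opened_at = None
        return result
```

Wrapping every database-facing call in one shared breaker type, rather than in per-service variants, is one way to get the same open/close rules everywhere. Whether that matches the eventual design is not stated in this report.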
FOR MORE INFORMATION
For current system status information about your Zendesk, check out our system status page. During an incident, you can also receive status updates by following @ZendeskOps on Twitter. The summary of our post-mortem investigation is usually posted here a few days after the incident has ended. If you have additional questions about this incident, please log a ticket with us.