SUMMARY
On October 15, 2020, from 06:44 UTC to 08:22 UTC, a small number of Zendesk Chat customers experienced server errors when attempting to access and use various features within the product.
TIMELINE
08:53 UTC | 01:53 PT
We received reports that Chat was down for some customers. We have identified the issues that caused the degradation earlier today, and they have now been fixed. Should you continue to have any issues, please let us know.
12:16 UTC | 05:16 PT
We experienced elevated server errors from 06:44 UTC to 08:22 UTC, when a fix was deployed and Chat availability returned to normal. Service has been restored, and a summary of our post-mortem investigation will be posted here: https://zdsk.co/3lRrbvs
POST-MORTEM
Root Cause Analysis
This incident was caused by locking errors on a Chat database, which resulted from a misconfiguration in a recent code deploy. These locking errors caused connection spikes and CPU exhaustion on the Chat database, leading to the errors experienced by customers.
Resolution
To fix this issue, our engineering team reverted the database locking change and restarted the application servers. However, recovery was delayed by multiple ongoing deployments. We stopped those deployments and rolled back to the last known working version of the application. We then restarted the affected services and observed full recovery.
Remediation Items
- Implement process change to improve QA for Chat Ops changes [Scheduled]
- Migrate database locking away from current version [Scheduled]
- Implement alternative database locking mechanisms [Scheduled]
FOR MORE INFORMATION
For current system status information about your Zendesk, check out our system status page. During an incident, you can also receive status updates by following @ZendeskOps on Twitter. The summary of our post-mortem investigation is usually posted here a few days after the incident has ended. If you have additional questions about this incident, please log a ticket with us.