SUMMARY
On June 10, 2021, from 11:37 to 11:39 UTC on Pod 18, and from 15:07 to 15:15 UTC on Pods 17 and 23, some of our customers experienced 500 errors in Support and Guide and dropped calls in Talk.
Timeline
03:58 UTC | 20:58 PT
The earlier issues impacting Pods 17 and 23 have been resolved. This is a separate incident from the one impacting Chat and Pod 18 customers. The update for that incident will come separately later. Thank you for your patience today.
03:49 UTC | 20:49 PT
The earlier issues impacting Pods 17 and 23 have been resolved. This is a separate incident from the one in which Chat and Pod 18 customers were impacted. The update for that incident will come separately later. Thank you for your patience today.
17:55 UTC | 10:55 PT
Pods 17 and 23 remain stable. We continue to investigate and monitor the situation and will provide updates as new information becomes available or once we deem the issue fully resolved. If you encounter further issues, please let us know so we can investigate.
17:18 UTC | 10:18 PT
Pods 17 and 23 remain stable. We have determined that Pod 19 was not impacted. We continue to investigate and monitor the situation. If you encounter further issues, please let us know so we can investigate.
16:38 UTC | 09:38 PT
Pods 17, 19, and 23 have remained stable. We continue to investigate and monitor the situation. If you continue seeing issues, please let us know so we can investigate.
16:04 UTC | 09:04 PT
The errors across pods 17, 19, and 23 have subsided. We continue to investigate and monitor the situation. We will provide further updates as we have more information.
15:34 UTC | 08:34 PT
We are investigating increased error rates across pods 17, 19, and 23. This may result in dropped calls and errors within Support and Guide. We will provide updates as we have more information.
POST-MORTEM
Root Cause Analysis
This incident was caused by a database change that was part of a long-running project to increase our systems' reliability.
A change to the way data is written to Guide database tables, which added more row updates to each transaction, caused exponential growth in update transaction events on our database services.
The resulting high write activity in the database clusters slowed down the database nodes.
This in turn increased the size of the database binary log files many times over; these files took longer and longer for our processes to read, causing excessive handling times for all requests running on these Pods.
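To make the binary log growth described above concrete, here is a minimal sketch (not our tooling) of how the total size of a node's binary log files can be checked. It assumes a MySQL-compatible engine, which the mention of binary log files suggests, and uses the PyMySQL client; the host name and credentials are placeholders.

```python
# Minimal sketch: total the binary log file sizes reported by a database node.
# Assumes a MySQL-compatible engine and the PyMySQL client; the host and
# credentials below are placeholders, not real infrastructure.
import pymysql


def total_binlog_bytes(host: str, user: str, password: str) -> int:
    """Sum the File_size column reported by SHOW BINARY LOGS."""
    conn = pymysql.connect(host=host, user=user, password=password)
    try:
        with conn.cursor() as cur:
            cur.execute("SHOW BINARY LOGS")  # rows: (Log_name, File_size, ...)
            return sum(int(row[1]) for row in cur.fetchall())
    finally:
        conn.close()


if __name__ == "__main__":
    size = total_binlog_bytes("db.example.internal", "monitor", "placeholder")
    print(f"Binary logs currently occupy {size / 1024 ** 3:.1f} GiB")
```

Sampling a figure like this over time is one way the rapid binary log growth described above becomes visible.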
Resolution
To fix this issue, our engineers first broke the cycle that caused the continuous updates.
After this, they increased our computing resources and set a maximum on the number of updates allowed in each transaction.
Once the service was restarted, functionality returned to normal.
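As an illustration of what limiting the number of row updates per transaction can look like, the sketch below commits after every fixed-size batch instead of folding an unbounded number of updates into a single transaction. The table name, column names, batch size, and connection details are hypothetical, and PyMySQL against a MySQL-compatible database is an assumption; this is not our production code.

```python
# Minimal sketch: cap the number of row updates carried by each transaction.
# Table/column names, batch size, and connection details are hypothetical.
import pymysql

BATCH_SIZE = 500  # illustrative cap on row updates per transaction


def update_in_batches(conn, article_ids, new_state):
    """Apply one UPDATE per article, committing every BATCH_SIZE rows."""
    with conn.cursor() as cur:
        for i, article_id in enumerate(article_ids, start=1):
            cur.execute(
                "UPDATE guide_articles SET state = %s WHERE id = %s",
                (new_state, article_id),
            )
            if i % BATCH_SIZE == 0:
                conn.commit()  # close the transaction before it grows too large
    conn.commit()  # flush the final partial batch


if __name__ == "__main__":
    conn = pymysql.connect(host="db.example.internal", user="app",
                           password="placeholder", database="guide",
                           autocommit=False)
    update_in_batches(conn, article_ids=range(1, 10_001), new_state="published")
    conn.close()
```

Keeping each transaction to a bounded number of row updates keeps individual commits, and the binary log entries they produce, from growing without limit.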
Remediation Items
- Create monitoring specific to these database errors. [In Progress]
- Update our internal documentation and best practices. [In Progress]
- Create additional alerts (a minimal example follows this list). [In Progress]
- Review and re-evaluate all current services using our shared database. [In Progress]
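As a rough illustration of the kind of alerting mentioned above (not our actual monitoring), the sketch below samples the server's cumulative Com_update counter twice and flags an unusually high update rate. The connection details, sampling interval, and threshold are placeholder assumptions.

```python
# Minimal sketch: flag an abnormal UPDATE rate by sampling Com_update twice.
# Connection details, interval, and threshold are placeholder assumptions.
import time

import pymysql

SAMPLE_INTERVAL_SEC = 60
THRESHOLD_UPDATES_PER_SEC = 5_000  # hypothetical alert threshold


def com_update_count(conn) -> int:
    """Read the cumulative Com_update counter from SHOW GLOBAL STATUS."""
    with conn.cursor() as cur:
        cur.execute("SHOW GLOBAL STATUS LIKE 'Com_update'")
        _, value = cur.fetchone()
        return int(value)


if __name__ == "__main__":
    conn = pymysql.connect(host="db.example.internal", user="monitor",
                           password="placeholder")
    first = com_update_count(conn)
    time.sleep(SAMPLE_INTERVAL_SEC)
    rate = (com_update_count(conn) - first) / SAMPLE_INTERVAL_SEC
    if rate > THRESHOLD_UPDATES_PER_SEC:
        print(f"ALERT: {rate:.0f} updates/sec exceeds {THRESHOLD_UPDATES_PER_SEC}")
    conn.close()
```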
FOR MORE INFORMATION
For current system status information about your Zendesk, check out our system status page. During an incident, you can also receive status updates by following @ZendeskOps on Twitter. The summary of our post-mortem investigation is usually posted here a few days after the incident has ended. If you have additional questions about this incident, please log a ticket with us.
Post-mortem published 22nd of June 2021.