SUMMARY
On March 1st, 2021, from 17:00 UTC to 18:07 UTC, Zendesk experienced performance issues with Support ticket creation and updates in Pod 23.
Timeline
18:35 UTC | 10:35 PT
We're happy to report that the issue causing increased errors in Support on Pod 23 has been resolved. Apologies for any inconvenience caused.
17:59 UTC | 09:59 PT
We are currently investigating increased error rates affecting Support on Pod 23. This may result in ticket creation or update errors.
POST-MORTEM
Root Cause Analysis
This incident resulted from a software change made as part of a long-term project to increase Support and Guide reliability. To facilitate data replication for that project, a backfill job was introduced that scans items in a shared Redis database cluster.
The backfill job caused unexpected CPU load on the shared Redis clusters, which delayed read and write requests and caused some to time out. During this time, affected services (primarily ticket creation and updates) failed intermittently.
At 17:10 UTC, internal monitoring alerted the operations team that CPU utilization on the Redis clusters was high and that replication was delayed.
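As an illustration of how a scan-based backfill can limit the load it places on a shared cluster, here is a minimal sketch of a throttled, cursor-based scan. It is not the actual backfill job. The `FakeRedis` stand-in, the `throttled_scan` helper, and the batch size and pause values are all hypothetical. The stub mimics the cursor convention of the Redis SCAN command, where a returned cursor of 0 means the iteration is complete.

```python
import time
from typing import Callable, Iterator, List, Tuple


class FakeRedis:
    """Hypothetical in-memory stand-in with a cursor-based scan(cursor, count)."""

    def __init__(self, keys: List[str]) -> None:
        self._keys = keys

    def scan(self, cursor: int = 0, count: int = 10) -> Tuple[int, List[str]]:
        batch = self._keys[cursor:cursor + count]
        next_cursor = cursor + count
        # As with Redis SCAN, a returned cursor of 0 signals the end of iteration.
        if next_cursor >= len(self._keys):
            next_cursor = 0
        return next_cursor, batch


def throttled_scan(
    client: FakeRedis,
    batch_size: int = 100,
    pause_seconds: float = 0.05,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[str]:
    """Yield every key, pausing between batches to cap load on the server."""
    cursor = 0
    while True:
        cursor, keys = client.scan(cursor=cursor, count=batch_size)
        yield from keys
        if cursor == 0:
            break
        # Back off before requesting the next batch (assumed tuning knob).
        sleep(pause_seconds)
```

Small batches with a pause between them trade backfill speed for headroom on the shared cluster. Appropriate values would depend on the cluster's capacity and on the traffic it already serves.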
Resolution
To fix this issue, we stopped the newly introduced data replication workers, which were scanning the Redis database. We also took the following steps:
- The Redis replication buffer/timeout was increased so the writer/reader(s) synchronization could complete.
- One Redis reader node (we maintain two) was taken out of rotation to alleviate resource contention from background synchronization.
At 18:07 UTC, full service availability was restored, and several remediation items were immediately identified (see Remediation Items below).
Remediation Items
- Roll back the background data replication job for the Support/Guide migration
- Deploy a code change that allows ticket create/update operations to succeed when Redis is under duress
- Review and enhance alerting on Redis performance
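
One way to let ticket writes succeed while Redis is under duress is to treat the Redis step as best-effort. The sketch below shows that general pattern only. The report does not describe the actual code change, so the `update_ticket` function, the dictionary used as the primary store, and the `redis_side_effect` callback are hypothetical, and the sketch assumes the Redis step is not required for the write to be correct.

```python
import logging
from typing import Callable, Dict

logger = logging.getLogger("ticket_updates")


def update_ticket(
    ticket_id: int,
    fields: Dict[str, str],
    primary_store: Dict[int, Dict[str, str]],
    redis_side_effect: Callable[[int], None],
) -> bool:
    """Persist the update, then attempt the Redis step without failing on it.

    Returns True if the Redis step also succeeded, False if it was skipped.
    """
    # The write to the primary store is the part that must succeed.
    primary_store.setdefault(ticket_id, {}).update(fields)
    try:
        # Assumed non-critical step, e.g. refreshing a cached value.
        redis_side_effect(ticket_id)
    except (TimeoutError, ConnectionError) as exc:
        # Degrade gracefully: record the failure and keep the ticket update.
        logger.warning("Redis step skipped for ticket %s: %s", ticket_id, exc)
        return False
    return True
```

The boolean return value lets a caller count skipped Redis steps, which could also feed the kind of alerting mentioned in the last item.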
FOR MORE INFORMATION
For current system status information about your Zendesk, check out our system status page. During an incident, you can also receive status updates by following @ZendeskOps on Twitter. The summary of our post-mortem investigation is usually posted here a few days after the incident has ended. If you have additional questions about this incident, please log a ticket with us.