SUMMARY
From June 3, 2021 23:34 UTC to June 4, 2021 01:13 UTC, Support customers across multiple Pods (Pods 15, 17, 18, 19, 23, 26, 27) were unable to view ticket updates or refresh ticket counts.
Timeline
02:31 UTC | 19:31 PT
We can confirm the issues experienced in Support on the following Pods are now resolved: 15, 17, 18, 19, 23, 25, and 27. Thanks again for your patience while we worked through this incident.
01:35 UTC | 18:35 PT
We have completed the rollback of the change that caused the errors in Support as of 01:13 UTC. We will monitor for 30 minutes and provide a final update soon.
00:59 UTC | 17:59 PT
We have identified the issue causing errors in Support. Our team is rolling back the change that caused it and working toward a full resolution. Thanks for your patience while we work through this.
00:38 UTC | 17:38 PT
We're investigating issues when trying to update tickets in Support on multiple Pods. More information to come.
Root Cause Analysis
This incident was caused by a deployed change that added support for ETag (HTTP cache) response headers. The change resulted in validation errors when customers updated tickets or refreshed ticket counts in Views.
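For readers unfamiliar with ETag headers, the sketch below illustrates the general HTTP mechanism: a server tags each response with a version identifier, and later requests can be validated against it. This is a generic illustration in Python, not Zendesk's middleware; the function names and the hashing scheme are assumptions made for the example, and the report does not describe the exact validation that failed.

```python
import hashlib
from typing import Dict, Optional, Tuple


def make_etag(body: bytes) -> str:
    """Return a quoted hex digest of the body to use as an ETag value."""
    return '"' + hashlib.sha256(body).hexdigest()[:16] + '"'


def handle_get(
    body: bytes, if_none_match: Optional[str] = None
) -> Tuple[int, Dict[str, str], bytes]:
    """Serve the body, or 304 Not Modified if the client's copy is current."""
    etag = make_etag(body)
    if if_none_match == etag:
        return 304, {"ETag": etag}, b""
    return 200, {"ETag": etag}, body


def handle_update(
    current: bytes, new: bytes, if_match: Optional[str] = None
) -> Tuple[int, bytes]:
    """Apply an update only if the client's ETag matches the stored version."""
    if if_match is not None and if_match != make_etag(current):
        # 412 Precondition Failed: the client is working from a stale copy.
        return 412, current
    return 200, new
```

In this sketch, an update sent with an ETag that no longer matches the stored version is rejected, which is one way a validation layer of this kind can surface errors to clients.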
Resolution
To fix this issue, Zendesk engineers cancelled the deployment to the remaining pods and started a new deployment to roll back to the previously released version. At 01:13 UTC, the final affected pod completed its rollback, stabilizing the environment.
Remediation Items
- Improved our deploy dashboards to make these types of errors visible.
- Introduced additional processes for any middleware changes to reduce risk.
- Improved testing to catch similar issues before reaching production.
- Continuing to improve our ability to roll back faster by propagating changes to more replicas in parallel.
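The last item, rolling back more replicas in parallel, can be sketched roughly as follows. This is a minimal Python illustration under stated assumptions, not Zendesk's deployment tooling: `roll_back_all`, `roll_back_one`, and `max_parallel` are hypothetical names, and the per-replica rollback step is supplied by the caller.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable


def roll_back_all(
    replicas: Iterable[str],
    roll_back_one: Callable[[str], bool],
    max_parallel: int = 4,
) -> Dict[str, bool]:
    """Run roll_back_one on every replica, at most max_parallel at a time.

    Returns a mapping from replica name to the result reported for it.
    """
    names = list(replicas)
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        # pool.map preserves input order, so results line up with names.
        results = list(pool.map(roll_back_one, names))
    return dict(zip(names, results))
```

Raising `max_parallel` shortens the total rollback time but touches more replicas at once, so the value is a trade-off between speed and the blast radius of a bad rollback.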
FOR MORE INFORMATION
For current system status information about your Zendesk, check out our system status page. During an incident, you can also receive status updates by following @ZendeskOps on Twitter. The summary of our post-mortem investigation is usually posted here a few days after the incident has ended. If you have additional questions about this incident, please log a ticket with us.