Summary
On June 21, 2022, from 06:27 UTC to 07:45 UTC, Zendesk customers in many geographical regions experienced an outage affecting multiple Zendesk products. Globally, total error rates rose to ~50% of all traffic. Errors had dropped to 12% of all traffic by 07:15 UTC, and full recovery was observed at 07:45 UTC.
Timeline
06:49 UTC | 23:49 PT
We are investigating errors accessing multiple Zendesk products across multiple locations. More to come.
07:13 UTC | 00:13 PT
Our service provider is currently working through a large network issue impacting multiple services across the internet. Zendesk products are impacted by this issue. We are working on failing over to an internal network as soon as possible. More to come.
07:26 UTC | 00:26 PT
We are observing recovery across all Zendesk products in all locations. We are continuing to monitor the situation until full recovery. Thanks for your patience today. We will provide another update soon.
08:06 UTC | 01:06 PT
Our service provider experienced a large network issue impacting multiple services across the internet and has now implemented a fix for it. We see gradual recovery for all products and we are monitoring closely. We’ll update you within the next hour with more details.
09:03 UTC | 02:03 PT
We’re happy to report that our service provider has confirmed the resolution of the issue on their end and our customers are no longer seeing errors. There is no backlog for failed ticket creation on our side. Thank you!
Root Cause Analysis
This incident was caused by a major outage experienced by Zendesk’s CDN partner.
Resolution
Zendesk was taking steps to bypass our primary CDN when we started to observe traffic recovering through it. We paused the bypass work shortly after observing substantial improvement in the network, then closely monitored the situation until Zendesk services had fully recovered.
Remediation Items
- Follow up with our CDN partner for a full Root Cause Analysis of the impact on their platform [Completed]
- Evaluate the feasibility of certain Zendesk products bypassing the CDN in similar situations [Scheduled]
- Automate disaster recovery to reduce the dependency on engineers [Scheduled]
- Explore steps to speed up the global disaster recovery process [Scheduled]
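As an illustration only, the kind of decision an automated disaster recovery process might make can be sketched as a small error-rate check with hysteresis: start bypassing the CDN after errors stay high for several samples, and pause the bypass once errors stay low again, much like the manual decision described under Resolution. This is not Zendesk's implementation. The names `FailoverPolicy` and `plan_bypass`, the threshold values, and the sample error rates are all hypothetical and do not come from this postmortem.

```python
from dataclasses import dataclass


@dataclass
class FailoverPolicy:
    """Hypothetical thresholds; none of these values come from the postmortem."""
    trip_rate: float = 0.30     # error rate at or above which bypass is considered
    clear_rate: float = 0.15    # error rate below which bypass is paused
    sustain_samples: int = 3    # consecutive samples required to change state


def plan_bypass(error_rates, policy=FailoverPolicy()):
    """Return one decision per sample: 'route_via_cdn' or 'bypass_cdn'."""
    bypassing = False
    streak = 0
    decisions = []
    for rate in error_rates:
        # Count consecutive samples that argue for flipping the current state.
        if bypassing:
            wants_flip = rate < policy.clear_rate
        else:
            wants_flip = rate >= policy.trip_rate
        streak = streak + 1 if wants_flip else 0
        if streak >= policy.sustain_samples:
            bypassing = not bypassing
            streak = 0
        decisions.append("bypass_cdn" if bypassing else "route_via_cdn")
    return decisions


# Illustrative error rates only, not measured data from the incident.
sample_rates = [0.01, 0.50, 0.50, 0.50, 0.40, 0.12, 0.10, 0.05, 0.01]
decisions = plan_bypass(sample_rates)
```

Requiring several consecutive samples before changing state keeps a single noisy reading from triggering or cancelling a failover. Using separate trip and clear thresholds avoids flapping when the error rate hovers near one value.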
FOR MORE INFORMATION
For current system status information about your Zendesk, check out our system status page. The summary of our post-mortem investigation is usually posted here a few days after the incident has ended. If you have additional questions about this incident, please log a ticket with us via ZBot Messaging within the Widget.
Postmortem published June 27, 2022.