On September 13, 2019, from 15:22 UTC to 17:51 UTC, customers on Pods 14, 19, and 23 experienced degraded performance or were unable to load Zendesk products and associated services.
18:44 UTC | 11:44 PT
We're happy to report that the performance issues affecting Pods 14, 19, and 23 are resolved. Thank you for your patience.
18:01 UTC | 11:01 PT
Performance is recovering on Pods 14, 19, and 23. We will continue to monitor the situation.
17:48 UTC | 10:48 PT
We have identified the cause of the performance issues affecting Pods 14, 19, and 23. We do not have an ETA, but we are working toward a resolution.
17:18 UTC | 10:18 PT
We are currently addressing a critical outage of an internal system affecting Pods 14, 19, and 23. There is no ETA at this time, but we are working as quickly as possible.
16:50 UTC | 9:50 PT
We are diligently working to resolve the ongoing performance issues affecting Pods 14, 19, and 23. Please note that http://status.zendesk.com is only intermittently available.
16:22 UTC | 9:22 PT
Status.zendesk.com is also impacted by the ongoing performance issues with Pods 14, 19, and 23. Please keep an eye out here for status updates.
16:05 UTC | 9:05 PT
We're investigating performance issues for accounts on Pods 14, 19, and 23. We are working to identify a root cause and will post another update shortly.
Please note: During this incident our system status page (status.zendesk.com) was itself unavailable, which impaired our ability to communicate effectively with our customers. A separate internal post-mortem will be held for that issue, and additional remediation items will be added to this post-mortem once it is complete.
Root Cause Analysis
During a vulnerability patch deployment to our infrastructure configuration datastore, an outdated configuration value was automatically set, destabilizing the cluster. This stale value, unknown to the team at the time, rendered the datastore inaccessible and required manual intervention to correct.
Our engineering team began investigating, confirmed that the datastore was inaccessible, and developed a plan to recover the service. Because downstream applications could no longer read critical service discovery data, Zendesk products and all associated services were heavily degraded. The team followed a documented procedure for restarting cluster nodes; however, the service was still unable to recover. Further investigation identified the stale configuration value as the culprit.
Once the stale configuration value was corrected at 17:33 UTC, the datastore recovered and applications were able to read configuration data again. By 17:52 UTC, applications had restarted and begun accepting connections, the cluster had stabilized, and Zendesk services had recovered.
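The failure mode described above, a deployment silently rolling a configuration key back to an outdated value, lends itself to a pre-apply guard. The following is a minimal sketch of that general pattern, not Zendesk's actual tooling: the ConfigValue record, its integer version scheme, and the validate_before_apply helper are all hypothetical names used for illustration.

```python
# A minimal sketch of a stale-configuration guard; ConfigValue, its
# integer version field, and validate_before_apply are hypothetical.
from dataclasses import dataclass


@dataclass
class ConfigValue:
    key: str
    value: str
    version: int  # monotonically increasing; higher means newer


class StaleConfigError(Exception):
    """Raised when a deployment would roll a key back to an older version."""


def validate_before_apply(live: ConfigValue, incoming: ConfigValue) -> None:
    # A patch pipeline that re-applies a bundled default can silently
    # reintroduce an outdated value; comparing versions catches the
    # rollback before the cluster ever sees it.
    if incoming.version < live.version:
        raise StaleConfigError(
            f"{incoming.key}: incoming version {incoming.version} is older "
            f"than live version {live.version}; refusing to apply"
        )


if __name__ == "__main__":
    live = ConfigValue("cluster/quorum_size", "5", version=12)
    patched = ConfigValue("cluster/quorum_size", "3", version=7)  # stale default
    try:
        validate_before_apply(live, patched)
    except StaleConfigError as err:
        print(f"blocked: {err}")
```

Rejecting the write at deployment time keeps an automated patch pipeline from ever presenting the cluster with a value older than the one it is already serving.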
Remediation Items
- Modify and document the vulnerability patching procedure to reduce risk
- Investigate architectural changes for improved stability and performance
- Correct the faulty code identified in the third-party service
- Expand chaos testing coverage to strengthen high availability
- Create new log-based alerts and monitoring for increased visibility (a sketch of one such alert follows this list)
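As a concrete illustration of the last item, here is a minimal sketch of a log-based alert, assuming nothing about Zendesk's actual pipeline: the log line pattern, the one-minute window, the threshold of 50 failures, and the notify hook are all hypothetical. The idea is to page on a sustained burst of service-discovery read failures rather than on any single transient error.

```python
# A hypothetical log-based alert sketch; the log pattern, window,
# threshold, and notify() hook are illustrative, not Zendesk's.
import re
from collections import deque
from datetime import datetime, timedelta

FAILURE_PATTERN = re.compile(r"service-discovery read failed")  # hypothetical log line
WINDOW = timedelta(minutes=1)
THRESHOLD = 50  # failures per window before paging; tune to baseline noise


def notify(message: str) -> None:
    # Placeholder for a real pager or chat integration.
    print(f"ALERT: {message}")


class FailureRateAlert:
    """Pages when too many matching log lines arrive inside one window."""

    def __init__(self) -> None:
        self.events: deque = deque()

    def observe(self, line: str, now: datetime) -> None:
        if not FAILURE_PATTERN.search(line):
            return
        self.events.append(now)
        # Evict events that have aged out of the sliding window.
        while self.events and now - self.events[0] > WINDOW:
            self.events.popleft()
        if len(self.events) >= THRESHOLD:
            notify(f"{len(self.events)} service-discovery read failures in the last minute")
            self.events.clear()  # avoid re-paging on every subsequent line


if __name__ == "__main__":
    alert = FailureRateAlert()
    start = datetime(2019, 9, 13, 15, 22)
    for i in range(THRESHOLD):
        alert.observe("service-discovery read failed: timeout", start + timedelta(seconds=i))
```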
FOR MORE INFORMATION
For current system status information about your Zendesk products, check out our system status page. During an incident, you can also receive status updates by following @ZendeskOps on Twitter. A summary of our post-mortem investigation is typically posted here a few days after the incident has ended. If you have additional questions about this incident, please log a ticket with us.