On September 16, 2019, from 17:58 UTC to 19:18 UTC, customers on Pod 14 saw heavily degraded performance across all Zendesk products and services, followed by an outage.
20:10 UTC | 13:10 PT
We've confirmed that the issues affecting accounts on Pod 14 and the JIRA integration stabilized at 19:18 UTC and are now resolved. We apologize for the disruption this caused.
19:21 UTC | 12:21 PT
We are beginning to see recovery on Pod 14 and are continuing to monitor performance.
18:46 UTC | 11:46 PT
Our team has identified the root cause of the outage on Pod 14 and is working to remediate the issue. The JIRA integration on all pods is also affected by this outage.
18:17 UTC | 11:17 PT
We are investigating an outage on Pod 14. Our Ops team is aware of the issue and is working to return functionality.
Root Cause Analysis
This incident was caused by a configuration issue in our deployment tool that led to infrastructure resources being deleted before their replacements were created. We intervened manually to rebuild capacity, but our manual remediation attempts conflicted with our vendor’s automated processes, which prevented a quick rollback.
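For illustration only, the difference between deleting resources before creating replacements and creating replacements first can be sketched as below. This is not the actual deployment tool or its configuration; the function names, the `is_healthy` check, and the node names are all hypothetical, and nodes are modelled as plain strings in a list.

```python
def replace_delete_first(nodes, replacements):
    """Delete old nodes, then create new ones. Returns capacity after each step."""
    history = [len(nodes)]
    nodes.clear()                  # old capacity is gone before anything replaces it
    history.append(len(nodes))
    nodes.extend(replacements)     # capacity only returns once creation succeeds
    history.append(len(nodes))
    return history


def replace_create_first(nodes, replacements, is_healthy=lambda node: True):
    """Create new nodes, verify them, then retire old ones.

    Returns capacity after each step. `is_healthy` is a hypothetical
    health check supplied by the caller.
    """
    history = [len(nodes)]
    old = list(nodes)
    nodes.extend(replacements)     # old and new nodes run side by side
    history.append(len(nodes))
    if all(is_healthy(node) for node in replacements):
        nodes[:] = list(replacements)   # safe to retire the old nodes
    else:
        nodes[:] = old                  # keep the old nodes, drop failed replacements
    history.append(len(nodes))
    return history


if __name__ == "__main__":
    print(replace_delete_first(["old-1", "old-2"], ["new-1", "new-2"]))
    print(replace_create_first(["old-1", "old-2"], ["new-1", "new-2"]))
```

In the delete-first ordering, capacity passes through zero; in the create-first ordering, it never drops below its starting level, and a failed health check leaves the old nodes in place.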
To fix this issue, we rolled out a previous known-good configuration, manually corrected each node to ensure it was shown as available, and manually increased capacity to speed up cluster recovery. We also ensured that the recovering services were prioritized correctly so that critical dependencies came up first, which fully restored capacity to the pod.
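As a rough sketch of what bringing critical dependencies up first can look like, a recovery order can be derived from a dependency map with a topological sort. The service names and the dependency map below are hypothetical and do not describe the actual services on the pod; the example uses Python's standard `graphlib` module (Python 3.9+).

```python
from graphlib import TopologicalSorter


def recovery_order(dependencies):
    """Order services so each one starts only after everything it depends on.

    `dependencies` maps a service name to the set of services it needs.
    """
    return list(TopologicalSorter(dependencies).static_order())


# Hypothetical dependency map, for illustration only.
example_dependencies = {
    "database": set(),
    "cache": set(),
    "api": {"database", "cache"},
    "web": {"api"},
}

if __name__ == "__main__":
    for service in recovery_order(example_dependencies):
        print("start", service)
```

Services with no dependencies come out first, and `TopologicalSorter` raises `graphlib.CycleError` if the map contains a circular dependency, which would need to be resolved before an order can be computed.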
Remediation Items
- Improve existing implementation process for changes that may trigger deletion of resources
- Update process and scope of chaos testing
- Improve tooling and systems to allow for automatic repair and recovery
- Update architecture to minimize impact in the event of future incidents
FOR MORE INFORMATION
For current system status information about your Zendesk, check out our system status page. During an incident, you can also receive status updates by following @ZendeskOps on Twitter. The summary of our post-mortem investigation is usually posted here a few days after the incident has ended. If you have additional questions about this incident, please log a ticket with us.