On November 12th, 2019, from 07:38 UTC to 08:32 UTC, our Support, Guide, Talk, Chat, and Explore products were impacted by network connectivity issues and increased error rates in one of the 3 Availability Zones (AZs) that Pod 18 is distributed across. The degree of service degradation varied by product and fluctuated throughout the incident as systems self-healed.
14:26 UTC | 06:26 PT
We have concluded our investigation and can confirm that Pod 18 has been performing as expected since 08:29 UTC.
09:37 UTC | 01:37 PT
We can confirm that service is restored for Pod 18 customers as of 08:29 UTC. Our teams will continue to monitor.
08:41 UTC | 00:41 PT
We are seeing improvement for Pod 18 customers. Our teams continue to monitor. Please reach out to email@example.com if you continue to experience issues.
08:19 UTC | 00:19 PT
Our teams are investigating an outage impacting Pod 18 customers. We will provide an update shortly.
Root Cause Analysis
This incident was the result of an AWS incident impacting one Availability Zone in the eu-central-1 region.
At 07:46 UTC, our Kubernetes cluster in eu-central-1 automatically increased capacity in the 2 healthy Availability Zones and began moving capacity away from the unhealthy AZ. The majority of our database instances within the unhealthy AZ automatically completed failovers by 08:00 UTC, with full recovery taking until 08:30 UTC.
Remediation Items
- Improve the resilience of our MySQL ID allocation service to Availability Zone failures.
- Investigate ways to decrease impact time when an Availability Zone is unhealthy.
- Investigate ways to reduce Search latency when an Availability Zone is lost.
- Replicate this Availability Zone failure in our staging environment to further improve recovery steps.
- Review our hosting provider's analysis and any further remediation items after our internal postmortem.
- Validate and enhance cross-partnership incident response.
FOR MORE INFORMATION
For current system status information about your Zendesk, check out our system status page. During an incident, you can also receive status updates by following @ZendeskOps on Twitter. The summary of our post-mortem investigation is usually posted here a few days after the incident has ended. If you have additional questions about this incident, please log a ticket with us.