SUMMARY
On October 22, 2020 at 02:41 UTC, customers on Pod 25 began encountering server errors when accessing Zendesk Support, Talk, Guide, Sunshine, and Chat (when accessed through Zendesk Support).
Timeline
05:44 UTC | 22:44 PT
We're happy to report that all issues impacting Pod 25 have been resolved. Thank you for your patience.
05:29 UTC | 22:29 PT
We are observing stability and are continuing to monitor to ensure all services have returned to normal. Thanks for your patience.
04:41 UTC | 21:41 PT
Our engineers continue to work through the remaining issues on Pod 25. Next update in one hour or when resolved; thanks for your patience today.
04:06 UTC | 21:06 PT
We are observing recovery for most customers, though errors may still occur in Search and other services on Pod 25. We are working on fixing the remaining issues. Next update in 30 minutes.
03:41 UTC | 20:41 PT
We continue to investigate server errors and performance degradation on Pod 25. We've observed some recovery but continue to work towards full recovery. More to come.
03:11 UTC | 20:11 PT
We are currently investigating connectivity issues causing performance degradation on Pod 25. More updates to follow.
Root Cause Analysis
Our automated monitoring systems immediately alerted Zendesk’s engineering teams to the problem. Upon investigation, the team determined that the connectivity failures and errors were caused by AWS interconnectivity issues in the region, which was confirmed by a report on the AWS status page.
Resolution
AWS resolved their network connectivity issues at 03:41 UTC, at which point most of our services recovered, as confirmed by some Zendesk customers. Despite this, some customers continued to experience higher than normal error rates in search and when accessing archived tickets. This was due to “degraded performance for some EBS volumes within the affected Availability Zone” (as reported by AWS). The EBS issue prevented our team from mounting a number of storage volumes to the Riak cluster that underpins our ticket archiving service. Zendesk engineering worked directly with AWS engineering to fully resolve the EBS volume mounting issue at 05:17 UTC. The team was then able to recover the Riak cluster and resume our ticket archiving service.
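To make that recovery step concrete, here is a minimal sketch, in Python with boto3, of the kind of pre-flight check and attach call an operator can run before remounting a data volume on a storage node such as a Riak member. It is illustrative only, not the tooling used during this incident, and the region, volume, instance, and device identifiers are hypothetical placeholders.

```python
# Illustrative sketch only (not Zendesk's actual tooling): confirm that an EBS
# volume is healthy and attachable, attach it, and wait for the attachment to
# complete before bringing a storage node back online. Region, volume ID,
# instance ID, and device name below are hypothetical placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-west-2")  # assumed region

VOLUME_ID = "vol-0123456789abcdef0"   # placeholder
INSTANCE_ID = "i-0123456789abcdef0"   # placeholder
DEVICE = "/dev/sdf"                   # placeholder device name


def attach_when_available(volume_id: str, instance_id: str, device: str) -> None:
    """Attach the volume only if EBS reports it as healthy and unattached."""
    volume = ec2.describe_volumes(VolumeIds=[volume_id])["Volumes"][0]
    if volume["State"] != "available":
        # An impaired or already-attached volume cannot be attached here; in
        # this incident, degraded EBS volumes kept the Riak data volumes from
        # being mounted until AWS resolved the underlying issue.
        raise RuntimeError(f"{volume_id} is '{volume['State']}', not attachable")

    ec2.attach_volume(VolumeId=volume_id, InstanceId=instance_id, Device=device)

    # Block until EBS reports the volume as in-use; the filesystem mount on the
    # host itself happens afterwards, outside this sketch.
    ec2.get_waiter("volume_in_use").wait(VolumeIds=[volume_id])
    print(f"{volume_id} attached to {instance_id} as {device}")


if __name__ == "__main__":
    attach_when_available(VOLUME_ID, INSTANCE_ID, DEVICE)
```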
Separately, reader nodes on two database clusters were disabled by the initial interconnectivity issues and did not recover automatically, which caused increased error rates for database queries. Our DBA team applied manual recovery procedures and restored both DB clusters at 04:59 UTC.
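For illustration, a minimal sketch of what such a manual check can look like is shown below, in Python with boto3: it lists an Aurora cluster’s reader instances and reboots any that are not reporting as available. This is not our actual recovery runbook, and the region and cluster identifier are hypothetical placeholders.

```python
# Illustrative sketch only (not an actual runbook): enumerate an Aurora
# cluster's reader instances and reboot any that are not reporting as
# available. The region and cluster identifier are hypothetical placeholders.
import boto3

rds = boto3.client("rds", region_name="us-west-2")  # assumed region

CLUSTER_ID = "example-aurora-cluster"  # placeholder


def recover_unavailable_readers(cluster_id: str) -> None:
    """Reboot reader instances that are stuck in a non-available state."""
    cluster = rds.describe_db_clusters(DBClusterIdentifier=cluster_id)["DBClusters"][0]
    readers = [
        member["DBInstanceIdentifier"]
        for member in cluster["DBClusterMembers"]
        if not member["IsClusterWriter"]
    ]
    for instance_id in readers:
        instance = rds.describe_db_instances(DBInstanceIdentifier=instance_id)["DBInstances"][0]
        status = instance["DBInstanceStatus"]
        if status == "available":
            print(f"reader {instance_id} is healthy")
            continue
        # A reader that stays in a non-available state keeps contributing
        # errors to read traffic until it is restarted or replaced.
        print(f"rebooting reader {instance_id} (status: {status})")
        rds.reboot_db_instance(DBInstanceIdentifier=instance_id)


if __name__ == "__main__":
    recover_unavailable_readers(CLUSTER_ID)
```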
We verified all services were back online with expected performance at 05:18 UTC.
Remediations
- Work with AWS to understand why the Aurora MySQL databases did not automatically recover after network connectivity was restored, and put a mitigation plan in place.
- Work with AWS on the EBS volume issue that impacted our Riak cluster recovery.
FOR MORE INFORMATION
For current system status information about your Zendesk, check out our system status page. During an incident, you can also receive status updates by following @ZendeskOps on Twitter. The summary of our post-mortem investigation is usually posted here a few days after the incident has ended. If you have additional questions about this incident, please log a ticket with us.