02:46 UTC | 18:46 PT
The service incident related to Pod 20 server errors is now resolved. We will publish a post-mortem on this incident as soon as the investigation is complete.
02:26 UTC | 18:26 PT
Customers on Pod 20 may experience a high rate of errors and slowness using Support, Guide, and Talk. Our teams are investigating; we apologize for the inconvenience.
POST-MORTEM
On February 9, 2019, between 1:32 UTC and 2:07 UTC, users on Pod 20 experienced high error rates and slowness when using Support, Guide, and Talk.
This incident was the result of an AWS outage in the US-WEST-2 (Oregon) Region. The root cause was an AWS database software update to the Aurora storage layer. This operation placed a higher than anticipated load on the storage control system, which in turn impaired its ability to create new databases. It also impacted the renewal of leases to encryption keys for existing databases: these leases expire over time, and when a renewal operation fails the affected database becomes unavailable. As a result, some databases that had already been created became unavailable. To resolve the issue, AWS added capacity and moved create API calls to another storage cell in the same region that was not impacted by the event, which accelerated recovery.
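To picture the failure mode, here is a minimal sketch of a key-lease renewal loop. Everything in it is illustrative rather than AWS's actual implementation: the ControlPlaneError type, the renew_key_lease call, the take_database_offline callback, and the lease lifetimes are all hypothetical.

```python
import time

LEASE_TTL_SECONDS = 300     # hypothetical lease lifetime
RENEW_MARGIN_SECONDS = 60   # start renewing well before expiry


class ControlPlaneError(Exception):
    """Raised when the (hypothetical) storage control plane rejects or times out a call."""


class EncryptionKeyLease:
    """Toy model of a lease on a database volume's encryption key."""

    def __init__(self, control_plane, volume_id):
        self.control_plane = control_plane
        self.volume_id = volume_id
        self.expires_at = 0.0

    def is_valid(self):
        return time.time() < self.expires_at

    def renew(self):
        # Fails with ControlPlaneError when the control plane is overloaded.
        self.control_plane.renew_key_lease(self.volume_id)
        self.expires_at = time.time() + LEASE_TTL_SECONDS


def renewal_loop(lease, take_database_offline):
    """Renew the lease ahead of expiry. If renewals keep failing past
    expiry, the encryption key can no longer be used and the database
    must stop serving traffic: the failure mode described above."""
    while True:
        if lease.expires_at - time.time() < RENEW_MARGIN_SECONDS:
            try:
                lease.renew()
            except ControlPlaneError:
                if not lease.is_valid():
                    take_database_offline(lease.volume_id)
                    return
        time.sleep(5)
```

An overloaded control plane thus hurts twice: new databases cannot be created, and existing databases drop offline as their leases lapse.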
AWS is taking multiple steps to prevent a recurrence of this issue, including:
- reducing the maximum size of a "deployment domain", so that the load placed on the storage control system by encryption key lease refreshes during a deployment is reduced
- freezing all updates to the Aurora storage management software until further improvements are made to how these updates are applied
- tuning alarm thresholds to alert engineers of any database connectivity issues more quickly
- implementing auto-rollback procedures based on those fine-tuned alarm thresholds, so that customer impact is minimized significantly
- tuning auto-scaling parameters so that the storage control system can handle surges in API call volume automatically
- enhancing the protocol between the database and the control plane to handle transient failures more effectively (see the sketch after this list)

We are working with AWS to ensure that these remediation items are completed to prevent this from happening again, and we are also exploring potential solutions for failing over when an AWS region is unavailable due to a service outage.
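As a rough illustration of the last remediation item, the sketch below retries transient control-plane failures with jittered exponential backoff. The TransientError type and the parameter values are assumptions for the example, not AWS's actual protocol.

```python
import random
import time


class TransientError(Exception):
    """Placeholder for a retryable control-plane failure (e.g., a timeout)."""


def call_with_backoff(operation, max_attempts=5, base_delay=0.5, max_delay=30.0):
    """Retry a control-plane call with full-jitter exponential backoff,
    so that short-lived control-plane overload is ridden out instead of
    being treated as fatal."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # give up after the final attempt
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, delay))  # jitter spreads retries out
```

With such a wrapper, a lease renewal like the one sketched earlier could be invoked as call_with_backoff(lease.renew), surviving a brief control-plane overload without the lease ever expiring; the jitter also keeps many databases from retrying in lockstep and re-overloading the control plane.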
FOR MORE INFORMATION
For current system status information about your Zendesk, check out our system status page. During an incident, you can also receive status updates by following @ZendeskOps on Twitter. The summary of our post-mortem investigation is usually posted here a few days after the incident has ended. If you have additional questions about this incident, please log a ticket with us.