SUMMARY
On August 31, 2021, between 14:29 UTC and 14:33 UTC, some Zendesk customers hosted in the US West region (Pod 13 & Pod 20) experienced about 4 minutes of service degradation and high error rates when accessing Zendesk Support products.
Between 18:00 UTC and 22:10 UTC, some Zendesk customers in the same US West region (Pod 13 & Pod 20) experienced a brief period of service degradation and errors in the Support, Guide, Billing, and Talk products, impacting about 10% of traffic in the region.
15:05 UTC | 8:05 PT
The team has applied a fix for the issue regarding the Server Error message customers on Pod 20 were facing. We are seeing stability and improvement in functionality. Please let us know if you continue having issues.
14:16 UTC | 7:16 PT
We have identified the likely cause of the Server Error message for customers on Pod 20, and are working on solutions to fix it. We will provide more information as it is available.
13:53 UTC | 6:53 PT
We are receiving new reports from customers on Pod 20 seeing Server Error messages when trying to access Zendesk. Investigation is underway.
POST-MORTEM
Root Cause Analysis
This incident was caused by connectivity issues in a single AWS Availability Zone within the US-WEST-2 Region that impacted the Network Load Balancer and NAT Gateway services.
AWS Summary
Beginning at 10:58 AM PDT, AWS experienced network connectivity issues for Network Load Balancer, NAT Gateway and PrivateLink endpoints within the US-WEST-2 Region. At 2:45 PM, some Network Load Balancers, NAT Gateways and PrivateLink endpoints began to see recovery and by 3:35 PM, all affected Network Load Balancers, NAT Gateways and PrivateLink endpoints had fully recovered. The issue has been resolved and the service is operating normally.
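The AWS times above are given in PDT, while the Zendesk timeline in this notice uses UTC. As a rough aid for lining the two up, the sketch below converts the three AWS timestamps to UTC. It assumes they fall on August 31, 2021 (the AWS summary does not state the date), and the names `PDT`, `pdt_to_utc` and `aws_window_utc` are illustrative only.

```python
from datetime import datetime, timedelta, timezone

# PDT is a fixed offset of UTC-7.
PDT = timezone(timedelta(hours=-7), "PDT")


def pdt_to_utc(hour, minute, year=2021, month=8, day=31):
    """Convert a PDT wall-clock time to UTC (date assumed, see lead-in)."""
    local = datetime(year, month, day, hour, minute, tzinfo=PDT)
    return local.astimezone(timezone.utc)


# Timestamps quoted in the AWS summary, in 24-hour form.
aws_window_utc = {
    "onset": pdt_to_utc(10, 58),
    "first_recovery": pdt_to_utc(14, 45),
    "full_recovery": pdt_to_utc(15, 35),
}

for label, ts in aws_window_utc.items():
    print(f"{label}: {ts:%H:%M} UTC")
```

Under that assumption, the AWS onset corresponds to 17:58 UTC and full recovery to 22:35 UTC.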
AWS Root cause
A component within the subsystem responsible for the processing of network packets for Network Load Balancer, NAT Gateway and PrivateLink services became impaired and was no longer processing health checks successfully. This resulted in other components no longer accepting new connection requests, as well as elevated packet loss for Network Load Balancer, NAT Gateway and PrivateLink endpoints.
Resolution
We identified and confirmed the root cause as networking issues in a single Availability Zone (AZ) impacting our Network Load Balancers and NAT Gateway services. Our team redirected the default network routes for Pod 13 and Pod 20, which allowed us to evacuate the faulty AZ and restore services.
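To illustrate the general idea of redirecting default routes away from an impaired AZ, here is a minimal planning sketch. It is not the procedure our team ran. It assumes a hypothetical layout in which each route table has one default route pointing at a NAT gateway, and all identifiers (`rtb-1`, `nat-a`, `plan_default_route_changes`, and so on) are made up for the example.

```python
def plan_default_route_changes(default_routes, nat_gateway_az, faulty_az):
    """Return {route_table_id: new_nat_gateway_id} for every route table
    whose default route currently targets a NAT gateway in faulty_az.

    default_routes: {route_table_id: nat_gateway_id}  (hypothetical inventory)
    nat_gateway_az: {nat_gateway_id: availability_zone}
    """
    healthy = sorted(n for n, az in nat_gateway_az.items() if az != faulty_az)
    if not healthy:
        raise ValueError("no NAT gateway available outside the faulty AZ")

    affected = sorted(
        rtb for rtb, nat in default_routes.items()
        if nat_gateway_az[nat] == faulty_az
    )
    # Spread the affected route tables across healthy gateways round-robin.
    return {rtb: healthy[i % len(healthy)] for i, rtb in enumerate(affected)}


# Illustrative inventory only; these IDs do not describe a real environment.
nat_gateway_az = {"nat-a": "us-west-2a", "nat-b": "us-west-2b", "nat-c": "us-west-2c"}
default_routes = {"rtb-1": "nat-a", "rtb-2": "nat-b", "rtb-3": "nat-a"}

plan = plan_default_route_changes(default_routes, nat_gateway_az, "us-west-2a")
print(plan)
```

The function only computes a plan and changes nothing. Applying such a plan would be a separate step, for example through the EC2 ReplaceRoute API, and cross-AZ routing has cost and capacity trade-offs that this sketch ignores.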
Remediation Items
- Revisit runbooks for faster recovery.
- Improve monitoring and alerting for faster response time.
- Automate the AZ evacuation procedures.
FOR MORE INFORMATION
For current system status information about your Zendesk, check out our system status page. During an incident, you can also receive status updates by following @ZendeskOps on Twitter. The summary of our post-mortem investigation is usually posted here a few days after the incident has ended. If you have additional questions about this incident, please log a ticket with us.