SUMMARY
On August 31, 2021, between 14:29 UTC and 14:33 UTC, some Zendesk customers hosted in the US West region (Pod 13 & Pod 20) experienced approximately 4 minutes of service degradation and high error rates when accessing Zendesk Support products.
Between 18:00 UTC and 22:10 UTC, some Zendesk customers in the same US West region (Pod 13 & Pod 20) experienced periods of service degradation and errors from the Support, Guide, Billing, and Talk products, impacting about 10% of traffic in the region.
Timeline
19:15 UTC | 12:15 PT
We are investigating reports of slow performance, dropped calls, and possible timeouts, as well as errors in Explore, Guide, and Admin Center on Pods 13 & 20. We will provide additional updates as we learn more.
19:58 UTC | 12:58 PT
We continue to investigate a number of performance issues on Pods 13 and 20 including slow performance, delayed inbound emails, Talk degradation, and errors on Guide and the Admin Center. We will provide additional information as it becomes available.
20:28 UTC | 13:28 PT
As we continue to investigate, we are seeing improvements in the performance issues on Pods 13 and 20. We will provide additional information as it becomes available.
21:21 UTC | 14:21 PT
We are seeing improvement in the performance issues on Pods 13 and 20; there may be some lingering request delays as we recover. Please let us know if you continue to experience any disruptions to service.
22:19 UTC | 15:19 PT
While we see some improvement in general performance issues on Pods 13 and 20, Self-Service and Sales Assisted customers across all pods are unable to perform billing updates at this time.
00:49 UTC | 17:49 PT
General performance issues affecting Pods 13 and 20 are now resolved. All customers can now resume billing updates. Thank you for your patience and understanding.
POST-MORTEM
Root Cause Analysis
This incident was caused by connectivity issues in a single AWS Availability Zone within the US-WEST-2 Region, impacting AWS's Network Load Balancer and NAT Gateway services.
AWS Summary
Beginning at 10:58 AM PDT, AWS experienced network connectivity issues for Network Load Balancer, NAT Gateway, and PrivateLink endpoints within the US-WEST-2 Region. At 2:45 PM PDT, some Network Load Balancers, NAT Gateways, and PrivateLink endpoints began to see recovery, and by 3:35 PM PDT, all affected Network Load Balancers, NAT Gateways, and PrivateLink endpoints had fully recovered. The issue has been resolved and the service is operating normally.
AWS Root cause
A component within the subsystem responsible for processing network packets for the Network Load Balancer, NAT Gateway, and PrivateLink services became impaired and was no longer processing health checks successfully. This resulted in other components no longer accepting new connection requests, as well as elevated packet loss for Network Load Balancer, NAT Gateway, and PrivateLink endpoints.
Resolution
We identified and confirmed the root cause as networking issues in a single Availability Zone (AZ) impacting our Network Load Balancers and NAT Gateway services. Our team redirected the default network routes for Pod 13 and Pod 20, which allowed us to evacuate the faulty AZ and restore services.
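The post-mortem does not publish the exact commands used, but the route redirection it describes can be sketched as follows. This is a hypothetical illustration, assuming a VPC where each subnet's default route (0.0.0.0/0) points at a NAT Gateway; evacuating the impaired AZ then amounts to repointing that route at a NAT Gateway in a healthy AZ. All resource IDs below are placeholders, not Zendesk's actual infrastructure.

```python
def build_route_replacement(route_table_id, healthy_nat_gateway_id):
    """Build the parameters for an EC2 ReplaceRoute call that shifts a
    subnet's default route to a NAT Gateway in a healthy AZ.

    Kept as a pure function so the evacuation change can be reviewed
    (or dry-run) before it is applied.
    """
    return {
        "RouteTableId": route_table_id,
        "DestinationCidrBlock": "0.0.0.0/0",  # the default route
        "NatGatewayId": healthy_nat_gateway_id,
    }

# With boto3, the change would be applied roughly like this
# (requires AWS credentials, so shown here as a comment only):
#   import boto3
#   ec2 = boto3.client("ec2", region_name="us-west-2")
#   ec2.replace_route(**build_route_replacement("rtb-EXAMPLE", "nat-EXAMPLE"))
```

Keeping the parameter construction separate from the API call is one way to make the same change reviewable during an incident and reusable by automation later.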
Remediation Items
- Revisit runbooks for faster recovery.
- Improve monitoring and alerting for faster response time.
- Automate the AZ evacuation procedures.
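The last remediation item, automating AZ evacuation, implies a detection step: deciding which AZ to evacuate. A minimal sketch of that decision logic is below; the metric source, the 10% threshold, and the rule that exactly one AZ must be an outlier are all assumptions, not Zendesk's actual criteria.

```python
def az_to_evacuate(error_rates, threshold=0.10):
    """Given a mapping of AZ name -> observed error rate, return the AZ
    to evacuate when exactly one AZ exceeds the threshold while the
    rest remain healthy; return None otherwise (ambiguous situations
    should fall back to a human operator).
    """
    unhealthy = [az for az, rate in error_rates.items() if rate > threshold]
    return unhealthy[0] if len(unhealthy) == 1 else None
```

For example, `az_to_evacuate({"usw2-az1": 0.35, "usw2-az2": 0.01, "usw2-az3": 0.02})` would flag `usw2-az1`, while a region-wide outage (all AZs unhealthy) would return None and escalate instead of evacuating.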
FOR MORE INFORMATION
For current system status information about your Zendesk, check out our system status page. During an incident, you can also receive status updates by following @ZendeskOps on Twitter. The summary of our post-mortem investigation is usually posted here a few days after the incident has ended. If you have additional questions about this incident, please log a ticket with us.