On February 3, 2022, from 10:14 UTC to 11:36 UTC, customers on POD 17 experienced performance issues and slow loading of the Support platform interface.
11:05 UTC | 03:05 PT
We are investigating reports of slow performance in Support for our customers on POD 17. More updates to follow.
11:19 UTC | 03:19 PT
We continue to work on addressing slow performance for our customers on POD 17. We will provide a further update in 30 minutes.
11:40 UTC | 03:40 PT
We are seeing improvement in the slow performance impacting Support customers on POD 17 and continue to monitor behavior. We will provide further updates as soon as possible.
12:40 UTC | 04:40 PT
We are happy to report that the issues causing slow performance in Support for our POD 17 customers have now been resolved. We apologize for the inconvenience.
Root Cause Analysis
This incident was caused by capacity issues. Two router gateway host applications were rebooted and were unable to come back online afterwards, which overloaded the system.
As a result, requests sent from the same availability zones were unevenly distributed: the healthier availability zone received many more requests than it could process, which caused the capacity issues within the redistribution points.
To fix this issue, the code was upgraded to a newer version that allowed the missing hosts to come back online, so that requests were again distributed properly across all hosts and availability zones.
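As a rough illustration of the capacity effect described above, the following sketch models traffic being spread evenly over whichever hosts are still healthy. It is not the actual routing logic: the zone names, host counts, request rate, and per-host capacity are all hypothetical values chosen for the example, and `zone_load` is an illustrative helper.

```python
from dataclasses import dataclass


@dataclass
class Zone:
    name: str            # hypothetical availability zone label
    healthy_hosts: int   # router gateway hosts currently serving traffic


def zone_load(zones, total_rps, host_capacity_rps):
    """Spread total_rps evenly over healthy hosts and report load per zone."""
    healthy_total = sum(z.healthy_hosts for z in zones)
    if healthy_total == 0:
        raise ValueError("no healthy hosts to serve traffic")
    per_host = total_rps / healthy_total
    report = {}
    for z in zones:
        received = per_host * z.healthy_hosts
        capacity = host_capacity_rps * z.healthy_hosts
        report[z.name] = {
            "received_rps": received,
            "capacity_rps": capacity,
            "overloaded": received > capacity,
        }
    return report


# Assumed numbers: 1800 requests/s in total, 400 requests/s per host.
before = zone_load([Zone("az-a", 3), Zone("az-b", 3)], 1800, 400)
# Two hosts in one zone fail to come back after the reboot.
after = zone_load([Zone("az-a", 3), Zone("az-b", 1)], 1800, 400)

for label, report in (("before", before), ("after", after)):
    for name, stats in report.items():
        print(label, name, stats)
```

Under these assumed numbers, the zone that kept all of its hosts goes from 900 to 1350 requests per second against a capacity of 1200, so the healthier zone takes most of the traffic and still cannot process it.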
Remediation Items
- Add more intermediaries between endpoint devices. [Done]
- Increase alerting for errors on the intermediaries mentioned above. [Done]
- Run further tests of how capacity is spread across availability zones to observe how the NLBs balance traffic. [Prioritized]
- Adjust retries to avoid system overload. [Prioritized]
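To illustrate the kind of retry adjustment mentioned in the last item, here is a minimal sketch of one common approach: a capped number of attempts with exponential backoff and random jitter, so that failed requests are not all resent at once. This is a generic pattern, not the change that was actually made. The function names, attempt limit, and delay values are assumptions for the example.

```python
import random
import time


def backoff_delays(max_attempts, base_delay=0.1, max_delay=5.0, rng=random.random):
    """Yield the wait before each retry: a random fraction of a doubling cap."""
    for attempt in range(max_attempts - 1):
        cap = min(max_delay, base_delay * (2 ** attempt))
        yield rng() * cap


def call_with_retries(operation, max_attempts=4, sleep=time.sleep, **backoff):
    """Call operation(), retrying on failure up to max_attempts times in total."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    delays = backoff_delays(max_attempts, **backoff)
    last_error = None
    for _ in range(max_attempts):
        try:
            return operation()
        except Exception as exc:  # real code would catch only retryable errors
            last_error = exc
            delay = next(delays, None)
            if delay is None:
                break  # attempts exhausted, give up instead of adding load
            sleep(delay)
    raise last_error
```

Bounding the number of attempts limits how much extra traffic retries can add when a dependency is already overloaded, and the jitter spreads the remaining retries out over time.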
FOR MORE INFORMATION
For current system status information about your Zendesk, check out our system status page. During an incident, you can also receive status updates by following @ZendeskOps on Twitter. The summary of our post-mortem investigation is usually posted here a few days after the incident has ended. If you have additional questions about this incident, please log a ticket with us.