On June 25, 2019, from 07:38 UTC to 09:22 UTC, Zendesk Support and Guide customers on Pod 17 experienced a service degradation that caused slow performance in Support, and timeouts or slow redirects to their Guide from the "/" => "/hc" endpoint.
09:18 UTC | 02:18 PT
We are investigating performance issues for our customers on Pod 17. More information to follow.
09:43 UTC | 02:43 PT
We apologise for the disruption to your Zendesk Support account on Pod 17. Our teams are working hard to identify the cause of the degradation. Next update in 30 minutes.
10:17 UTC | 03:17 PT
Our teams have identified a potential root cause and continue to work to mitigate the impact. Next update in 30 minutes.
10:49 UTC | 03:49 PT
We've implemented a fix for the service degradation on Pod 17. Performance should be improving. Our team is monitoring.
Root Cause Analysis
The Pod 17 autoscaler failed to scale up to handle additional traffic because the autoscaler's metrics server was unavailable, which in turn was caused by an incorrect CPU quota setting. Once the metrics server began failing to scrape metrics, it could not respond when the autoscaler requested them. The autoscaler then entered a "skip scaling" state, in which it makes no scaling changes if any one of its metrics cannot be fetched. As a result, the cluster remained undersized for the incoming traffic.
Once our technical teams identified that the autoscaler was inactive and the cluster was undersized, our operations team resolved the problem by manually increasing the minimum number of replicas to a value that could support the peak traffic volume.
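The "skip scaling" behaviour and the manual fix can be illustrated with a minimal sketch. This is not the actual autoscaler code: the function name, the metric names, the proportional scaling rule, and the assumption that the replica floor is enforced even when metrics are missing are all hypothetical, chosen only to show why one unavailable metric freezes scaling and why raising the minimum replica count still takes effect.

```python
import math
from typing import Dict, Optional


def desired_replicas(current: int,
                     observed: Dict[str, Optional[float]],
                     targets: Dict[str, float],
                     min_replicas: int,
                     max_replicas: int) -> int:
    """Return the replica count a simplified, hypothetical autoscaler would pick.

    observed maps a metric name to its current value, or None when the
    metrics server could not provide it.
    """
    # Assumption: the floor and ceiling are enforced before metrics are read,
    # so raising min_replicas works even while metrics are unavailable.
    if current < min_replicas:
        return min_replicas
    if current > max_replicas:
        return max_replicas

    # "Skip scaling": if any one metric cannot be fetched, change nothing.
    if any(observed.get(name) is None for name in targets):
        return current

    # Otherwise scale in proportion to the most demanding metric.
    ratio = max(observed[name] / targets[name] for name in targets)
    wanted = math.ceil(current * ratio)
    return max(min_replicas, min(max_replicas, wanted))
```

In this sketch, a healthy call such as `desired_replicas(10, {"cpu": 90.0}, {"cpu": 60.0}, 5, 50)` scales up, while the same call with `{"cpu": None}` leaves the replica count unchanged until `min_replicas` is raised above the current value.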
Remediation Items
- Implement monitoring and alerts for the autoscaler and metrics server.
- Review and update app server fleet monitoring and escalation process.
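As a rough illustration of the first item, the check below flags the two failure signals described in the root cause analysis: a metrics server that has stopped scraping, and an autoscaler that keeps skipping its scaling evaluations. The function name, input signals, and threshold values are assumptions made for this sketch and do not describe Zendesk's actual monitoring.

```python
from typing import List


def autoscaler_alerts(seconds_since_last_scrape: float,
                      consecutive_skipped_evaluations: int,
                      max_scrape_age_s: float = 120.0,
                      max_skipped: int = 3) -> List[str]:
    """Return alert messages for a hypothetical autoscaler health check."""
    alerts: List[str] = []

    # The metrics server has not completed a scrape recently enough.
    if seconds_since_last_scrape > max_scrape_age_s:
        alerts.append(
            f"metrics server: no successful scrape for "
            f"{seconds_since_last_scrape:.0f}s"
        )

    # The autoscaler has been in the "skip scaling" state for several
    # evaluations in a row, so the fleet size is effectively frozen.
    if consecutive_skipped_evaluations >= max_skipped:
        alerts.append(
            f"autoscaler: {consecutive_skipped_evaluations} consecutive "
            f"evaluations skipped"
        )

    return alerts
```

Alerting on both signals separately matters because either one alone can leave the cluster undersized without any visible error in the application itself.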
FOR MORE INFORMATION
For current system status information about your Zendesk, check out our system status page. During an incident, you can also receive status updates by following @ZendeskOps on Twitter. The summary of our post-mortem investigation is usually posted here a few days after the incident has ended. If you have additional questions about this incident, please log a ticket with us.