On September 18, 2019, from 13:00 to 14:42 UTC, customers on Zendesk Support, Guide, and Talk on Pods 14, 19, and 23 experienced dropped Talk calls and degraded performance.
15:11 UTC | 08:11 PT
We've confirmed that the issues affecting performance on Pods 14, 19, and 23 are now resolved. If you continue to experience issues, please let us know.
14:42 UTC | 07:42 PT
We continue to investigate the performance issues impacting customers on Pods 14, 19, and 23. We will provide a further update in one hour.
14:05 UTC | 07:05 PT
We continue to investigate the performance issues impacting customers on Pods 14, 19, and 23. Please accept our apologies for the disruption to your Zendesk service.
Root Cause Analysis
This incident was caused by a capacity limit (network bandwidth exhaustion) being reached in our DNS cache. The limit was hit due to increased query volume from a platform migration process that had previously been rolled out to every other pod running our classic application.
The cause was not immediately clear, however, and an initial investigation into our infrastructure configuration datastore delayed identification of the source of the issue.
Instance upgrades were required to increase network capacity. When this work was completed at 14:45 UTC, all services were restored.
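Bandwidth exhaustion of this kind typically surfaces first as slow or failed lookups at the resolver. The sketch below shows how a lightweight latency probe could shorten time to detection; it is a minimal illustration using only Python's standard library, and the hostnames and threshold are assumptions rather than Zendesk's actual monitoring.

```python
import socket
import time

# Names resolved through the local resolver path; illustrative placeholders.
PROBE_NAMES = ["app.internal.example.com", "db.internal.example.com"]
LATENCY_THRESHOLD_S = 0.5  # flag lookups slower than this


def probe_dns():
    """Resolve each probe name, reporting failures and slow lookups."""
    alerts = []
    for name in PROBE_NAMES:
        start = time.monotonic()
        try:
            socket.getaddrinfo(name, None)
        except socket.gaierror as exc:
            alerts.append(f"{name}: resolution failed ({exc})")
            continue
        elapsed = time.monotonic() - start
        if elapsed > LATENCY_THRESHOLD_S:
            alerts.append(f"{name}: slow resolution ({elapsed:.2f}s)")
    return alerts


if __name__ == "__main__":
    for alert in probe_dns():
        print(alert)
```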
Remediation Items
- Improve operational monitoring to reduce time to detect internal DNS infrastructure issues
- Tune DNS client configuration to reduce query volume (see the sketch after this list)
- Scale our internal DNS resources to increase capacity
- Investigate updates to internal DNS architecture
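On the client-tuning item, one common way to cut query volume is a short-lived in-process cache in front of the resolver, so repeated lookups of the same name within a TTL never reach the DNS servers. Below is a minimal sketch of the idea, assuming Python and its standard library; the `cached_getaddrinfo` wrapper and the 30-second TTL are illustrative, not Zendesk's actual configuration.

```python
import socket
import time

CACHE_TTL_S = 30.0  # how long an answer is reused before re-querying; illustrative

# (name, port) -> (expiry timestamp, cached getaddrinfo result)
_cache = {}


def cached_getaddrinfo(name, port=None):
    """Resolve name:port, serving repeat lookups from a TTL-bounded cache."""
    now = time.monotonic()
    key = (name, port)
    entry = _cache.get(key)
    if entry and entry[0] > now:
        return entry[1]  # cache hit: no query leaves the process
    result = socket.getaddrinfo(name, port)
    _cache[key] = (now + CACHE_TTL_S, result)
    return result


if __name__ == "__main__":
    # The second call within the TTL is served locally, so only one
    # query reaches the upstream resolvers.
    cached_getaddrinfo("example.com", 443)
    cached_getaddrinfo("example.com", 443)
```

In practice this tuning is more likely done in an existing caching layer (for example nscd or dnsmasq) than hand-rolled in application code, but the effect is the same: every cache hit is one fewer query against the upstream resolvers.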
FOR MORE INFORMATION
For current system status information about your Zendesk, check out our system status page. During an incident, you can also receive status updates by following @ZendeskOps on Twitter. The summary of our post-mortem investigation is usually posted here a few days after the incident has ended. If you have additional questions about this incident, please log a ticket with us.