On July 17, 2019, from 18:55 UTC to 19:38 UTC, Zendesk Talk customers on all pods experienced dropped calls.
20:38 UTC | 13:38 PT
We're happy to report that the issues affecting Zendesk Talk stabilized at 19:38 UTC and are now resolved. We apologize for the disruption.
20:03 UTC | 13:03 PT
We are continuing to investigate the root cause of performance issues impacting Zendesk Talk across all pods. We will provide another update in 30 minutes.
19:23 UTC | 12:23 PT
We are currently investigating issues impacting Zendesk Talk on multiple pods. We will provide further information shortly.
Root Cause Analysis
To keep up with growth, our VoIP provider began increasing the size of its event processing fleet. This placed additional connection load on the billing database.
At the same time as the fleet increase, additional traffic was migrated to this database, which resulted in additional reads from the master database.
The increased overall load removed the headroom that the master database typically has.
When customer traffic spiked, the master database exceeded its connection limits and started dropping connections.
While our provider worked to re-establish service, we took a number of mitigating measures, including graceful degradation (temporarily routing calls to voicemail) and retries on failure.
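As an illustration only, the two mitigations named above (retries on failure, then graceful degradation to voicemail) can be sketched as follows. This is a minimal sketch under stated assumptions, not the code that was actually deployed: the names `ConnectionDropped`, `route_call`, and `connect`, as well as the attempt count and delays, are hypothetical.

```python
import time


class ConnectionDropped(Exception):
    """Hypothetical error for a connection dropped by the provider."""


def route_call(connect, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Try to connect a call, retrying on failure.

    `connect` is a hypothetical callable that returns a destination
    or raises ConnectionDropped. If every attempt fails, degrade
    gracefully by routing the call to voicemail instead of dropping it.
    """
    for attempt in range(attempts):
        try:
            return connect()
        except ConnectionDropped:
            # Back off exponentially between attempts so retries do not
            # add to the load on an already saturated database.
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)
    return "voicemail"
```

Backing off between attempts matters in an incident like this one, because immediate retries would add connection attempts to a database that is already at its limit.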
Remediation Items (from our provider)
- Revise our capacity planning to better account for increases in connections
- Upgrade the database to a newer version for efficiency gains
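To make the capacity planning item concrete, the relationship between fleet size and connection headroom can be sketched as below. This is a hedged illustration, not the provider's actual planning model: the function name, parameters, and all numbers are assumptions chosen only to show the arithmetic.

```python
def connection_headroom(workers, conns_per_worker, other_conns, max_connections):
    """Return the fraction of the connection limit left unused.

    All inputs are hypothetical planning figures:
    workers          - size of the event processing fleet
    conns_per_worker - database connections each worker holds
    other_conns      - connections from other traffic on the same database
    max_connections  - the database's configured connection limit
    """
    expected = workers * conns_per_worker + other_conns
    return 1 - expected / max_connections


# Illustrative numbers only: doubling the fleet while the limit stays
# fixed shrinks headroom from 50% to roughly 10%, which leaves little
# room for a traffic spike.
before = connection_headroom(100, 4, 100, 1000)
after = connection_headroom(200, 4, 100, 1000)
```

A check like this, run before each fleet increase or traffic migration, would flag when the remaining headroom falls below whatever threshold is considered safe.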
FOR MORE INFORMATION
For current system status information about your Zendesk, check out our system status page. During an incident, you can also receive status updates by following @ZendeskOps on Twitter. The summary of our post-mortem investigation is usually posted here a few days after the incident has ended. If you have additional questions about this incident, please log a ticket with us.