SUMMARY
On the 25th of November 2020, between 18:07 UTC and 19:02 UTC, Zendesk Talk customers on all Pods encountered multiple issues, including dropped calls and delays in call connection and transfer.
Timeline
18:44 UTC | 10:44 PT
We’re investigating dropped calls and other Talk issues across all pods. Our team is actively investigating and will provide updates as we have more information.
19:15 UTC | 11:15 PT
We’re working with our Talk provider to continue investigating Talk issues across all pods. We’ll provide an update within the next hour.
21:43 UTC | 13:43 PT
We continue to engage with our voice provider, who is working to resolve their remaining issues. We have seen improvements and are continuing to monitor the situation. We will provide another update when we have more information.
00:38 UTC | 16:38 PT
We’ve seen a significant improvement in call performance and are awaiting an update from our voice provider to confirm that their remaining issues have been fully resolved. We will provide a final update once we receive confirmation from our voice provider.
04:53 UTC | 20:53 PT
Our service provider has confirmed that all issues have been resolved. Thank you for your patience.
POST-MORTEM
Root Cause Analysis
This incident was caused by an internal service failure in our service provider’s infrastructure. The internal service failed to handle a downstream third-party outage and, as a result, exhausted its system resources. Once resource exhaustion reached a tipping point, the service could no longer respond to call fetch API requests.
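As an illustration only, one common way to keep a service from exhausting its resources while a downstream dependency is unavailable is a circuit breaker, which stops calling the dependency after repeated failures and fails fast until a cool-down period has passed. The sketch below is not the service provider's implementation, which is not described in this report. The names CircuitBreaker, CircuitOpenError, failure_threshold and reset_after_seconds are hypothetical and exist only for this example.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without contacting the dependency."""


class CircuitBreaker:
    """Hypothetical, minimal circuit breaker (illustration only)."""

    def __init__(self, failure_threshold=3, reset_after_seconds=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_seconds = reset_after_seconds
        self.clock = clock          # injectable so the breaker can be tested
        self.failures = 0           # consecutive failures seen so far
        self.opened_at = None       # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_seconds:
                # Fail fast: no threads, sockets or memory are tied up
                # waiting on a dependency that is known to be down.
                raise CircuitOpenError("dependency marked unavailable")
            # Cool-down elapsed: fall through and allow one trial call.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # Success: close the circuit and reset the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

In this sketch a caller wraps each outbound request in breaker.call(...). While the circuit is open, requests are rejected immediately instead of queuing behind a dependency that cannot answer, which is what bounds resource usage.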
Resolution
To fix this issue, our service provider replaced the failed internal service nodes and increased service capacity. Once these steps were taken, the call drops and delays stopped.
Remediation Items
To prevent a recurrence of this issue, our service provider is planning to take the following steps:
- Implement more robust monitoring for this specific type of issue. [Scheduled]
- Establish additional processes and changes around this type of issue to improve robustness and their own response time. [Scheduled]
Zendesk Engineering has committed to the following remediation actions to help mitigate similar issues in the future:
- Increase the memory allocated to these processes in our own services to create a substantial buffer for our customers during such events. [Scheduled]
- Review and improve monitoring for the error associated with this type of incident. [Scheduled]
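As an illustration only, monitoring for a specific error is often built around an error rate measured over a sliding time window, with an alert raised once the rate crosses a threshold. The sketch below is an assumption about what such a check could look like, not a description of Zendesk's or the service provider's monitoring. The class ErrorRateMonitor and its parameters window_seconds, threshold and min_requests are hypothetical.

```python
from collections import deque


class ErrorRateMonitor:
    """Hypothetical sliding-window error-rate check (illustration only)."""

    def __init__(self, window_seconds=60.0, threshold=0.2, min_requests=10):
        self.window_seconds = window_seconds
        self.threshold = threshold        # alert when error rate >= threshold
        self.min_requests = min_requests  # avoid alerting on tiny samples
        self.events = deque()             # (timestamp, is_error) pairs

    def record(self, timestamp, is_error):
        """Record the outcome of one request at the given time (seconds)."""
        self.events.append((timestamp, is_error))
        self._evict(timestamp)

    def _evict(self, now):
        # Drop events that have fallen out of the window.
        while self.events and now - self.events[0][0] > self.window_seconds:
            self.events.popleft()

    def should_alert(self, now):
        """Return True if the error rate in the current window is too high."""
        self._evict(now)
        total = len(self.events)
        if total < self.min_requests:
            return False
        errors = sum(1 for _, is_error in self.events if is_error)
        return errors / total >= self.threshold
```

The min_requests guard is a design choice in this sketch: without it, a single failed request during a quiet period would produce a 100% error rate and a false alarm.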
FOR MORE INFORMATION
For current system status information about your Zendesk, check out our system status page. During an incident, you can also receive status updates by following @ZendeskOps on Twitter. The summary of our post-mortem investigation is usually posted here a few days after the incident has ended. If you have additional questions about this incident, please log a ticket with us.