Service Incident - March 7th, 2021 - Network Connectivity Issues on Pod 15

SUMMARY

On March 7, 2021, from 16:16 UTC to 18:15 UTC, Pod 15 customers in certain regions experienced degraded performance or were unable to access their accounts across all Zendesk products.

Timeline

21:36 UTC | 13:36 PT
We’re happy to report that the network connectivity issues impacting some customers on pod 15 have been resolved and latencies reduced to normal levels. Apologies for any inconvenience caused.

20:23 UTC | 12:23 PT
Our services on pod 15 remain stable while we work with our providers toward full resolution. We'll provide our next update in one hour.

19:21 UTC | 11:21 PT
Our services on pod 15 remain stable with an increase in latency while our providers continue to mitigate the underlying networking problems. Our next update will be in one hour.

18:41 UTC | 10:41 PT
We have taken steps to mitigate the network connectivity issues some customers may have experienced on Pod 15 and are seeing improvement. We are continuing to work with our providers to fully remediate the issue.

18:07 UTC | 10:07 PT
We continue to investigate network connectivity issues affecting some customers on pod 15. We'll provide more info as it becomes available.

17:28 UTC | 09:28 PT
We're currently investigating network connectivity issues that may affect some customers on pod 15. We're working with our providers to learn more.

Root Cause Analysis

Zendesk’s cloud service provider experienced an internet connectivity issue that was caused by their external provider.

Resolution

At 17:24 UTC, our cloud service provider informed us they had started shifting traffic away from their external provider to recover from the incident. At 18:11 UTC, Zendesk teams went forward with a mitigation plan to reroute dynamic traffic for customers on Pod 15. This dramatically improved access for all customers, though it introduced higher latency due to cross-region traffic routing. At 19:00 UTC, our cloud service provider resolved the issue and connectivity was fully restored.

At 21:06 UTC, we reverted the earlier traffic-rerouting mitigation. At 21:07 UTC, traffic was successfully routed back to Pod 15, and we continued monitoring our system's health.

Remediation Items

Discuss root cause analysis with our cloud service provider and define preventative actions to more quickly route traffic away from bad networks
Improve our incident detection time for regional failures
Investigate fallback mechanisms for attachment serving
Investigate our ability to change routing behavior when bad networks appear
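
As a rough illustration of the detection and routing items above, the sketch below flags client regions whose connectivity probes toward a pod fail too often, then builds a routing table that sends only those regions to a fallback. It is not a description of Zendesk's tooling: the ProbeWindow type, the region and pod labels, and the 20% failure threshold are all hypothetical and chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class ProbeWindow:
    """Connectivity probe results for one client region over a short window."""
    region: str    # hypothetical region label, e.g. "us-east"
    attempts: int  # probes sent toward the pod in this window
    failures: int  # probes that timed out or were refused


def failing_regions(windows, max_failure_rate=0.2, min_attempts=10):
    """Return the regions whose probe failure rate exceeds max_failure_rate.

    Windows with fewer than min_attempts probes are skipped so that a
    handful of lost probes cannot trigger a false alarm on its own.
    """
    flagged = []
    for w in windows:
        if w.attempts < min_attempts:
            continue
        if w.failures / w.attempts > max_failure_rate:
            flagged.append(w.region)
    return flagged


def routing_table(regions, flagged, home_pod, fallback_pod):
    """Map each region to the pod that should serve it.

    Flagged regions go to the fallback (accepting higher cross-region
    latency); every other region stays on its home pod.
    """
    flagged_set = set(flagged)
    return {
        region: (fallback_pod if region in flagged_set else home_pod)
        for region in regions
    }


if __name__ == "__main__":
    # Illustrative numbers only; these are not measurements from the incident.
    windows = [
        ProbeWindow("us-east", attempts=100, failures=45),
        ProbeWindow("eu-west", attempts=100, failures=2),
        ProbeWindow("ap-south", attempts=4, failures=4),
    ]
    flagged = failing_regions(windows)
    table = routing_table([w.region for w in windows], flagged, "pod15", "fallback")
    print(flagged, table)
```

Evaluating failure rates per region, rather than for the pod as a whole, is what lets a partial outage like this one surface quickly: an average across all regions could stay under the threshold while one region is largely cut off.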

FOR MORE INFORMATION

For current system status information about your Zendesk, check out our system status page. During an incident, you can also receive status updates by following @ZendeskOps on Twitter. The summary of our post-mortem investigation is usually posted here a few days after the incident has ended. If you have additional questions about this incident, please log a ticket with us.