Summary
On June 15, 2022, from 10:21 UTC to 11:38 UTC, Zendesk customers in the India region experienced increased network errors that affected their use of all Zendesk products.
Timeline
11:31 UTC | 04:31 PT
We have received reports of users receiving 502 Bad Gateway errors. We are working with our CDN provider to mitigate the issue. Thank you for your patience, and we apologize for the inconvenience.
11:48 UTC | 04:48 PT
A fix for the issue causing the 502 Bad Gateway errors has been deployed. Thank you for your patience while we investigated this. Please refresh the page and try again.
Root Cause Analysis
This incident was caused by network connectivity issues in our CDN partner’s network. Our partner observed performance and reliability issues in some data centers across India, Indonesia, and Eastern Europe. A database cleanup removed old firewall rulesets that were still referenced in the partner’s internal firewall rules, which resulted in a fallback scenario. The fallback caused underlying software components to emit warning log messages about the issue. The volume of these messages overloaded our provider’s log processing system, which in turn degraded request processing, causing higher latency and errors. The connectivity errors were detected by Zendesk’s client-side monitoring. However, one of our alerts incorrectly signaled that the impact had subsided while it was still ongoing. This delayed our incident response and extended the impact window.
Resolution
During 2022, Zendesk engineering has been investing heavily in disaster recovery mitigations to protect our customers from this kind of impact. We used these processes to remove the vendor from the critical path of traffic for our customers in India. This fully mitigated the impact for Zendesk’s customers.
Remediation Items
- Increase alert urgency to page the correct teams earlier.
- Resolve cause of Zendesk’s monitoring alerts prematurely indicating end of impact.
- Automate disaster recovery mitigations so as not to require engineer involvement.
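
As an illustration of the second remediation item, one common way to keep an alert from signaling recovery too early is to require several consecutive healthy monitoring windows before it resolves. The sketch below is a minimal, hypothetical example of that idea and is not Zendesk’s actual monitoring code; the class name `ImpactAlert`, the 5% error threshold, and the three-window recovery requirement are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class ImpactAlert:
    """Hypothetical alert that resolves only after sustained recovery."""

    error_threshold: float = 0.05  # assumed: fire at >= 5% failed requests
    recovery_windows: int = 3      # assumed: healthy windows needed to resolve
    firing: bool = False
    _healthy_streak: int = 0

    def observe(self, error_rate: float) -> bool:
        """Feed one monitoring window's error rate; return True while firing."""
        if error_rate >= self.error_threshold:
            # Any unhealthy window (re)starts the alert and resets the streak.
            self.firing = True
            self._healthy_streak = 0
        elif self.firing:
            # A healthy window only counts toward recovery; it does not resolve
            # the alert until enough of them arrive in a row.
            self._healthy_streak += 1
            if self._healthy_streak >= self.recovery_windows:
                self.firing = False
                self._healthy_streak = 0
        return self.firing


alert = ImpactAlert()
# A brief dip in errors (second window) does not resolve the alert.
states = [alert.observe(rate) for rate in [0.20, 0.01, 0.18, 0.01, 0.01, 0.01]]
print(states)  # [True, True, True, True, True, False]
```

With this approach, a short lull in errors during an ongoing incident keeps the alert firing, at the cost of reporting genuine recovery a few windows later.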
For More Information
For current system status information about your Zendesk, check out our system status page. The summary of our post-mortem investigation is usually posted here a few days after the incident has ended. If you have additional questions about this incident, please log a ticket with us via ZBot Messaging within the Widget.