SUMMARY
From 00:31 UTC on July 8, 2021 to 08:57 UTC on July 9, 2021, customers initially reported login, ticket creation, and performance issues in Chat. This later expanded to issues in Explore, where data sync was delayed. After a workaround was implemented for Chat, customers experienced only the Explore data sync delay for a while longer before the issue was fully resolved.
Timeline
10:25 UTC | 03:25 PT
We appreciate your patience while we addressed issues impacting Chat and Explore, as well as platform latency in general. We can confirm these issues have now been fully resolved.
07:51 UTC | 00:51 PT
We are still investigating the latency issues related to Explore data sync with our partners. We will update you when we have more pertinent information to share. Thank you so much for being so patient.
06:50 UTC | 23:50 PT
We are still getting to the root cause of the latency issues, which are also related to Explore data sync delays. We are working very closely with our partners to investigate further. Next update in 60 minutes.
05:50 UTC | 22:50 PT
Chat ticket creation and agent logins have now been resolved. Thank you so much for being so patient as we worked through those issues. We are currently working with our partners to identify increased latency issues. Next update in 60 minutes.
05:20 UTC | 22:20 PT
We can confirm recovery for ticket creation and agent logins in Chat. Please reach out if you still see issues. We are continuing Explore latency investigations and believe this could be related to potential network issues. More info in 30 minutes.
04:49 UTC | 21:49 PT
We have confirmed reports that ticket creation and agent logins in Chat are fixed. Please continue to reach out if you do not see recovery for ticket creation and agent logins. Investigations are ongoing regarding latency impacting Explore. More info in 30 minutes.
04:20 UTC | 21:20 PT
We are seeing recovery on Chat, and are expecting ticket creation and agent logins to be successful. Please clear your browser cache and retry. With Explore data syncs, we are now investigating jointly with our CDN provider. More info in 30 mins.
03:51 UTC | 20:51 PT
We continue to investigate issues with ticket creation and agent login in Chat, as well as potential delays in data syncs in Explore. Investigations into potential impact to other products are ongoing. Next update in 30 mins.
03:21 UTC | 20:21 PT
We have confirmed reports around issues with ticket creation and agent login in Chat. We are also investigating issues around Explore. More information in 30 mins.
03:03 UTC | 20:03 PT
We're investigating performance issues in Chat impacting agent login and ticket creation, among other issues. More info to come.
POST-MORTEM
Root Cause Analysis
Due to an edge case in our Content Delivery Network (CDN) provider’s internal IP management system and its related DDoS mitigation tool, an automation on their end started to incorrectly flag some of our customers’ IPs in specific zones, as well as Zendesk’s own services, as potential DDoS sources, and consequently blocked incoming traffic from those wrongly flagged IPs.
This resulted in customers experiencing connectivity issues, timeouts, and being unable to access some of our services.
Resolution
To fix this issue, at 07:43 UTC on July 9th, 2021, the CDN provider added Zendesk's IP zone to the automation’s allowlist so that it would skip the mitigations for that zone.
Remediation Items
- CDN provider to fix internal automation to ensure that it correctly reads the IP addresses and does not start a hard mitigation process prior to confirming all details are listed for the correct IP zones.
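As a rough illustration only, the allowlist fix and the remediation item above could be sketched as follows. This is not the CDN provider's actual automation: the function names, the "monitor" fallback, and the IP ranges (documentation-only addresses) are all assumptions made for the example.

```python
import ipaddress

# Hypothetical allowlist. 203.0.113.0/24 is a documentation-only range,
# standing in for an allowlisted IP zone.
ALLOWLIST = [ipaddress.ip_network("203.0.113.0/24")]


def is_allowlisted(ip, allowlist=ALLOWLIST):
    """Return True if the address falls inside any allowlisted network."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in allowlist)


def decide_action(ip, flagged, zone_confirmed, allowlist=ALLOWLIST):
    """Decide what to do with traffic from `ip`.

    flagged        -- the (hypothetical) detector marked the IP as a potential DDoS source
    zone_confirmed -- the IP zone details were verified as correct for this address
    """
    if not flagged:
        return "allow"
    # Allowlisted zones skip mitigation entirely.
    if is_allowlisted(ip, allowlist):
        return "allow"
    # Assumed soft fallback: no hard mitigation until zone details are confirmed.
    if not zone_confirmed:
        return "monitor"
    return "block"
```

In this sketch, a flagged address inside the allowlisted zone is still allowed, and a flagged address whose zone details are unconfirmed is only monitored rather than blocked.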
FOR MORE INFORMATION
For current system status information about your Zendesk, check out our system status page. During an incident, you can also receive status updates by following @ZendeskOps on Twitter. The summary of our post-mortem investigation is usually posted here a few days after the incident has ended. If you have additional questions about this incident, please log a ticket with us.