Summary
On June 20, 2023 from 03:50 UTC to 07:06 UTC, Zendesk Support and Chat customers across all Pods may have experienced errors, disconnections, and delays when using certain Chat and Support (Agent Workspace only) features, including login, chat message delivery, reporting APIs, analytics and ticket/chat routing.
Timeline
04:24 UTC | 21:24 PT
We are investigating an issue impacting Messaging and Chat across all Pods. Symptoms may include missed chats, login failures and chat ticket routing delays. More info to come.
04:40 UTC | 21:40 PT
Our team continues to investigate issues across all Pods impacting multiple features in Chat and Messaging (Support Agent Workspace). Chat/message delays, login issues, message and chat routing and other features may be impacted. Next update in 30 minutes.
05:08 UTC | 22:08 PT
We are working with our database vendor to identify and resolve the issues impacting Chat and Messaging across all Pods. Thanks for your patience while we work through today's issues.
05:45 UTC | 22:45 PT
We continue to work with our vendor to resolve the Chat and Messaging issues. Partial recovery has been observed following a database failover; however, issues remain on our provider’s end that are impacting customers. We will provide another update as soon as info is available.
06:26 UTC | 23:26 PT
We are observing significant recovery across all Pods for the impacted services in Chat and Messaging. We are turning our attention to backfilling chats/messages that were interrupted during the incident. A final update will be provided when we reach full recovery.
07:04 UTC | 00:04 PT
We have detected that degradation has resumed for some customers using Chat and Messaging across all Pods. We are putting every effort into resolving this as soon as possible with our service provider. Thanks for your continuing patience.
07:42 UTC | 00:42 PT
We continue to work on stabilising Chat and Messaging and backfilling messages across all Pods. Partial recovery was observed earlier; however, we still saw errors across some services, which have begun to subside again. Another update in 1h or earlier.
08:29 UTC | 01:29 PT
We are observing increased stability, and customers have confirmed that the situation has improved on their end with Chat and Messaging across all Pods. We continue working on backfilling information not updated during the incident. Monitoring continues until full resolution.
09:03 UTC | 02:03 PT
The team continues to backfill information. The Messaging data update has been completed; the Chat data backfill is still ongoing. There’s no ETA at this point but we’ll keep you informed as we have more details.
10:00 UTC | 03:00 PT
We are happy to confirm that multiple issues related to Chat and Messaging across all Pods have been resolved. Data that didn’t update during the incident has been backfilled for both products. We truly appreciate your patience while we worked on this.
Root Cause Analysis
This incident was caused by a regional outage of our hosting provider in the EU region affecting multiple Zendesk services.
Resolution
The first outage, from 04:16 UTC to 05:08 UTC, was resolved when Zendesk engineering performed a failover. This failover took longer than expected due to the ongoing regional outage. The second outage lasted from 06:56 UTC to 07:06 UTC. A second failover was initiated and resolved the issue.
Remediation Items
- Schedule leadership meeting between Zendesk and hosting provider to review root cause analysis and remediation items [Scheduled]
- Review automated recovery mechanisms for impacted systems [Scheduled]
- Improve consistent hashing mechanism to limit the impact for similar future incidents [Scheduled]
- Explore rate limiting mechanisms to limit impact for similar future incidents [Scheduled]
- Further develop recovery plan for cache and data store outages [Scheduled]
- Additional alerts and monitoring [Scheduled]
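For readers unfamiliar with the consistent hashing item in the list above, the following is a minimal, generic sketch of why the technique can limit the impact of losing a node. It is not Zendesk's implementation: the node names (`cache-a` to `cache-d`), the key format, the virtual node count and the `HashRing` class are all hypothetical and chosen only for illustration. The sketch shows that when one node is removed from the ring, only the keys that node owned are reassigned.

```python
import bisect
import hashlib


def stable_hash(value: str) -> int:
    """Map a string to a 64-bit integer that is stable across runs."""
    return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")


class HashRing:
    """Hypothetical consistent hash ring with virtual nodes."""

    def __init__(self, nodes, vnodes=100):
        # Each node is placed at `vnodes` points on the ring to even out load.
        self._points = sorted(
            (stable_hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self._hashes = [h for h, _ in self._points]

    def node_for(self, key: str) -> str:
        # Pick the first ring point clockwise from the key, wrapping around.
        idx = bisect.bisect_right(self._hashes, stable_hash(key))
        return self._points[idx % len(self._points)][1]


# Assumed example data: four cache nodes and 2,000 chat keys.
nodes = ["cache-a", "cache-b", "cache-c", "cache-d"]
keys = [f"chat-{i}" for i in range(2000)]

full = HashRing(nodes)
degraded = HashRing([n for n in nodes if n != "cache-b"])  # cache-b is lost

# Keys whose owner changed after the node was removed.
moved = [k for k in keys if full.node_for(k) != degraded.node_for(k)]

print(f"{len(moved)} of {len(keys)} keys were reassigned")
```

In this sketch every reassigned key previously belonged to the removed node, so keys served by the surviving nodes are undisturbed. A plain `hash(key) % node_count` scheme would instead reshuffle most keys whenever the node count changes.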
For More Information
For current system status information about your Zendesk, check out our system status page. The summary of our post-mortem investigation is usually posted here a few days after the incident has ended. If you have additional questions about this incident, please log a ticket with us via ZBot Messaging within the Widget.
Postmortem published June 26, 2023.