Summary
On December 18, 2023, from 22:25 UTC to 23:18 UTC, Zendesk Chat and Support (Messaging) customers on all Pods may have experienced multiple issues, including chat and messaging errors, disconnections, login issues, and the inability to change agent status.
Timeline
23:03 UTC | 15:03 PT
We are investigating reports of Chat connectivity issues. We will provide another update in 15 minutes.
23:20 UTC | 15:20 PT
Our team continues to investigate issues across multiple Pods impacting multiple features in Chat, Social Chat, and Messaging (Support Agent Workspace). Customers may experience chat and message delays and login issues; message and chat routing and other features may also be impacted. Next update in 30 minutes.
23:42 UTC | 15:42 PT
Our engineers have restarted an unhealthy Chat server and are now seeing recovery. We will continue to monitor performance and provide another update when we have more information to share.
00:49 UTC | 16:49 PT
Chat and Messaging have now fully recovered from today's server issue. Our teams will continue to monitor performance and work to restore any recoverable historical chats that have not automatically recovered. We will send a final message when this work has been completed in the coming hours.
01:24 UTC | 17:24 PT
Our teams have restored all recoverable historical chats that were not recovered during yesterday's service disruption. Thank you for your patience, and we apologize for the inconvenience this issue caused.
POST-MORTEM
Root Cause Analysis
This incident was caused by the failure of a single live chat host in our hosting provider’s infrastructure. The failure disrupted the chat/messaging service for customers served by that backend host.
Resolution
To fix this issue, our team restarted the affected host. Recovery of undelivered messages impacted during the outage was completed after the service was restored.
Remediation Items
- Improve recovery time when an instance failure occurs by updating runbooks to initiate power cycle procedures earlier [Scheduled]
- Update tools access for on-call engineers [Scheduled]
- Introduce additional alerts to detect instance failures [In progress]
- Escalate priority of Pod account migrations to reduce impact radius [In progress]
FOR MORE INFORMATION
For current system status information about your Zendesk, check out our system status page. The summary of our post-mortem investigation is usually posted here a few days after the incident has ended. If you have additional questions about this incident, please log a ticket with us via ZBot Messaging within the Widget.
Dan Beirouty
Post-mortem published December 22, 2023.