SUMMARY
On April 26, 2023 from 16:40 UTC to 20:19 UTC, customer agents across multiple pods experienced disconnection issues when using Chat, Messaging, and Sunshine Conversations.
Timeline
17:19 UTC | 10:19 PT
We are receiving reports of disconnections in Chat across multiple pods. Investigation is underway and further updates will be posted soon.
17:20 UTC | 10:20 PT
We are receiving reports of disconnections in Chat across multiple pods. Investigation is underway and further updates will be posted soon.
17:30 UTC | 10:30 PT
We have confirmed an issue causing Chat disconnections across all pods and our team is working towards a root cause. Additional information will be posted shortly.
18:00 UTC | 11:00 PT
We are beginning to see recovery from the issue causing disconnections across Chat, Messaging, and Sunshine Conversations. We will monitor until resolution. Please let us know if you continue to experience issues.
18:18 UTC | 11:18 PT
We are happy to report that the issue causing disconnections across Chat, Messaging, and Sunshine Conversations has been resolved. Thank you for your patience during our investigation.
20:03 UTC | 13:03 PT
We are investigating a recurrence of the Chat disconnections service incident from earlier today, which is impacting all pods. More details to follow.
20:50 UTC | 13:50 PT
We are beginning to see recovery from the issue causing disconnections across Chat, Messaging, and Sunshine Conversations. We will monitor until resolution. Please let us know if you continue to experience issues.
21:02 UTC | 14:02 PT
The issue causing disconnections across Chat, Messaging, and Sunshine Conversations is now fully resolved. Thank you for your continued patience, and apologies for the disruption this may have caused you and your team.
POST-MORTEM
Root Cause Analysis
This incident was caused by a reboot of the Agent Workspace API during peak hours as part of routine maintenance. The reboot led to a temporary increase in load, which had a domino effect on our service instances and exposed insufficient underlying capacity in the system.
Resolution
A fix was deployed to address the Chat reconnection issues; it also mitigates future occurrences of this behaviour. Once the deployment was complete, we observed recovery and agents were able to reconnect successfully.
Remediation Items
- Improve error handling for Chat session reconnections [Done]
- Fix Chat session handling issue when reconnecting to Agent Workspace API hosts [Done]
- Review deployment procedures for quicker turnaround of fixes [Scheduled]
- Update paging rules so relevant teams can be more quickly engaged for faster issue resolution [In progress]
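
As an illustration of the kind of reconnection handling the first two items refer to, here is a minimal sketch of client-side reconnection with capped exponential backoff and random jitter. This is a common general technique for spreading out reconnect attempts after a host restart so that they do not arrive all at once; it is not a description of the actual fix, which this report does not detail. The names `backoff_delays`, `reconnect`, and `connect`, and all default values, are hypothetical.

```python
import random
import time


def backoff_delays(count, base=1.0, cap=60.0, rng=None):
    """Yield `count` delays in seconds using capped exponential backoff with full jitter."""
    if rng is None:
        rng = random.Random()
    for attempt in range(count):
        # The upper bound doubles on each attempt but never exceeds `cap`.
        ceiling = min(cap, base * 2 ** attempt)
        # A random delay in [0, ceiling] keeps clients from retrying in lockstep.
        yield rng.uniform(0.0, ceiling)


def reconnect(connect, max_attempts=6, base=1.0, cap=60.0,
              sleep=time.sleep, rng=None):
    """Call the hypothetical `connect` callable until it succeeds or attempts run out.

    `connect` is assumed to raise ConnectionError on failure and to
    return a session object on success.
    """
    delays = backoff_delays(max_attempts - 1, base=base, cap=cap, rng=rng)
    for attempt in range(1, max_attempts + 1):
        try:
            return connect()
        except ConnectionError:
            if attempt == max_attempts:
                # Out of attempts: surface the last error to the caller.
                raise
            sleep(next(delays))
```

A caller would pass its own connection function, for example `reconnect(open_chat_session)`, where `open_chat_session` is a placeholder for whatever establishes the session. The `sleep` and `rng` parameters exist only so the waiting behaviour can be replaced in tests.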
FOR MORE INFORMATION
For current system status information about your Zendesk, check out our system status page. The summary of our post-mortem investigation is usually posted here a few days after the incident has ended. If you have additional questions about this incident, please log a ticket with us via ZBot Messaging within the Widget.
Post-mortem published May 2, 2023.