SUMMARY
On October 19, 2021, from 21:16 UTC to 22:34 UTC, customers may have experienced Chat issues in one or more of the following ways:
- New chats were not presented to agents
- Accepting chats was delayed
- Chats were not saved
- Agents may have been moved to an Invisible status, in addition to the chat routing issues
Ongoing chats would not have been affected. Chats that were not saved immediately were backfilled at a later time.
Timeline
22:06 UTC | 15:06 PT
The ability to chat with Zendesk support agents at support.zendesk.com is also currently impacted.
22:21 UTC | 15:21 PT
We are investigating reports of Chat routing delays. More information to follow.
22:27 UTC | 15:27 PT
We have deployed a fix and are beginning to see recovery. Chats that were not routed during the degradation will show as missed chats if the chat visitor left.
22:53 UTC | 15:53 PT
The Chat degradation causing Chat routing delays and chats not to be saved is now fully resolved. Chats that were not routed during the degradation will show as missed chats if the chat visitor left.
23:06 UTC | 16:06 PT
The Chat degradation causing Chat routing delays and chats not to be saved is now fully resolved. Chats that were not routed during the degradation will show as missed chats if the chat visitor left.
POST-MORTEM
Root Cause Analysis
This incident was caused by the Chat Routing Service (CRS) entering a stuck state after receiving messages that it could not process. A contributing factor was that the triggering scenario had not been tested, owing to the number of permutations and the complexity of the settings the system allows.
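
For readers who want a concrete picture of this failure mode, the following is a minimal, generic sketch of one common way to keep a message consumer from stalling on input it cannot process: retry a bounded number of times, then set the message aside and move on. It is not the Chat Routing Service's code or the hotfix that was released. The names `drain_queue`, `route_chat`, `max_attempts`, and the message fields are all hypothetical and chosen only for illustration.

```python
from collections import deque
from typing import Callable, Deque, List, Tuple


def drain_queue(
    messages: Deque[dict],
    handle: Callable[[dict], None],
    max_attempts: int = 3,
) -> Tuple[List[dict], List[dict]]:
    """Process every message, parking those that keep failing.

    A message that still fails after `max_attempts` tries is moved to a
    dead-letter list so the messages behind it are not blocked.
    """
    processed: List[dict] = []
    dead_letters: List[dict] = []
    while messages:
        message = messages.popleft()
        for attempt in range(1, max_attempts + 1):
            try:
                handle(message)
            except Exception:
                # Give up on this message only after the final attempt.
                if attempt == max_attempts:
                    dead_letters.append(message)
            else:
                processed.append(message)
                break
    return processed, dead_letters


def route_chat(message: dict) -> None:
    # Hypothetical handler: rejects any message that lacks a department.
    if "department" not in message:
        raise ValueError("cannot route chat without a department")


queue = deque([
    {"chat_id": 1, "department": "sales"},
    {"chat_id": 2},  # unprocessable under the hypothetical rule above
    {"chat_id": 3, "department": "support"},
])
processed, dead_letters = drain_queue(queue, route_chat)
```

In this sketch, the second message ends up in `dead_letters` while the first and third are still routed, so a single bad message does not hold up the rest of the queue. A production system would typically also log and alert on dead-lettered messages so they can be inspected and replayed.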
Resolution
To fix this issue, we released a hotfix, and recovery was observed after it was deployed.
Remediation Items
- Improve the Chat Routing Service to handle the above scenario gracefully [In progress]
- Update internal documentation [Completed]
- Review chat alerting mechanisms [Completed]
FOR MORE INFORMATION
For current system status information about your Zendesk, check out our system status page. During an incident, you can also receive status updates by following @ZendeskOps on Twitter. The summary of our post-mortem investigation is usually posted here a few days after the incident has ended. If you have additional questions about this incident, please log a ticket with us.