Summary
On May 11, 2021, from 11:10 UTC to 15:40 UTC, customers using Chat and Messaging through Agent Workspace (Support) were unable to accept new incoming chats. Standalone Chat customers experienced degraded chat analytics and chat history functionality, but were still able to answer and chat with customers.
Timeline
12:57 UTC | 05:57 PT
We are aware that some Agent Workspace customers are not seeing incoming chats in their dashboards. An investigation is underway.
13:20 UTC | 06:20 PT
We are investigating issues causing delays in Real Time Monitoring, messages that can’t be sent, chats not showing for agents, and chats not routing to agents. We will continue to provide updates as they become available.
14:03 UTC | 07:03 PT
Our team is working to resolve an outage which is impacting Agent Workspace. Due to this incident, chats may not be visible in Zendesk Sell or Explore. We will provide an update as we learn more.
14:36 UTC | 07:36 PT
We are still working on the outage which is impacting multiple products, including Agent Workspace. Due to this incident, chats may not be visible in Zendesk Sell or Explore. More information soon.
15:08 UTC | 08:08 PT
We are continuing to work to resolve the outage impacting multiple products, including Agent Workspace. Due to this incident, chats may not be visible in Zendesk Sell or Explore. We will keep providing updates as soon as we have further information.
15:52 UTC | 08:52 PT
We are continuing to work on resolving the outage impacting multiple products; however, we are seeing stability in Chat and Agent Workspace. Transcripts may be delayed in appearing in tickets. We will continue to provide updates as soon as we have further information.
16:38 UTC | 09:38 PT
Our team has confirmed that the issue impacting Chat availability is now resolved. All transcripts should be present within tickets as well. Please let us know if you're still seeing this issue or if tickets are yet to have transcripts.
POST-MORTEM
Root Cause Analysis
This incident was caused by a capacity issue in one of the Chat caching clusters used to surface chats as tickets to agents. Inefficient memory cleanup caused the cluster to run out of memory. As a result, tickets could not be shown to agents, so agents could not serve incoming chats.
Resolution
To fix this issue, we increased the capacity of the affected caching cluster.
Remediation Items
- Add distinct caching instances for separate backend services to mitigate recurrences [Scheduled]
- Improve monitoring metrics for the caching cluster [Scheduled]
- Add documentation to improve understanding of the affected components for faster decision making [In Progress]
- Add new metrics for monitoring Chat services [In Progress]
- Review all metrics with our caching partner and implement improvements [Scheduled]
- Improve logging mechanisms for more efficient issue identification [Scheduled]
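As an illustration only, the kind of memory monitoring described in the remediation items could be sketched as a simple utilization check on each cache node. This is not the monitoring that was actually implemented: the `CacheNodeSample` type, the `memory_alerts` function, and the 75% and 90% thresholds are all hypothetical and chosen purely for the example.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class CacheNodeSample:
    """One memory reading from a cache node (hypothetical shape)."""
    node: str
    used_bytes: int
    max_bytes: int


def memory_alerts(
    samples: List[CacheNodeSample],
    warn_ratio: float = 0.75,      # assumed warning threshold
    critical_ratio: float = 0.90,  # assumed critical threshold
) -> List[Tuple[str, str, float]]:
    """Return (node, level, utilization) for every node at or above a threshold."""
    alerts = []
    for sample in samples:
        if sample.max_bytes <= 0:
            # Skip nodes that report no memory limit; utilization is undefined.
            continue
        utilization = sample.used_bytes / sample.max_bytes
        if utilization >= critical_ratio:
            alerts.append((sample.node, "critical", utilization))
        elif utilization >= warn_ratio:
            alerts.append((sample.node, "warning", utilization))
    return alerts


if __name__ == "__main__":
    readings = [
        CacheNodeSample("cache-a", 50, 100),
        CacheNodeSample("cache-b", 80, 100),
        CacheNodeSample("cache-c", 95, 100),
    ]
    for node, level, utilization in memory_alerts(readings):
        print(f"{node}: {level} ({utilization:.0%} of memory in use)")
```

Alerting at a warning level well before memory is exhausted is intended to leave time to add capacity before agents are affected.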
FOR MORE INFORMATION
For current system status information about your Zendesk, check out our system status page. During an incident, you can also receive status updates by following @ZendeskOps on Twitter. The summary of our post-mortem investigation is usually posted here a few days after the incident has ended. If you have additional questions about this incident, please log a ticket with us.