Summary
On May 11, 2021, from 18:18 UTC to 22:53 UTC, Agent Workspace (Support) customer agents were intermittently set to the ‘Away’ status when they accepted chats. Only Agent Workspace customers who were online and using chat during the earlier ‘May 11th, Agent Workspace/Chat degradation for some customers’ incident were impacted.
Timeline
20:57 UTC | 13:57 PT
We are aware of issues for some Agent Workspace customers, such as incoming chats not showing up in dashboards and an inability to serve some chats. We suspect this is related to the earlier incident this morning. We will provide another update as we continue investigating.
21:27 UTC | 14:27 PT
We are still investigating chat issues with incoming chats not showing up in dashboards and an inability to serve some chats for some Agent Workspace customers. We will provide another update as we have more information.
21:57 UTC | 14:57 PT
We are still investigating chat issues with incoming chats not showing up in dashboards and an inability to serve chats for some Agent Workspace customers. We will provide another update as we have more information.
22:27 UTC | 15:27 PT
We are still investigating chat issues with incoming chats not showing up in dashboards and an inability to serve chats for some Agent Workspace customers. We will provide another update as we have more information.
23:47 UTC | 16:47 PT
Our team has deployed a fix to resolve the issues involving Chat and Agent Workspace. Please let us know if you continue to see any issues.
POST-MORTEM
Root Cause Analysis
This incident was caused by a capacity issue in one of the Chat caching clusters used to surface chats as tickets to agents. Inefficient memory cleanup led to the cluster running out of memory. During the previous incident, tickets could not be shown to agents, so agents could not serve incoming chats. Some chats from that incident lingered afterwards; as a result, new chats were not served to agents on accounts with the ‘reassignment + autoidle’ timeout settings enabled.
Resolution
To fix this issue, we cleaned up the lingering chats created during the incident. Recovery was observed thereafter.
Remediation Items
- Implement distinct caching instances for separate backend services to mitigate recurrences [Scheduled]
- Revise cache expiration (TTL) settings [Scheduled]
- Revise monitoring and alerting thresholds for memory usage on caching clusters [Scheduled]
- Upgrade cache clusters to current version [Scheduled]
- Update runbooks with cache cleanup process [Scheduled]
- Review all metrics with our caching partner and implement improvements [Scheduled]
- Improve logging mechanisms for more efficient issue identification [Scheduled]
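As a rough illustration of two of the items above (cache expiration and memory alerting thresholds), the following is a minimal sketch, not a description of the actual Chat caching clusters. The class `TTLCache`, its parameters, and the use of entry count as a stand-in for memory usage are all assumptions made for the example.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple


class TTLCache:
    """Toy in-memory cache (hypothetical): entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds: float, max_entries: int,
                 alert_ratio: float = 0.8,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.ttl_seconds = ttl_seconds
        self.max_entries = max_entries    # assumed proxy for memory capacity
        self.alert_ratio = alert_ratio    # alert when usage reaches this fraction
        self._clock = clock
        self._store: Dict[str, Tuple[float, Any]] = {}

    def set(self, key: str, value: Any) -> None:
        # Store the value together with its absolute expiry time.
        self._store[key] = (self._clock() + self.ttl_seconds, value)

    def get(self, key: str) -> Optional[Any]:
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            # Lazy cleanup: drop the stale entry when it is read.
            del self._store[key]
            return None
        return value

    def purge_expired(self) -> int:
        # Active cleanup: remove every expired entry, return how many were removed.
        now = self._clock()
        expired = [k for k, (exp, _) in self._store.items() if now >= exp]
        for k in expired:
            del self._store[k]
        return len(expired)

    def usage_ratio(self) -> float:
        return len(self._store) / self.max_entries

    def should_alert(self) -> bool:
        # True once usage reaches the alerting threshold.
        return self.usage_ratio() >= self.alert_ratio
```

In this sketch, entries that are never read again stay in the store until `purge_expired` runs, which is why a periodic cleanup and a usage alert are paired: the alert is meant to fire before capacity is exhausted if cleanup falls behind.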
FOR MORE INFORMATION
For current system status information about your Zendesk, check out our system status page. During an incident, you can also receive status updates by following @ZendeskOps on Twitter. The summary of our post-mortem investigation is usually posted here a few days after the incident has ended. If you have additional questions about this incident, please log a ticket with us.