SUMMARY
On March 16, 2021 from 14:15 UTC to 15:45 UTC, Zendesk Chat agents on Pod 17 experienced unexpected missed chats through Agent Workspace in Support.
Timeline
15:07 UTC | 08:07 PT
Our teams are investigating a delay in Chat serving via Agent Workspaces that is impacting Zendesk Pod 17 customers. We will provide more information as it becomes available.
15:45 UTC | 08:45 PT
Our team is continuing to investigate an issue with agents not being served chats via Agent Workspaces. This is impacting Zendesk Pod 17 customers. We are seeing some recovery and will provide more information as it becomes available.
16:28 UTC | 09:28 PT
Our team has identified the infrastructure issue impacting Chat in Agent Workspaces for Zendesk customers on Pod 17. We are moving toward full recovery, and latency has improved significantly in the last 30 minutes. We will provide updates as they become available.
17:08 UTC | 10:08 PT
The issue with Chat latency in Agent Workspaces on Pod 17 is now resolved. Tickets will automatically be created from chats that were missed. Note that you may see duplicate tickets as a result of our systems recovering; these can be merged or deleted as needed.
18:24 UTC | 11:24 PT
We continued to monitor the issue with Agent Workspace and ticket creation via the Chat and Messaging channels on Pod 17. We have confirmed that this issue is now fully resolved.
POST-MORTEM
Root Cause Analysis
This incident was caused by a bug in a storage controller in Zendesk’s infrastructure that triggered a race condition, causing storage volumes to be detached unexpectedly. When in-use volumes were detached, reads and writes to those volumes failed, preventing incoming chat sessions from connecting to online agents.
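To illustrate the general failure mode, here is a minimal, hypothetical sketch of a check-then-act race in a storage controller. This is illustrative only and not Zendesk's actual controller code: a detach decision based on a stale "not in use" observation detaches a volume that has since gone into use, while a lock-guarded version re-checks atomically and declines.

```python
import threading

class Volume:
    """Toy stand-in for a storage volume managed by a controller."""
    def __init__(self):
        self.attached = True
        self.in_use = False
        self.lock = threading.Lock()

def racy_observe(vol):
    # Check-then-act bug: the controller observes the volume as free...
    return not vol.in_use

def racy_detach(vol, observed_free):
    # ...and later detaches based on that stale observation.
    if observed_free:
        vol.attached = False

def safe_detach(vol):
    # Holding the same lock as the attach/use path makes the
    # check-and-detach atomic, closing the race window.
    with vol.lock:
        if not vol.in_use:
            vol.attached = False
            return True
        return False

# One interleaving the buggy controller allowed:
vol = Volume()
free = racy_observe(vol)     # controller sees in_use == False
vol.in_use = True            # a chat session starts using the volume
racy_detach(vol, free)       # in-use volume is detached; its I/O now fails
print(vol.attached)          # False, despite in_use == True

# The lock-guarded path re-checks under the lock and refuses:
vol2 = Volume()
vol2.in_use = True
print(safe_detach(vol2))     # False: detach declined, volume stays attached
```

The key property of the fix is that the state check and the detach happen under one lock, so no attach/use transition can slip in between them.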
Resolution
To fix this issue, the detached volumes were reattached and the affected services restarted, leading to full service recovery.
During the outage period, a secondary backup system was automatically activated to create chat tickets from incoming chats, which eliminated the possibility of data loss. When the main system came back online, the backup system may have created duplicate tickets, which is expected behavior.
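When a primary path and a backup path can both create a ticket for the same chat, duplicates can be identified by a shared key. A minimal sketch, assuming a hypothetical `chat_session_id` field (not Zendesk's actual ticket schema): keep the first ticket per chat session and flag later ones for merge or deletion.

```python
def dedupe_tickets(tickets):
    """Keep the first ticket per chat session; report later ones as
    duplicates (e.g. created by both the primary and backup paths)."""
    seen, unique, dupes = set(), [], []
    for ticket in tickets:
        key = ticket["chat_session_id"]
        if key in seen:
            dupes.append(ticket)
        else:
            seen.add(key)
            unique.append(ticket)
    return unique, dupes

tickets = [
    {"id": 1, "chat_session_id": "abc"},
    {"id": 2, "chat_session_id": "abc"},  # duplicate from the backup path
    {"id": 3, "chat_session_id": "def"},
]
unique, dupes = dedupe_tickets(tickets)
print([t["id"] for t in unique])  # [1, 3]
print([t["id"] for t in dupes])   # [2]
```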
Remediation Items
Agent Workspace remediations
- Bump up the resources across all Pods for the service that uses persistent volumes, to improve recovery speed [COMPLETED]
- Work on simplifying the service event processing to improve recovery speed [IN PROGRESS]
Infrastructure remediations
- Improve monitoring and alerting for persistent volumes [TO DO]
- Improve tooling to speed up future rollouts/rollbacks [TO DO]
- Submit a patch upstream to resolve the race condition [TO DO]
- Develop more robust testing around persistent volume management, to better catch issues in low-level infrastructure [TO DO]
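As a sketch of the monitoring remediation above, one simple check is to flag any volume that reports as in use but not attached, the inconsistent state seen in this incident. This is a hypothetical example, not Zendesk's actual tooling; the field names are assumptions.

```python
def volumes_needing_alert(volumes):
    """Flag volumes reporting as in use but not attached:
    the inconsistent state seen during this incident."""
    return [v["id"] for v in volumes if v["in_use"] and not v["attached"]]

volumes = [
    {"id": "vol-1", "in_use": True,  "attached": True},
    {"id": "vol-2", "in_use": True,  "attached": False},  # should alert
    {"id": "vol-3", "in_use": False, "attached": False},
]
print(volumes_needing_alert(volumes))  # ['vol-2']
```

A check like this, run periodically against the storage controller's view of volume state, would have surfaced the unexpected detachments before agents began missing chats.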
FOR MORE INFORMATION
For current system status information about your Zendesk, check out our system status page. During an incident, you can also receive status updates by following @ZendeskOps on Twitter. The summary of our post-mortem investigation is usually posted here a few days after the incident has ended. If you have additional questions about this incident, please log a ticket with us.