Service Incident - March 16th, 2021 - Pod 23 - Chats failing in Agent Workspace

SUMMARY

On March 16, 2021 from 16:15 UTC to 18:07 UTC, Zendesk Chat agents on Pod 23 experienced unexpected missed chats through Agent Workspace in Support.

Timeline

17:38 UTC | 10:38 PT
From 16:53 UTC to 17:20 UTC, Support and Chat Agent Workspace customers on Pod 23 experienced missed chats, delayed ticket creation, and duplicate tickets. Our systems have stabilized and we are continuing to monitor performance.

18:17 UTC | 11:17 PT
We continue to monitor an incident that took place from 16:53 UTC to 17:20 UTC and impacted Support and Chat Agent Workspace customers on Pod 23. We have received reports that inbound email delivery may also have been delayed. Our systems remain stable at this time.

19:13 UTC | 12:12 PT
The issue impacting Support and Chat Agent Workspace customers on Pod 23 has been resolved. Thanks for your patience through this extended incident. If you have any further issues, please reach out to us.

POST-MORTEM

Root Cause Analysis

This incident was caused by a bug in a storage controller manager in Zendesk’s infrastructure. The bug triggered a race condition that caused storage volumes to be detached unexpectedly. When in-use volumes were detached, reads and writes to those volumes failed, which prevented incoming chat sessions from connecting to online agents.
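
This report does not describe the controller's internals, so the following is only a generic illustration of this class of bug, not Zendesk's or any upstream project's code. It is a minimal Python sketch of a check-then-detach step made atomic with a lock, so a volume cannot be detached between the "is it in use?" check and the detach itself. The `VolumeRegistry` class and all of its method names are hypothetical.

```python
import threading


class VolumeRegistry:
    """Toy in-memory model of volume attachments (hypothetical, illustration only)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._attached = set()   # IDs of volumes currently attached
        self._users = {}         # volume ID -> number of active users

    def attach(self, volume_id):
        with self._lock:
            self._attached.add(volume_id)

    def is_attached(self, volume_id):
        with self._lock:
            return volume_id in self._attached

    def acquire(self, volume_id):
        """Mark the volume as in use by one more consumer."""
        with self._lock:
            if volume_id not in self._attached:
                raise RuntimeError(f"volume {volume_id} is not attached")
            self._users[volume_id] = self._users.get(volume_id, 0) + 1

    def release(self, volume_id):
        """Mark one consumer as finished with the volume."""
        with self._lock:
            remaining = self._users.get(volume_id, 0) - 1
            if remaining > 0:
                self._users[volume_id] = remaining
            else:
                self._users.pop(volume_id, None)

    def detach_if_unused(self, volume_id):
        """Check usage and detach under the same lock.

        Returns True only if the volume was attached, unused, and is now detached.
        Doing the check and the detach in one critical section is what prevents
        a consumer from acquiring the volume in between the two steps.
        """
        with self._lock:
            if volume_id not in self._attached:
                return False
            if self._users.get(volume_id, 0) > 0:
                return False
            self._attached.discard(volume_id)
            return True
```

In this sketch, a call to `detach_if_unused` on a volume that a consumer has acquired returns `False` and leaves the volume attached. A version that checked usage and detached in two separate, unlocked steps could detach a volume that had just come into use.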

Resolution

To fix this issue, the detached volumes were reattached and the affected services were restarted, which led to full service recovery. Full resolution was achieved after the incident by rolling back the faulty controller manager to a previous working version.

During the outage period, a secondary backup system was automatically activated to create chat tickets from incoming chats. This eliminated the possibility of data loss. When the main system came back online, the backup system may have created duplicate tickets, which is expected behaviour.
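
To illustrate why two ticket-creating systems can produce duplicates, and how such duplicates might be spotted afterwards, here is a minimal Python sketch. It assumes each ticket records the chat it was created from. The `Ticket` fields, the `source` labels, and the `find_duplicates` helper are hypothetical and are not part of any Zendesk API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ticket:
    ticket_id: int
    chat_id: str   # hypothetical link back to the originating chat
    source: str    # hypothetical label: "primary" or "backup"


def find_duplicates(tickets):
    """Group tickets by chat_id and return only the groups with more than one ticket."""
    by_chat = {}
    for ticket in tickets:
        by_chat.setdefault(ticket.chat_id, []).append(ticket)
    return {chat_id: group for chat_id, group in by_chat.items() if len(group) > 1}


tickets = [
    Ticket(1, "chat-a", "primary"),
    Ticket(2, "chat-a", "backup"),   # same chat, created again by the backup path
    Ticket(3, "chat-b", "backup"),   # created only by the backup path
]
duplicates = find_duplicates(tickets)
```

Here `duplicates` contains a single entry for `"chat-a"`, holding tickets 1 and 2. The ticket for `"chat-b"` exists only once, which is the case where the backup path prevented data loss.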

Remediation Items

Agent Workspace remediations

Increase resources across all Pods for the service that uses persistent volumes, to improve recovery speed [COMPLETED]
Work on simplifying the service event processing to improve recovery speed [IN PROGRESS]

Infrastructure remediations

Improve monitoring and alerting for persistent volumes [TO DO]
Improve tooling to speed up future rollouts/rollbacks [TO DO]
Submit a patch upstream to resolve the race condition [TO DO]
Develop more robust testing around persistent volume management, to better catch issues in EBS drivers and other low-level infrastructure [TO DO]

FOR MORE INFORMATION

For current system status information about your Zendesk, check out our system status page. During an incident, you can also receive status updates by following @ZendeskOps on Twitter. The summary of our post-mortem investigation is usually posted here a few days after the incident has ended. If you have additional questions about this incident, please log a ticket with us.