SUMMARY
On July 02, 2024, from 08:10 UTC to 16:30 UTC, customers on Pods 17 and 18 encountered an issue where the "Accept Chat" button was unresponsive. The issue then spread, causing customers in multiple other Pods to experience a "Couldn't connect to server" error when attempting to open tickets. This significantly disrupted their ability to communicate and manage tasks within Agent Workspace.
Timeline
July 02, 2024 12:12 PM UTC | July 02, 2024 05:12 AM PT
We are currently investigating reports that the "Accept Chat" button is not working for some customers on Pods 17 and 18. We will provide another update when we have more information.
July 02, 2024 02:01 PM UTC | July 02, 2024 07:01 AM PT
We are currently working through several Chat, Messaging, and Agent Workspace issues and continue to investigate all of them. We appreciate your patience.
July 02, 2024 02:51 PM UTC | July 02, 2024 07:51 AM PT
We continue to address issues affecting Chat & Messaging acceptance in Agent Workspace for customers on Pods 17 and 18, for whom the "Accept Chat" button is not working. We are exploring fixes and testing options to fully resolve this issue.
July 02, 2024 03:28 PM UTC | July 02, 2024 08:28 AM PT
We are still investigating the root cause of the issue affecting Chat & Messaging acceptance in Agent Workspace for customers on Pods 17 & 18, which prevents use of the "Accept Chat" button. We will post additional information in one hour or when we have new information to share.
July 02, 2024 04:28 PM UTC | July 02, 2024 09:28 AM PT
Our team continues to investigate the issue affecting Chat & Messaging acceptance in Agent Workspace for customers on Pods 17 & 18, which prevents use of the "Accept Chat" button. We will provide further updates in one hour or when we have new information to share.
July 02, 2024 05:48 PM UTC | July 02, 2024 10:48 AM PT
We have increased capacity on Messaging services in Pods 17 & 18 and are monitoring for any additional impact. Our team will ensure that no further issues are seen when accepting chats, and we will provide additional updates as we confirm recovery. Please let us know if you continue to experience any issues with accepting chats.
July 03, 2024 05:05 AM UTC | July 02, 2024 10:05 PM PT
After further monitoring, we have confirmation that the issue impacting Chat and Messaging acceptance has been resolved. Many thanks for your patience as we got to this point.
POST-MORTEM
Root Cause Analysis
During an upgrade to an updated storage system, we encountered unforeseen performance problems that delayed the delivery of updates. The problems were largely caused by queries for connection and subscription lifecycles, which blocked the storage system and stalled transactions. This degraded the system component responsible for managing data and delivering real-time updates to the user interface. When we then tried to streamline the process by relying solely on the updated storage system, an unexpected surge in processing power usage further strained our resources.
Resolution
To resolve the issue, we took a multi-pronged approach. We increased the size of the database clusters across all pods and identified database locks and blocked transactions as the root of the performance issues. In response, we applied a quick fix to eliminate these locks, accepting the risk that it could leave orphaned database objects. Finally, we performed a gradual rollback, which ultimately stabilized the subscription service.
Remediation Items
- Remove DB locks and clean up orphaned subscriptions (completed).
- Add Service Level Objectives (SLOs) for the connection creation and subscription creation endpoints to monitor and ensure reliable system performance.
- Discuss adding soak time in the first production pod after the canary pod to catch similar issues earlier.
- Adopt staging load tests and maintenance practices that include the cleanup and recreation of clusters to keep the system functioning optimally.
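As an illustration of the SLO item above, the following is a minimal sketch of how an SLO for an endpoint such as subscription creation could be evaluated. It is not Zendesk's implementation: the `Request` and `Slo` types, the endpoint label, the 99% target, and the 500 ms latency threshold are all hypothetical values chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    endpoint: str      # hypothetical label, e.g. "subscription_create"
    latency_ms: float  # observed latency of the request
    ok: bool           # True if the request succeeded


@dataclass(frozen=True)
class Slo:
    target: float          # e.g. 0.99 means 99% of requests must be "good"
    max_latency_ms: float  # slower requests count as "bad" (assumed threshold)


def count_good(requests, slo):
    """Number of requests that succeeded within the latency threshold."""
    return sum(1 for r in requests if r.ok and r.latency_ms <= slo.max_latency_ms)


def sli(requests, slo):
    """Service level indicator: fraction of good requests (1.0 if no traffic)."""
    if not requests:
        return 1.0
    return count_good(requests, slo) / len(requests)


def error_budget_remaining(requests, slo):
    """Share of the error budget left: 1.0 is untouched, 0.0 or less is exhausted."""
    total = len(requests)
    if total == 0:
        return 1.0
    bad = total - count_good(requests, slo)
    allowed_bad = (1.0 - slo.target) * total
    if allowed_bad <= 0:
        # A 100% target leaves no budget: any bad request exhausts it.
        return 1.0 if bad == 0 else float("-inf")
    return 1.0 - bad / allowed_bad
```

With a 99% target, a window of 1,000 requests tolerates about 10 bad ones. If 20 requests are slow or failed, `sli` returns 0.98 and `error_budget_remaining` is roughly -1.0, meaning the budget was overspent by a factor of two. An alert on a falling budget for the connection and subscription endpoints is one way degradation of this kind could be surfaced early.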
FOR MORE INFORMATION
For current system status information about your Zendesk, check out our system status page. The summary of our post-mortem investigation is usually posted here a few days after the incident has ended. If you have additional questions about this incident, contact Zendesk customer support.
Jessica G.
Post-mortem published July 29, 2024.