SUMMARY
In June 2024, most notably on June 13 and June 25, and again on a few days in July, a series of incidents disrupted Zendesk's Support Agent Workspace, making it difficult for agents to access tickets. Affected agents saw a "Messages not found" error with error code A_xxx when trying to load tickets. The errors occurred across multiple Pods, and each spike lasted roughly two minutes. Refreshing the browser worked around the problem, but agents risked losing in-progress conversations in the process.
Timeline
June 13, 2024 01:25 PM UTC | June 13, 2024 06:25 AM PT
We are aware of two spikes in errors impacting customers across multiple Pods, where they would see the “Messages not found” error (code A_xxx) when trying to load a ticket in Support. These issues have subsided, and you should be able to access tickets without further problems after reloading your browser and/or clearing your cache and cookies. We continue to investigate the cause of these errors and appreciate your patience.
June 14, 2024 07:21 PM UTC | June 14, 2024 12:21 PM PT
The recent issue causing spikes in errors across multiple Pods, where users encountered the “Messages not found” error (code A_xxx) while attempting to load tickets in Support, has been fully resolved. You should now be able to access tickets without any further problems. If you experience any lingering issues, please try reloading your browser and/or clearing your cache and cookies.
POST-MORTEM
Root Cause Analysis
The main cause of these incidents was an unexpected surge in HTTP requests to the server, often during peak traffic. This surge produced a "thundering herd" effect that overwhelmed the Agent Graph server's connections and caused its readiness probes to fail. Lotus, a vital component of the system, was identified as a significant contributor: each time it reconnected, it overloaded the Ticket Data Manager (TDM) with multiple requests. The surge in traffic is primarily attributed to conversation state subscriptions reconnecting en masse after mass disconnections, seemingly caused by Zorg/Nginx and/or subscription service deployments.
The TDM is responsible for managing ticket data: it organizes and stores information when a ticket is created, retrieves and presents that data when an agent or customer needs it, and acts as the central controller for all ticket-related data, keeping operations within the system consistent.
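To make the failure mode concrete, the sketch below shows the problematic pattern in hypothetical TypeScript; the names (fetchConversationState, ConversationSubscription) and the endpoint path are illustrative, not Zendesk's actual code. A client that re-fetches full conversation state on every reconnect turns a mass disconnection into one synchronized burst of requests against the ticket data backend.

```typescript
// Hypothetical sketch of the failure mode described above -- not Zendesk's code.
// When a deployment drops every connection at once, each client reconnects and
// immediately re-fetches full ticket state, so the backend sees one synchronized
// burst of requests: the "thundering herd".

type TicketState = { ticketId: string; messages: string[] };

// Stand-in for the call that loads conversation state for a ticket.
async function fetchConversationState(ticketId: string): Promise<TicketState> {
  const res = await fetch(`/api/tickets/${ticketId}/conversation`);
  if (!res.ok) throw new Error(`Messages not found (${res.status})`);
  return res.json();
}

class ConversationSubscription {
  constructor(private readonly ticketId: string) {}

  // Problematic pattern: every reconnect triggers an immediate full re-fetch.
  // Multiplied across thousands of agents reconnecting at the same instant,
  // this overwhelms the service that serves ticket data.
  async onReconnect(): Promise<void> {
    await fetchConversationState(this.ticketId);
    // ...re-establish the subscription...
  }
}
```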
Resolution
Preventive measures were implemented in response to these issues. These included a Connection and Request Rate Limiter to regulate incoming traffic, alongside work to make the Agent Graph more resilient to caching failures so that inevitable technical glitches would not cascade into system-wide disruptions, much like a backup generator during a power outage. While numerous mitigations were put in place, the remediation that actually concluded the service incident was a change in Lotus: it reduced the number of scenarios in which data would be re-fetched, ending the thundering herd effect.
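For illustration only, the following token-bucket sketch shows the general idea behind a connection and request rate limiter; the class, parameters, and usage are assumptions for the example, not the limiter Zendesk deployed.

```typescript
// Minimal token-bucket sketch of a connection/request rate limiter.
// A conceptual illustration, not Zendesk's implementation.

class TokenBucket {
  private tokens: number;
  private lastRefill: number;

  constructor(
    private readonly capacity: number,       // maximum burst size
    private readonly refillPerSecond: number // sustained request rate
  ) {
    this.tokens = capacity;
    this.lastRefill = Date.now();
  }

  // Returns true if the request may proceed, false if the caller has
  // exceeded its budget and should be rejected or asked to retry later.
  tryAcquire(): boolean {
    const now = Date.now();
    const elapsedSeconds = (now - this.lastRefill) / 1000;
    this.tokens = Math.min(
      this.capacity,
      this.tokens + elapsedSeconds * this.refillPerSecond
    );
    this.lastRefill = now;
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false;
  }
}

// Example: allow bursts of 20 requests and a sustained 5 requests/second.
const limiter = new TokenBucket(20, 5);
if (!limiter.tryAcquire()) {
  // Reject (e.g. HTTP 429) or back off and retry later.
}
```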
Edited July 25: After making adjustments on July 10 to prevent a buildup of requests, we did not see further spikes impacting the ticket UI. We continued to keep a close eye on things and were satisfied to find that everything ran smoothly in the following days.
Over the previous month we had also noticed performance dips on specific Pods on Fridays; these did not recur on July 12, which gave us more confidence in our changes. We likewise saw no performance bumps or spikes on July 15, leading us to believe the issue had been resolved.
Remediation Items
Additional strategies have been planned to further enhance system stability and prevent future disruptions:
- Alert for Readiness Probe Failures: Implement a smoke test to alert the technical team promptly about any potential issues, enabling swift action.
- Consideration of Fetching Patterns: Advise software developers to consider the volume and frequency of information retrieval carefully to avoid system imbalances.
- Establishing a Baseline of Requests: Determine the system’s capacity for dealing with simultaneous requests for ticket information to prevent a system breakdown.
- Space Out the Re-fetches: Introduce jitter so that re-fetches after a disconnect do not all arrive at once, mitigating thundering herd effects (see the sketch after this list).
- Explore More Graceful Subscription Retention: Investigate ways to maintain subscriptions more effectively during deployments.
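As a sketch of the jitter idea from the list above, the hypothetical helper below (refetchWithJitter and its parameters are illustrative names, not a committed design) delays each client's re-fetch by a random, exponentially capped amount so that reconnect traffic is spread over a window instead of arriving all at once.

```typescript
// Hypothetical sketch of jittered re-fetching after a mass disconnect.
// Instead of every client re-fetching immediately, each waits a random
// delay so reconnect traffic is spread over a window.

function sleep(ms: number): Promise<void> {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

// "Full jitter": wait a random amount between 0 and an exponentially
// growing cap, so retries after repeated failures spread out even further.
async function refetchWithJitter(
  refetch: () => Promise<void>,
  attempt: number,
  baseDelayMs = 500,
  maxDelayMs = 30_000
): Promise<void> {
  const cap = Math.min(maxDelayMs, baseDelayMs * 2 ** attempt);
  const delay = Math.random() * cap;
  await sleep(delay);
  await refetch();
}
```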
FOR MORE INFORMATION
For current system status information about your Zendesk, check out our system status page. The summary of our post-mortem investigation is usually posted here a few days after the incident has ended. If you have additional questions about this incident, contact Zendesk customer support.