SUMMARY
In June 2024, particularly on the 13th and 25th, and on a few days in July, a series of incidents disrupted Zendesk's Support Agent Workspace, making it difficult for agents to access tickets. The main symptoms were a "Messages not found" error and an "A_xxx" error code when trying to load tickets. The disruptions occurred across multiple Pods, and each spike lasted roughly two minutes on average. Refreshing the browser served as a workaround, but customers risked losing ongoing conversations in the process.
Timeline
June 25, 2024 04:05 PM UTC | June 25, 2024 09:05 AM PT
We are aware of a spike of errors between 15:40 and 15:47 UTC on June 25, 2024 that impacted customers across multiple Pods, resulting in "Messages not found" responses and "Error code A_xxx" messages when trying to load tickets in Support. We have recovered from these errors; any lingering issues should be resolved by reloading your browser and/or clearing your cache and cookies.
POST-MORTEM
Root Cause Analysis
The main cause of these incidents was an unexpected surge in HTTP requests to the server, often at peak traffic times. The surge produced a "thundering herd" effect that overwhelmed the Agent Graph server's connections and caused its readiness probes to fail. Lotus, a vital component of the system, was identified as a significant contributor: each time it reconnected, it flooded the Ticket Data Manager (TDM) with requests. The extra traffic was primarily attributed to conversation-state subscriptions all reconnecting at once after mass disconnections, seemingly caused by Zorg/Nginx and/or subscription service deployments.
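To make the failure mode concrete, here is a minimal, purely illustrative TypeScript sketch of how a mass disconnect becomes a thundering herd; the names, shapes, and numbers below are assumptions for illustration, not Zendesk internals:

```typescript
// Illustrative only: every client reconnects and re-fetches at the same
// instant after a mass disconnect, so N clients produce N simultaneous
// requests against the backend (the TDM, in this incident's terms).
type Subscription = { clientId: number; ticketId: string };

let simultaneousRequests = 0;

function refetchConversationState(sub: Subscription): void {
  // Each reconnect triggers a full re-fetch, with no delay and no reuse
  // of previously cached state.
  simultaneousRequests += 1;
}

function onDeploymentDisconnect(subs: Subscription[]): void {
  // A deployment drops every subscription at once, and every client
  // immediately reconnects and re-fetches.
  subs.forEach(refetchConversationState);
}

const subs: Subscription[] = Array.from({ length: 10_000 }, (_, i) => ({
  clientId: i,
  ticketId: `ticket-${i}`,
}));
onDeploymentDisconnect(subs);
console.log(`requests arriving at once: ${simultaneousRequests}`); // 10000
```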
The TDM is mainly responsible for managing ticket data. It organizes and stores information when a ticket is created, then retrieves and presents that data when an agent or a customer needs to access it. In effect, it acts as the central controller for all ticket-related data, keeping operations across the system consistent.
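As a rough mental model of that role (the real TDM interface is internal to Zendesk and not public, so everything below is hypothetical):

```typescript
// Hypothetical sketch of the TDM's responsibilities as described above;
// the actual interface may look nothing like this.
interface TicketData {
  id: string;
  subject: string;
  conversation: string[];
}

interface TicketDataManager {
  // Organize and store data when a ticket is created.
  store(ticket: TicketData): Promise<void>;
  // Retrieve and present data when an agent or customer opens the ticket.
  fetch(ticketId: string): Promise<TicketData>;
  // Push live updates to clients (the conversation-state subscriptions
  // mentioned above); returns an unsubscribe function.
  subscribe(ticketId: string, onChange: (data: TicketData) => void): () => void;
}
```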
Resolution
Several preventive measures were implemented in response to these issues. They included a Connection and Request Rate Limiter to regulate incoming traffic, alongside work to make the Agent Graph more resilient to caching failures, so that inevitable technical glitches would not cascade into system-wide disruptions, much as a backup generator keeps the lights on during a power outage. While numerous mitigations were put in place, the remediation that actually concluded the service incident was a change in Lotus that reduced the number of scenarios in which data would be re-fetched, ending the thundering herd effect.
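To illustrate the rate-limiting idea, here is a generic token-bucket sketch in TypeScript; the class name and limits are assumptions for illustration, not Zendesk's actual limiter:

```typescript
// Generic token bucket: admits short bursts up to `capacity`, then
// throttles to a sustained `refillPerSecond` rate.
class TokenBucket {
  private tokens: number;
  private lastRefill: number;

  constructor(
    private readonly capacity: number,        // burst size
    private readonly refillPerSecond: number, // sustained rate
  ) {
    this.tokens = capacity;
    this.lastRefill = Date.now();
  }

  tryAcquire(): boolean {
    const now = Date.now();
    const elapsedSeconds = (now - this.lastRefill) / 1000;
    this.tokens = Math.min(
      this.capacity,
      this.tokens + elapsedSeconds * this.refillPerSecond,
    );
    this.lastRefill = now;
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true; // admit the connection or request
    }
    return false;  // shed load instead of letting requests pile up
  }
}

// For example, allow bursts of 100 with a sustained 20 requests/second:
const limiter = new TokenBucket(100, 20);
if (!limiter.tryAcquire()) {
  // Reject (e.g. HTTP 429) so clients back off rather than overwhelm the server.
}
```

Shedding excess load early is what keeps a reconnect storm from exhausting the server's connections and failing its readiness probes.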
Edited July 25: After adjustments made on July 10 to prevent a buildup of requests, we saw no further spikes impacting the ticket UI. We continued to monitor closely and confirmed that the system ran smoothly over the following days.
Over the previous month we had also noticed performance dips on specific Pods on Fridays; performance held steady on July 12, which gave us more confidence in our changes. When July 15 likewise passed without any performance bumps or spikes, we concluded that the issue had been resolved.
Remediation Items
Additional strategies have been planned to further enhance system stability and prevent future disruptions:
- Alert for Readiness Probe Failures: Implement a smoke test that promptly alerts the technical team when readiness probes fail, enabling swift action.
- Consideration of Fetching Patterns: Advise developers to weigh the volume and frequency of data fetches carefully to avoid overloading downstream services.
- Establishing a Baseline of Requests: Determine how many simultaneous requests for ticket information the system can handle, so limits can be set before a breakdown occurs.
- Space Out the Re-fetches: Introduce jitter so that clients do not all re-fetch at the same moment, mitigating thundering herd effects (see the sketch after this list).
- Explore More Graceful Subscription Retention: Investigate ways to maintain subscriptions more effectively during deployments.
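As a minimal sketch of the jitter idea (the function names and parameter values below are assumed for illustration, not taken from Zendesk's code):

```typescript
// "Full jitter" backoff: each client waits a random fraction of an
// exponentially growing window, so retries spread out over time instead
// of all arriving at the same instant.
function backoffWithJitter(attempt: number, baseMs = 500, capMs = 30_000): number {
  const maxDelayMs = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.random() * maxDelayMs; // uniform over [0, maxDelayMs)
}

async function refetchWithBackoff(refetch: () => Promise<void>): Promise<void> {
  for (let attempt = 0; ; attempt++) {
    try {
      await refetch();
      return;
    } catch {
      // Wait a jittered delay before retrying rather than retrying instantly.
      await new Promise((resolve) => setTimeout(resolve, backoffWithJitter(attempt)));
    }
  }
}
```

With jitter, thousands of clients that disconnect together spread their re-fetches over a window of seconds rather than hitting the server within the same second.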
FOR MORE INFORMATION
For current system status information about your Zendesk, check out our system status page. The summary of our post-mortem investigation is usually posted here a few days after the incident has ended. If you have additional questions about this incident, contact Zendesk customer support.