SUMMARY
On August 9, 2024 from 15:46 UTC to 15:57 UTC, Support customers on Pod 17 experienced various issues such as error codes, slow loading times, and inability to open tickets or view messages within the product UI.
Timeline
August 09, 2024 04:13 PM UTC | August 09, 2024 09:13 AM PT
We're investigating reports of users being unable to view Support tickets on Pod 17 and are already seeing recovery. We will provide additional updates in 30 mins or sooner as we confirm full stability.
August 09, 2024 04:32 PM UTC | August 09, 2024 09:32 AM PT
From 15:46 UTC to 15:57 UTC, Support customers on Pod 17 experienced issues loading tickets. Performance has stabilized and we will continue to monitor it. Next update in an hour or when we have new information.
August 09, 2024 04:51 PM UTC | August 09, 2024 09:51 AM PT
The Support performance issues that occurred on Pod 17 from 15:46 UTC to 15:57 UTC are now fully resolved. We apologize for any inconvenience caused and appreciate your patience.
POST-MORTEM
Root Cause Analysis
This incident was caused by the unexpected reboot of an in-memory caching system, which speeds up data retrieval by holding information in memory. The Agent-graph component did not handle this failure adequately: it continued to wait up to 60 seconds for a response from the cache, which caused timeouts and resulted in 503 service errors. Two factors contributed. First, the system did not switch to an alternative data source in a timely manner. Second, the monitors in place did not trigger alerts because the issue resolved before their thresholds were reached.
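The failure mode described above (a caller waiting up to 60 seconds on an unavailable cache instead of falling back) can be illustrated with a minimal sketch. This is not the Agent-graph implementation; the function name `fetch_with_fallback`, the callables `cache_get` and `source_get`, and the 0.5 second default are all assumptions made for illustration.

```python
import concurrent.futures
import time

# Shared worker pool. A cache call that outlives its deadline keeps running
# in the background, but the caller is released once the deadline passes.
_POOL = concurrent.futures.ThreadPoolExecutor(max_workers=8)


def fetch_with_fallback(key, cache_get, source_get, timeout_s=0.5):
    """Wait at most timeout_s for the cache, then use the fallback source.

    cache_get and source_get are hypothetical callables that take a key.
    Returns (value, origin), where origin is "cache" or "source".
    """
    future = _POOL.submit(cache_get, key)
    try:
        value = future.result(timeout=timeout_s)
        if value is not None:
            return value, "cache"
        # A None result is treated as a cache miss: fall through.
    except concurrent.futures.TimeoutError:
        # Cache too slow (for example, mid-reboot): stop waiting on it.
        future.cancel()
    except Exception:
        # Cache unreachable or erroring: treat it the same as a miss.
        pass
    return source_get(key), "source"
```

With a short deadline like this, a cache outage costs each request a fraction of a second plus one read from the slower source, rather than a wait long enough to surface as a 503. The right deadline depends on the normal cache latency, which the report does not state.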
Resolution
The system recovered automatically once the memory-caching system came back online. We identified the reboot of this system as the cause of the delays and confirmed that the issue was self-resolving, so no manual intervention was needed to restore service.
Remediation Items
- Reduced timeout for user cache retrieval.
- Consider performing chaos testing to simulate such failures in a controlled environment.
- Review and adjust alert thresholds to ensure quicker detection and response time.
- Reach out to AWS to investigate the unexpected reboot of the memory-caching system to prevent similar future occurrences.
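
The alert-threshold item can be illustrated with a small sketch of a monitor that fires only after the error rate stays above a limit for a sustained number of minutes. The actual monitor configuration is not given in this report, so the function `would_alert`, the 10% error-rate limit, and the 15 and 5 minute windows are assumptions chosen purely for illustration. Only the 11-minute incident duration (15:46 UTC to 15:57 UTC) comes from the report.

```python
def would_alert(samples, threshold, sustained):
    """Return True if `samples` exceeds `threshold` for `sustained`
    consecutive entries (one entry per minute in this sketch)."""
    run = 0
    for rate in samples:
        # Extend the current breach streak, or reset it on a healthy minute.
        run = run + 1 if rate > threshold else 0
        if run >= sustained:
            return True
    return False


# Hypothetical per-minute error rates: healthy, 11 bad minutes, healthy again.
incident = [0.0] * 5 + [0.4] * 11 + [0.0] * 5

# Assumed 15-minute sustain window: the 11-minute incident never fires.
slow_monitor = would_alert(incident, threshold=0.1, sustained=15)

# Assumed 5-minute sustain window: the same incident is detected.
fast_monitor = would_alert(incident, threshold=0.1, sustained=5)
```

Shorter windows detect brief incidents sooner but are more prone to false alarms, so any adjustment is a trade-off rather than a fixed rule.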
FOR MORE INFORMATION
For current system status information about your Zendesk, check out our system status page. The summary of our post-mortem investigation is usually posted here a few days after the incident has ended. If you have additional questions about this incident, contact Zendesk customer support.
Jessica G.
Post-mortem published August 19, 2024.